E-Commerce Search Engine SaaS

A configurable, multi-tenant search engine for cross-border e-commerce, built for Shoplazza independent sites.

Features

Multi-language Support: Chinese, English, Russian, Arabic, Spanish, and Japanese, with automatic translation
Semantic Search: Text and image embeddings using BGE-M3 and CN-CLIP models
Hybrid Ranking: Combines BM25 text relevance with semantic similarity
Boolean Operators: Supports AND, OR, RANK, ANDNOT with proper precedence
Configurable: Customer-specific schemas, analyzers, and ranking expressions
Multi-tenant: Each customer has isolated configuration and extension tables
RESTful API: FastAPI-based service with comprehensive endpoints

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Start Elasticsearch

# Using Docker
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
  elasticsearch:8.11.0

3. Ingest Customer1 Test Data

cd data/customer1
python ingest_customer1.py \
  --csv goods_with_pic.5years_congku.csv.shuf.1w \
  --limit 1000 \
  --recreate-index \
  --es-host http://localhost:9200

4. Start API Service

python -m api.app \
  --host 0.0.0.0 \
  --port 8000 \
  --customer customer1 \
  --es-host http://localhost:9200 \
  --reload

5. Test Search

# Simple search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "<query term>", "size": 10}'

# Search with filters
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{
    "query": "toy",
    "size": 10,
    "filters": {"categoryName_keyword": "<category name>"}
  }'

# Boolean search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "<term> AND (<term> OR doll)", "size": 10}'

# Image search
curl -X POST http://localhost:8000/search/image \
  -H "Content-Type: application/json" \
  -d '{
    "image_url": "https://oss.essa.cn/example.jpg",
    "size": 10
  }'
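
The same requests can be built from Python. The following is a minimal sketch using only the standard library, assuming the API service from step 4 is listening on localhost:8000. `build_search_request` is a hypothetical helper written for this example and is not part of the repository.

```python
import json
import urllib.request


def build_search_request(base_url, query, size=10, filters=None):
    """Build (but do not send) a POST request for the /search/ endpoint."""
    payload = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search/",
        # ensure_ascii=False keeps non-ASCII query terms as UTF-8 text.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` while the service is running and decode the JSON response body.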

Project Structure

SearchEngine/
   config/                      # Configuration system
      field_types.py          # Field type definitions
      config_loader.py        # Config loader & validator
      schema/                 # Customer configurations
          customer1_config.yaml
   indexer/                    # Data ingestion
      mapping_generator.py   # ES mapping generator
      data_transformer.py    # Data transformation
      bulk_indexer.py        # Bulk indexing
   query/                      # Query processing
      query_parser.py        # Main query parser
      language_detector.py   # Language detection
      translator.py          # Translation service
      query_rewriter.py      # Query rewriting
   search/                     # Search execution
      searcher.py            # Main searcher
      boolean_parser.py      # Boolean expression parser
      es_query_builder.py    # ES DSL builder
      ranking_engine.py      # Ranking engine
   embeddings/                 # Embedding encoders
      text_encoder.py        # BGE-M3 text encoder
      image_encoder.py       # CN-CLIP image encoder
   utils/                      # Utilities
      db_connector.py        # MySQL connector
      es_client.py           # ES client wrapper
      cache.py               # Embedding cache
   api/                        # REST API
      app.py                 # FastAPI application
      models.py              # Request/response models
      routes/                # API routes
          search.py          # Search endpoints
          admin.py           # Admin endpoints

Configuration

Customer Configuration Example

See config/schema/customer1_config.yaml for a complete example.

Key sections:

fields: Define data fields, types, analyzers, and embedding configuration
indexes: Define query domains (default, title, brand, category)
query_config: Multi-language, translation, embedding settings
ranking: Ranking expression (e.g., bm25() + 0.2*text_embedding_relevance())
spu_config: SPU aggregation settings
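
A quick way to catch an incomplete customer configuration is to check for these top-level sections after loading the YAML into a dictionary. This is a sketch only: treating all five sections as required is an assumption, and `missing_sections` is a hypothetical helper, not the validator in config/config_loader.py.

```python
# Section names are taken from the list above; whether each one is
# mandatory in the real loader is an assumption.
REQUIRED_SECTIONS = ("fields", "indexes", "query_config", "ranking", "spu_config")


def missing_sections(config):
    """Return the expected top-level sections that are absent from config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Example: a config dict that only defines two sections.
partial = {"fields": [], "ranking": {"expression": "bm25()"}}
print(missing_sections(partial))
```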

Field Types

TEXT: Analyzed text with configurable analyzer
KEYWORD: Exact match keyword
TEXT_EMBEDDING: 1024-dim text vector (BGE-M3)
IMAGE_EMBEDDING: 1024-dim image vector (CN-CLIP)
INT/LONG/FLOAT/DOUBLE: Numeric fields
DATE: Date/timestamp fields
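
To illustrate how these field types could translate into Elasticsearch mappings, here is a hedged sketch. The Elasticsearch type names (`text`, `keyword`, `dense_vector`, `integer`, `long`, `float`, `double`, `date`) are standard, but the function itself is hypothetical and may differ from what indexer/mapping_generator.py actually produces.

```python
def es_mapping_for(field_type, analyzer=None):
    """Translate a field type from the list above into an ES mapping fragment."""
    numeric = {"INT": "integer", "LONG": "long", "FLOAT": "float", "DOUBLE": "double"}
    if field_type == "TEXT":
        mapping = {"type": "text"}
        if analyzer:
            mapping["analyzer"] = analyzer
        return mapping
    if field_type == "KEYWORD":
        return {"type": "keyword"}
    if field_type in ("TEXT_EMBEDDING", "IMAGE_EMBEDDING"):
        # Both embedding types are documented as 1024-dimensional.
        return {"type": "dense_vector", "dims": 1024}
    if field_type in numeric:
        return {"type": numeric[field_type]}
    if field_type == "DATE":
        return {"type": "date"}
    raise ValueError(f"unknown field type: {field_type!r}")
```

For example, `es_mapping_for("TEXT", analyzer="english")` yields a `text` mapping with the `english` analyzer attached.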

Analyzers

chinese_ecommerce: Ansj Chinese analyzer
english, arabic, spanish, russian, japanese: Language-specific analyzers

API Endpoints

Search

POST /search/ - Text search with filters and facets
POST /search/image - Image similarity search
GET /search/suggestions - Search suggestions for autocomplete (endpoint stub; implementation pending)
GET /search/instant - Instant search (endpoint stub; implementation pending)
GET /search/{doc_id} - Get document by ID

Admin

GET /admin/health - Health check
GET /admin/config - Get configuration
GET /admin/stats - Index statistics
GET /admin/rewrite-rules - Get query rewrite rules
POST /admin/rewrite-rules - Update rewrite rules

New Features (v3.0)

✅ Structured Filters: Separated filters (exact match) and range_filters (numeric ranges)
✅ Faceted Search: Simplified facet configuration with standardized response format
✅ Removed Hardcoded Logic: No more hardcoded price_ranges
✅ Search Suggestions: Framework endpoints for autocomplete (implementation pending)

See API_DOCUMENTATION.md for detailed API documentation.
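
As an illustration of the filters / range_filters split, the sketch below converts both into Elasticsearch filter clauses. The `term`, `terms`, and `range` queries are standard Elasticsearch DSL. The shape assumed for range_filters (a field name mapped to bounds such as `gte` and `lte`) is an assumption; API_DOCUMENTATION.md has the authoritative request format.

```python
def build_filter_clauses(filters=None, range_filters=None):
    """Convert exact-match and range filters into ES bool filter clauses."""
    clauses = []
    for field, value in (filters or {}).items():
        if isinstance(value, (list, tuple)):
            # A list of values matches documents containing any of them.
            clauses.append({"terms": {field: list(value)}})
        else:
            clauses.append({"term": {field: value}})
    for field, bounds in (range_filters or {}).items():
        # bounds is assumed to look like {"gte": 10, "lte": 100}.
        clauses.append({"range": {field: dict(bounds)}})
    return clauses
```

The returned list can be placed under `{"bool": {"filter": [...]}}` in a query body, where the clauses restrict results without affecting relevance scores.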

Advanced Features

Boolean Operators

Operator precedence (high to low):

() - Parentheses
ANDNOT - Exclusion
AND - All must match
OR - Any must match
RANK - Ranking boost

Example: laptop AND (gaming OR professional) ANDNOT cheap
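
The precedence table can be made concrete with a small precedence-climbing parser. This is a sketch under stated assumptions, not the code in search/boolean_parser.py: operators are uppercase and left-associative, terms are single whitespace-delimited tokens, and the output is a nested tuple `(operator, left, right)`.

```python
import re

# Binding strength, lowest to highest, following the table above.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def parse(expr):
    """Parse a Boolean query into nested (operator, left, right) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", expr)
    pos = 0

    def parse_primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_binary(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok in PRECEDENCE:
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    def parse_binary(min_prec):
        nonlocal pos
        left = parse_primary()
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            # Requiring a strictly higher level on the right side makes
            # operators of equal precedence left-associative.
            right = parse_binary(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse_binary(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("laptop AND (gaming OR professional) ANDNOT cheap"))
```

Because ANDNOT binds more tightly than AND, the example groups as `laptop AND ((gaming OR professional) ANDNOT cheap)`. Multi-word terms without an operator between them are rejected by this sketch.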

Query Rewriting

Configure brand/category mappings:

rewrite_dictionary:
  "<brand term>": "brand:<brand term> OR name:<brand term>"
  "<category term>": "category:<category term>"

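A rewrite dictionary of this kind can be applied with a plain lookup. The sketch below assumes a rule fires only when the whole (trimmed) query equals a key; the real query/query_rewriter.py may match differently. The function name and the sample terms are hypothetical.

```python
def rewrite_query(query, rewrite_dictionary):
    """Return the mapped expression if the whole query matches a key."""
    # Assumption: exact match on the trimmed query, no partial matching.
    return rewrite_dictionary.get(query.strip(), query)


# Hypothetical rules in the same shape as the YAML above.
rules = {
    "acme": "brand:acme OR name:acme",
    "toys": "category:toys",
}
print(rewrite_query(" acme ", rules))
print(rewrite_query("red shoes", rules))
```

Queries without a matching rule pass through unchanged.
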
Ranking Expressions

Configurable ranking with functions:

bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
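
To show how such an expression combines its terms, here is a worked sketch of the first three. It assumes `text_embedding_relevance()` is the cosine similarity between query and document vectors, which this README does not confirm. `timeliness(end_time)` is left out because its definition is not given here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_score(bm25, query_vec, doc_vec, general_score=0.0,
                 embedding_weight=0.2, general_weight=2.0):
    """Evaluate bm25() + 0.2*text_embedding_relevance() + general_score*2."""
    relevance = cosine_similarity(query_vec, doc_vec)  # assumed definition
    return bm25 + embedding_weight * relevance + general_weight * general_score


# Identical vectors give a relevance of 1.0, so this is 3.0 + 0.2 + 1.0.
print(hybrid_score(3.0, [1.0, 0.0], [1.0, 0.0], general_score=0.5))
```

The weights show the intent of the expression: BM25 carries most of the score, semantic similarity adds a bounded adjustment, and `general_score` acts as a per-document boost.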

SPU Aggregation

Enable SPU collapse so that each SPU appears once in the results; inner_hits_size caps how many of its SKUs are returned with it:

spu_config:
  enabled: true
  spu_field: "spu_id"
  inner_hits_size: 3
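
In Elasticsearch terms, this kind of setting maps naturally onto the `collapse` search option with `inner_hits`. The sketch below shows one way the config could be applied to a query body. The helper and the inner hits name `spu_skus` are hypothetical; whether search/es_query_builder.py builds the request this way is an assumption.

```python
def apply_spu_collapse(body, spu_config):
    """Return a copy of an ES search body with field collapsing applied."""
    if not spu_config.get("enabled"):
        return body
    collapsed = dict(body)
    collapsed["collapse"] = {
        "field": spu_config["spu_field"],
        "inner_hits": {
            "name": "spu_skus",  # hypothetical label for the nested SKUs
            "size": spu_config.get("inner_hits_size", 3),
        },
    }
    return collapsed


config = {"enabled": True, "spu_field": "spu_id", "inner_hits_size": 3}
print(apply_spu_collapse({"query": {"match_all": {}}}, config))
```

Elasticsearch requires the collapse field to be a single-valued keyword or numeric field with doc values enabled, so `spu_id` would need to be mapped accordingly.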

Performance Tips

Embedding Cache: Enable caching to avoid recomputing embeddings
Batch Size: Adjust batch size based on memory (default: 100 for transform, 500 for indexing)
ES Sharding: Configure shards based on cluster size
GPU Acceleration: Use CUDA for faster embedding generation
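
The embedding cache tip can be illustrated with a small in-memory wrapper around any encoder function. This is a sketch, not the implementation in utils/cache.py: the class name, the SHA-256 key scheme, and the unbounded dictionary store are all assumptions.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache keyed by a hash of the input text."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._store = {}
        self.misses = 0  # number of times the encoder was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode_fn(text)
        return self._store[key]


# Stand-in encoder for demonstration; a real one would call the model.
cache = EmbeddingCache(lambda text: [float(len(text))])
cache.get("toy")
cache.get("toy")
print(cache.misses)
```

Repeated queries then skip the model call entirely, which matters most when encoding runs on CPU.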

Development

Run Tests

pytest tests/

Format Code

black .

Type Checking

mypy .
