E-Commerce Search Engine SaaS

A configurable, multi-tenant search engine for cross-border e-commerce, built for Shoplazza independent sites.

Features

  • Multi-language Support: Chinese, English, Russian, Arabic, Spanish, and Japanese, with automatic translation
  • Semantic Search: Text and image embeddings using BGE-M3 and CN-CLIP models
  • Hybrid Ranking: Combines BM25 text relevance with semantic similarity
  • Boolean Operators: Supports AND, OR, RANK, and ANDNOT with a defined operator precedence
  • Configurable: Customer-specific schemas, analyzers, and ranking expressions
  • Multi-tenant: Each customer has isolated configuration and extension tables
  • RESTful API: FastAPI-based service with comprehensive endpoints

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Start Elasticsearch

# Using Docker
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
  elasticsearch:8.11.0

3. Ingest Customer1 Test Data

cd data/customer1
python ingest_customer1.py \
  --csv goods_with_pic.5years_congku.csv.shuf.1w \
  --limit 1000 \
  --recreate-index \
  --es-host http://localhost:9200

4. Start API Service

python -m api.app \
  --host 0.0.0.0 \
  --port 8000 \
  --customer customer1 \
  --es-host http://localhost:9200 \
  --reload

5. Test the API

# Simple search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "toy", "size": 10}'

# Search with filters
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{
    "query": "toy",
    "size": 10,
    "filters": {"categoryName_keyword": "toys"}
  }'

# Boolean search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "toy AND (plush OR doll)", "size": 10}'

# Image search
curl -X POST http://localhost:8000/search/image \
  -H "Content-Type: application/json" \
  -d '{
    "image_url": "https://oss.essa.cn/example.jpg",
    "size": 10
  }'
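
The same text search call can be made from Python. The sketch below uses only the standard library and assumes the service is listening on localhost:8000 as in the start command above; the helper names build_search_request and search are hypothetical and not part of this repository, and the response is assumed to be JSON.

```python
import json
import urllib.request

# Assumption: host and port match the "Start API Service" step.
BASE_URL = "http://localhost:8000"


def build_search_request(query, size=10, filters=None):
    """Build (but do not send) a POST request for the /search/ endpoint."""
    payload = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return urllib.request.Request(
        f"{BASE_URL}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, size=10, filters=None):
    """Send the request and decode the JSON response body."""
    req = build_search_request(query, size, filters)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```

Building the request separately from sending it makes the payload easy to inspect or unit-test without a running service.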

Project Structure

SearchEngine/
├── config/                     # Configuration system
│   ├── field_types.py          # Field type definitions
│   ├── config_loader.py        # Config loader & validator
│   └── schema/                 # Customer configurations
│       └── customer1_config.yaml
├── indexer/                    # Data ingestion
│   ├── mapping_generator.py    # ES mapping generator
│   ├── data_transformer.py     # Data transformation
│   └── bulk_indexer.py         # Bulk indexing
├── query/                      # Query processing
│   ├── query_parser.py         # Main query parser
│   ├── language_detector.py    # Language detection
│   ├── translator.py           # Translation service
│   └── query_rewriter.py       # Query rewriting
├── search/                     # Search execution
│   ├── searcher.py             # Main searcher
│   ├── boolean_parser.py       # Boolean expression parser
│   ├── es_query_builder.py     # ES DSL builder
│   └── ranking_engine.py       # Ranking engine
├── embeddings/                 # Embedding encoders
│   ├── text_encoder.py         # BGE-M3 text encoder
│   └── image_encoder.py        # CN-CLIP image encoder
├── utils/                      # Utilities
│   ├── db_connector.py         # MySQL connector
│   ├── es_client.py            # ES client wrapper
│   └── cache.py                # Embedding cache
└── api/                        # REST API
    ├── app.py                  # FastAPI application
    ├── models.py               # Request/response models
    └── routes/                 # API routes
        ├── search.py           # Search endpoints
        └── admin.py            # Admin endpoints

Configuration

Customer Configuration Example

See config/schema/customer1_config.yaml for a complete example.

Key sections:

  • fields: Define data fields, types, analyzers, and embedding configuration
  • indexes: Define query domains (default, title, brand, category)
  • query_config: Multi-language, translation, embedding settings
  • ranking: Ranking expression (e.g., bm25() + 0.2*text_embedding_relevance())
  • spu_config: SPU aggregation settings
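
As a quick sanity check before starting the service, a loaded configuration can be tested for the five sections listed above. This is a minimal sketch, not the logic in config/config_loader.py: the function name is hypothetical, the config is assumed to be already parsed into a dict, and treating every section as required is an assumption.

```python
# Section names taken from the "Key sections" list; requiring all of them
# is an assumption made for this sketch.
REQUIRED_SECTIONS = ("fields", "indexes", "query_config", "ranking", "spu_config")


def missing_sections(config):
    """Return the names of expected top-level sections absent from config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# Hypothetical, already-parsed configuration with one section left out.
example = {
    "fields": [],
    "indexes": [],
    "query_config": {},
    "ranking": {"expression": "bm25() + 0.2*text_embedding_relevance()"},
}
print(missing_sections(example))
```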

Field Types

  • TEXT: Analyzed text with configurable analyzer
  • KEYWORD: Exact match keyword
  • TEXT_EMBEDDING: 1024-dim text vector (BGE-M3)
  • IMAGE_EMBEDDING: 1024-dim image vector (CN-CLIP)
  • INT/LONG/FLOAT/DOUBLE: Numeric fields
  • DATE: Date/timestamp fields
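
To make the field types concrete, the sketch below maps each one to an Elasticsearch mapping fragment. It is an illustration under assumptions, not the code in indexer/mapping_generator.py: the function name is hypothetical, and the real generator may set additional mapping options (for example vector similarity or index settings).

```python
def es_mapping_for(field_type, analyzer=None):
    """Return an Elasticsearch mapping fragment for one README field type."""
    numeric = {"INT": "integer", "LONG": "long", "FLOAT": "float", "DOUBLE": "double"}
    if field_type == "TEXT":
        mapping = {"type": "text"}
        if analyzer:
            mapping["analyzer"] = analyzer  # e.g. "chinese_ecommerce"
        return mapping
    if field_type == "KEYWORD":
        return {"type": "keyword"}
    if field_type in ("TEXT_EMBEDDING", "IMAGE_EMBEDDING"):
        # Both embedding types are listed as 1024-dimensional vectors.
        return {"type": "dense_vector", "dims": 1024}
    if field_type in numeric:
        return {"type": numeric[field_type]}
    if field_type == "DATE":
        return {"type": "date"}
    raise ValueError(f"unknown field type: {field_type}")
```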

Analyzers

  • chinese_ecommerce: Ansj Chinese analyzer
  • english, arabic, spanish, russian, japanese: Language-specific analyzers

API Endpoints

Search

  • POST /search/ - Text search with filters and facets
  • POST /search/image - Image similarity search
  • GET /search/suggestions - Search suggestions for autocomplete (endpoint scaffold only, implementation pending)
  • GET /search/instant - Instant search (endpoint scaffold only, implementation pending)
  • GET /search/{doc_id} - Get document by ID

Admin

  • GET /admin/health - Health check
  • GET /admin/config - Get configuration
  • GET /admin/stats - Index statistics
  • GET /admin/rewrite-rules - Get query rewrite rules
  • POST /admin/rewrite-rules - Update rewrite rules

New Features (v3.0)

  • Structured Filters: Exact-match conditions go in filters and numeric ranges go in range_filters, as separate request fields
  • Faceted Search: Simplified facet configuration with standardized response format
  • Removed Hardcoded Logic: No more hardcoded price_ranges
  • Search Suggestions: Framework endpoints for autocomplete (implementation pending)

See API_DOCUMENTATION.md for detailed API documentation.
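
One way to picture the split between filters and range_filters is how each could translate into Elasticsearch filter clauses. The sketch below is an assumption about the shape of both fields (a field-to-value dict for filters, a field-to-bounds dict using gte/lte for range_filters); the authoritative request format is in API_DOCUMENTATION.md, and the function name and the price field are hypothetical.

```python
def build_filter_clauses(filters=None, range_filters=None):
    """Translate exact-match and range filters into ES bool-filter clauses."""
    clauses = []
    for field, value in (filters or {}).items():
        if isinstance(value, list):
            # Several accepted values for one field become a terms clause.
            clauses.append({"terms": {field: value}})
        else:
            clauses.append({"term": {field: value}})
    for field, bounds in (range_filters or {}).items():
        # bounds is assumed to look like {"gte": 10, "lte": 50}.
        clauses.append({"range": {field: bounds}})
    return clauses


clauses = build_filter_clauses(
    filters={"categoryName_keyword": "toys"},
    range_filters={"price": {"gte": 10, "lte": 50}},
)
es_query = {"bool": {"filter": clauses}}
```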

Advanced Features

Boolean Operators

Operator precedence (high to low):

  1. () - Parentheses
  2. ANDNOT - Exclusion
  3. AND - All must match
  4. OR - Any must match
  5. RANK - Ranking boost

Example: laptop AND (gaming OR professional) ANDNOT cheap
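
The precedence table can be illustrated with a small precedence-climbing parser. This is a sketch and not the code in search/boolean_parser.py: it treats every operator as left-associative, returns nested tuples, and rejects adjacent terms without an operator, all of which are assumptions.

```python
import re

# Higher number binds tighter, following the precedence list above.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def parse(query):
    """Parse a boolean query into nested (operator, left, right) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def parse_primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_expr(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok in PRECEDENCE:
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    def parse_expr(min_prec):
        nonlocal pos
        left = parse_primary()
        while pos < len(tokens):
            op = tokens[pos]
            if op not in PRECEDENCE or PRECEDENCE[op] < min_prec:
                break
            pos += 1
            # Parsing the right side one level higher gives left associativity.
            right = parse_expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse_expr(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


tree = parse("laptop AND (gaming OR professional) ANDNOT cheap")
```

Because ANDNOT binds tighter than AND, the example groups as laptop AND ((gaming OR professional) ANDNOT cheap).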

Query Rewriting

Configure brand/category mappings:

rewrite_dictionary:
  "<brand>": "brand:<brand> OR name:<brand>"
  "<category>": "category:<category>"
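
A rewrite step of this kind can be sketched as a dictionary lookup. The version below assumes a rewrite fires only when the whole (trimmed) query equals a dictionary key; the actual matching rules in query/query_rewriter.py may differ, and the dictionary entries shown are hypothetical.

```python
def rewrite_query(query, rewrite_dictionary):
    """Return the mapped expression on an exact match, else the query unchanged."""
    return rewrite_dictionary.get(query.strip(), query)


# Hypothetical entries in the same shape as rewrite_dictionary above.
rules = {
    "acme": "brand:acme OR name:acme",
    "toys": "category:toys",
}
rewritten = rewrite_query(" toys ", rules)
untouched = rewrite_query("red toys", rules)
```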

Ranking Expressions

Configurable ranking with functions:

bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
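
As a worked example of the expression above, the sketch below combines four per-document signals with the same weights. How each signal is actually computed (in particular timeliness(end_time)) is not described here, so the function takes already-computed values; the function name and the sample numbers are hypothetical.

```python
def ranking_score(bm25, text_embedding_relevance, general_score, timeliness):
    """Combine precomputed signals with the weights from the expression above."""
    return bm25 + 0.2 * text_embedding_relevance + general_score * 2 + timeliness


# Hypothetical signal values for a single document.
score = ranking_score(
    bm25=12.0,
    text_embedding_relevance=0.5,
    general_score=1.5,
    timeliness=0.5,
)
# 12.0 + 0.2 * 0.5 + 1.5 * 2 + 0.5, which is about 15.6
```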

SPU Aggregation

Enable SPU collapse to show one SKU per SPU:

spu_config:
  enabled: true
  spu_field: "spu_id"
  inner_hits_size: 3
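
In Elasticsearch terms, this setting plausibly corresponds to field collapsing with inner hits. The sketch below shows how a search body could be extended from spu_config; it is an assumption about how the searcher applies the setting, and the function name and the inner-hits name "skus" are hypothetical.

```python
def apply_spu_collapse(body, spu_config):
    """Return a copy of an ES search body with SPU collapsing applied."""
    if not spu_config.get("enabled"):
        return body
    collapsed = dict(body)
    collapsed["collapse"] = {
        "field": spu_config["spu_field"],
        # inner_hits returns up to N SKUs grouped under each collapsed SPU.
        "inner_hits": {"name": "skus", "size": spu_config["inner_hits_size"]},
    }
    return collapsed


spu_config = {"enabled": True, "spu_field": "spu_id", "inner_hits_size": 3}
body = apply_spu_collapse({"query": {"match_all": {}}}, spu_config)
```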

Performance Tips

  1. Embedding Cache: Enable caching to avoid recomputing embeddings
  2. Batch Size: Adjust batch size based on memory (default: 100 for transform, 500 for indexing)
  3. ES Sharding: Configure shards based on cluster size
  4. GPU Acceleration: Use CUDA for faster embedding generation
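
The first tip can be sketched as a small in-memory cache wrapped around an encoder. This is an illustration only and not the implementation in utils/cache.py: the class, its interface, and the stand-in encoder are hypothetical, and a production cache would likely need persistence and eviction.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by a hash of the input text."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn  # any callable mapping text to a vector
        self._store = {}
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only call the (expensive) encoder for texts not seen before.
            self.misses += 1
            self._store[key] = self._encode_fn(text)
        return self._store[key]


def fake_encoder(text):
    """Stand-in for a real text encoder; returns a tiny deterministic vector."""
    return [float(len(text)), 0.0]


cache = EmbeddingCache(fake_encoder)
first = cache.get("toy")
second = cache.get("toy")  # served from the cache, encoder not called again
```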

Development

Run Tests

pytest tests/

Format Code

black .

Type Checking

mypy .

License

Proprietary - All Rights Reserved