E-Commerce Search Engine SaaS
A configurable, multi-tenant search engine for cross-border e-commerce, built for Shoplazza independent sites.
Features
- Multi-language Support: Chinese, English, Russian, Arabic, Spanish, Japanese with automatic translation
- Semantic Search: Text and image embeddings using BGE-M3 and CN-CLIP models
- Hybrid Ranking: Combines BM25 text relevance with semantic similarity
- Boolean Operators: Supports AND, OR, RANK, ANDNOT with proper precedence
- Configurable: Customer-specific schemas, analyzers, and ranking expressions
- Multi-tenant: Each customer has isolated configuration and extension tables
- RESTful API: FastAPI-based service with comprehensive endpoints
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Start Elasticsearch
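# Using Docker. Security is disabled for local development only, because
# Elasticsearch 8.x otherwise requires HTTPS and authentication on port 9200.
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
  elasticsearch:8.11.0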
3. Ingest Customer1 Test Data
cd data/customer1
python ingest_customer1.py \
--csv goods_with_pic.5years_congku.csv.shuf.1w \
--limit 1000 \
--recreate-index \
--es-host http://localhost:9200
4. Start API Service
python -m api.app \
--host 0.0.0.0 \
--port 8000 \
--customer customer1 \
--es-host http://localhost:9200 \
--reload
5. Test Search
# Simple search
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "��", "size": 10}'
# Search with filters
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{
"query": "toy",
"size": 10,
"filters": {"categoryName_keyword": "�w"}
}'
# Boolean search
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "�� AND ( OR doll)", "size": 10}'
# Image search
curl -X POST http://localhost:8000/search/image \
-H "Content-Type: application/json" \
-d '{
"image_url": "https://oss.essa.cn/example.jpg",
"size": 10
}'
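The same text search request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_search_payload` and `post_search` are hypothetical and not part of this project, and the response is returned as decoded JSON without assuming its structure.

```python
import json
import urllib.request


def build_search_payload(query, size=10, filters=None):
    """Build the JSON body used by the curl examples above."""
    payload = {"query": query, "size": size}
    if filters:
        # Exact-match filters, keyed by field name.
        payload["filters"] = filters
    return payload


def post_search(payload, base_url="http://localhost:8000"):
    """POST the payload to /search/ and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API service from step 4 running, `post_search(build_search_payload("toy"))` issues the equivalent of the simple search above.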
Project Structure
SearchEngine/
    config/                       # Configuration system
        field_types.py            # Field type definitions
        config_loader.py          # Config loader & validator
        schema/                   # Customer configurations
            customer1_config.yaml
    indexer/                      # Data ingestion
        mapping_generator.py      # ES mapping generator
        data_transformer.py       # Data transformation
        bulk_indexer.py           # Bulk indexing
    query/                        # Query processing
        query_parser.py           # Main query parser
        language_detector.py      # Language detection
        translator.py             # Translation service
        query_rewriter.py         # Query rewriting
    search/                       # Search execution
        searcher.py               # Main searcher
        boolean_parser.py         # Boolean expression parser
        es_query_builder.py       # ES DSL builder
        ranking_engine.py         # Ranking engine
    embeddings/                   # Embedding encoders
        text_encoder.py           # BGE-M3 text encoder
        image_encoder.py          # CN-CLIP image encoder
    utils/                        # Utilities
        db_connector.py           # MySQL connector
        es_client.py              # ES client wrapper
        cache.py                  # Embedding cache
    api/                          # REST API
        app.py                    # FastAPI application
        models.py                 # Request/response models
        routes/                   # API routes
            search.py             # Search endpoints
            admin.py              # Admin endpoints
Configuration
Customer Configuration Example
See config/schema/customer1_config.yaml for a complete example.
Key sections:
- fields: Define data fields, types, analyzers, and embedding configuration
- indexes: Define query domains (default, title, brand, category)
- query_config: Multi-language, translation, embedding settings
- ranking: Ranking expression (e.g., bm25() + 0.2*text_embedding_relevance())
- spu_config: SPU aggregation settings
Field Types
- TEXT: Analyzed text with configurable analyzer
- KEYWORD: Exact match keyword
- TEXT_EMBEDDING: 1024-dim text vector (BGE-M3)
- IMAGE_EMBEDDING: 1024-dim image vector (CN-CLIP)
- INT/LONG/FLOAT/DOUBLE: Numeric fields
- DATE: Date/timestamp fields
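As an illustration of how these field types could translate into Elasticsearch mappings, here is a rough sketch. The function `es_mapping_for` is hypothetical, and the mappings actually produced by indexer/mapping_generator.py may differ (for example in vector similarity or index options). The Elasticsearch type names used (`text`, `keyword`, `dense_vector`, `integer`, `long`, `float`, `double`, `date`) are standard.

```python
def es_mapping_for(field_type, analyzer=None):
    """Return an Elasticsearch mapping fragment for one field type."""
    simple = {
        "KEYWORD": "keyword",
        "INT": "integer",
        "LONG": "long",
        "FLOAT": "float",
        "DOUBLE": "double",
        "DATE": "date",
    }
    if field_type == "TEXT":
        mapping = {"type": "text"}
        if analyzer:
            mapping["analyzer"] = analyzer
        return mapping
    if field_type in ("TEXT_EMBEDDING", "IMAGE_EMBEDDING"):
        # Both embedding types are listed above as 1024-dimensional.
        return {"type": "dense_vector", "dims": 1024}
    if field_type in simple:
        return {"type": simple[field_type]}
    raise ValueError(f"unknown field type: {field_type}")
```

For example, `es_mapping_for("TEXT", "english")` yields a `text` field that uses the `english` analyzer.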
Analyzers
- chinese_ecommerce: Ansj Chinese analyzer
- english, arabic, spanish, russian, japanese: Language-specific analyzers
API Endpoints
Search
- POST /search/ - Text search with filters and facets
- POST /search/image - Image similarity search
- GET /search/suggestions - Search suggestions (autocomplete, framework only)
- GET /search/instant - Instant search (framework only)
- GET /search/{doc_id} - Get document by ID
Admin
- GET /admin/health - Health check
- GET /admin/config - Get configuration
- GET /admin/stats - Index statistics
- GET /admin/rewrite-rules - Get query rewrite rules
- POST /admin/rewrite-rules - Update rewrite rules
New Features (v3.0)
- ✅ Structured Filters: Separated filters (exact match) and range_filters (numeric ranges)
- ✅ Faceted Search: Simplified facet configuration with standardized response format
- ✅ Removed Hardcoded Logic: No more hardcoded price_ranges
- ✅ Search Suggestions: Framework endpoints for autocomplete (implementation pending)
See API_DOCUMENTATION.md for detailed API documentation.
Advanced Features
Boolean Operators
Operator precedence (high to low):
1. () - Parentheses
2. ANDNOT - Exclusion
3. AND - All must match
4. OR - Any must match
5. RANK - Ranking boost
Example: laptop AND (gaming OR professional) ANDNOT cheap
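To show how this precedence order groups an expression, here is a small precedence-climbing parser sketch. It is not the project's search/boolean_parser.py; the `tokenize` and `parse` functions and the nested-tuple output format are assumptions made for illustration, and adjacent terms without an operator between them are rejected rather than combined.

```python
import re

# Operators from lowest to highest precedence, following the list above.
OPERATORS = ["RANK", "OR", "AND", "ANDNOT"]


def tokenize(text):
    """Split a query into parentheses and whitespace-separated words."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(text):
    """Parse a boolean query into nested (operator, left, right) tuples."""
    tokens = tokenize(text)
    position = 0

    def parse_level(level):
        nonlocal position
        if level == len(OPERATORS):
            return parse_primary()
        # Operands of this operator are expressions of higher precedence.
        node = parse_level(level + 1)
        while position < len(tokens) and tokens[position] == OPERATORS[level]:
            position += 1
            node = (OPERATORS[level], node, parse_level(level + 1))
        return node

    def parse_primary():
        nonlocal position
        if position >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[position]
        position += 1
        if token == "(":
            node = parse_level(0)
            if position >= len(tokens) or tokens[position] != ")":
                raise ValueError("missing closing parenthesis")
            position += 1
            return node
        if token == ")" or token in OPERATORS:
            raise ValueError(f"unexpected token: {token}")
        return token

    tree = parse_level(0)
    if position != len(tokens):
        raise ValueError(f"unexpected token: {tokens[position]}")
    return tree
```

Under this sketch the example query groups as laptop AND ((gaming OR professional) ANDNOT cheap), because ANDNOT binds more tightly than AND.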
Query Rewriting
Configure brand/category mappings:
rewrite_dictionary:
  "��": "brand:�� OR name:��"
  "�w": "category:�w"
Ranking Expressions
Configurable ranking with functions:
bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
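To make the arithmetic of this expression concrete, the sketch below evaluates it for a single document. It assumes each function has already been reduced to a number; how bm25(), text_embedding_relevance(), and timeliness() are computed is not specified here, and the input values are made up for illustration.

```python
def combined_score(bm25, text_similarity, general_score, timeliness):
    """Evaluate bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)."""
    return bm25 + 0.2 * text_similarity + general_score * 2 + timeliness


# Hypothetical per-document inputs (not taken from real data).
score = combined_score(
    bm25=12.0,            # lexical relevance
    text_similarity=0.8,  # semantic similarity, weighted by 0.2
    general_score=1.5,    # document-level quality field, weighted by 2
    timeliness=0.5,       # value of timeliness(end_time), treated as given
)
print(round(score, 2))
```

With these inputs the terms contribute 12.0, 0.16, 3.0, and 0.5 respectively, so the lexical score dominates unless the weights are changed.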
SPU Aggregation
Enable SPU collapse to show one SKU per SPU:
spu_config:
  enabled: true
  spu_field: "spu_id"
  inner_hits_size: 3
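These settings correspond closely to Elasticsearch's field collapsing feature. The sketch below shows one way the configuration could be turned into the `collapse` section of a search request; the function `build_collapse` and the inner-hits name `skus` are hypothetical, and the project's own query builder may construct this differently.

```python
def build_collapse(spu_config):
    """Return the `collapse` section of an ES search body, or None if disabled."""
    if not spu_config.get("enabled"):
        return None
    return {
        # One top hit is returned per distinct value of this field.
        "field": spu_config["spu_field"],
        "inner_hits": {
            "name": "skus",  # hypothetical label for the grouped SKUs
            "size": spu_config["inner_hits_size"],
        },
    }


config = {"enabled": True, "spu_field": "spu_id", "inner_hits_size": 3}
search_body = {"query": {"match_all": {}}, "collapse": build_collapse(config)}
```

Here each SPU contributes one top-level hit, with up to three of its SKUs attached as inner hits.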
Performance Tips
- Embedding Cache: Enable caching to avoid recomputing embeddings
- Batch Size: Adjust batch size based on memory (default: 100 for transform, 500 for indexing)
- ES Sharding: Configure shards based on cluster size
- GPU Acceleration: Use CUDA for faster embedding generation
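The embedding cache tip can be illustrated with a minimal in-memory sketch. The class `EmbeddingCache` is hypothetical and is not the implementation in utils/cache.py; it assumes the encoder is a plain function from text to a vector and that an unbounded dictionary is acceptable.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings by a hash of the input text."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get_or_compute(self, text, encode):
        """Return the cached embedding, calling `encode` only on a miss."""
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = encode(text)
        return self._store[key]


def fake_encode(text):
    # Stand-in for a real model call such as a BGE-M3 encoder.
    return [float(len(text))]


cache = EmbeddingCache()
first = cache.get_or_compute("toy", fake_encode)
second = cache.get_or_compute("toy", fake_encode)
```

The second lookup is served from the cache, so the encoder runs once for repeated queries. A production cache would likely also need a size limit or expiry.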
Development
Run Tests
pytest tests/
Format Code
black .
Type Checking
mypy .
License
Proprietary - All Rights Reserved