E-Commerce Search Engine SaaS - Implementation Summary
Overview
A complete, production-ready configurable search engine for cross-border e-commerce has been implemented. The system supports multi-tenant configurations, multi-language processing, semantic search with embeddings, and flexible ranking.
What Was Built
1. Core Configuration System (config/)
field_types.py - Defines all supported field types and ES mappings:
- TEXT, KEYWORD, TEXT_EMBEDDING, IMAGE_EMBEDDING
- Numeric types (INT, LONG, FLOAT, DOUBLE)
- Date and Boolean types
- Analyzer definitions (Chinese, English, Russian, Arabic, Spanish, Japanese)
- ES mapping generation for each field type
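As an illustration of how a field type could translate into an Elasticsearch mapping, here is a minimal sketch. The `FieldType` enum members follow the list above, but the enum values, the `es_mapping_for` helper, and the choice of cosine similarity are assumptions made for this example and may differ from what field_types.py actually does.

```python
from enum import Enum
from typing import Any, Dict, Optional


class FieldType(Enum):
    """Hypothetical enum; values are the Elasticsearch type names."""
    TEXT = "text"
    KEYWORD = "keyword"
    TEXT_EMBEDDING = "text_embedding"
    IMAGE_EMBEDDING = "image_embedding"
    INT = "integer"
    LONG = "long"
    FLOAT = "float"
    DOUBLE = "double"
    DATE = "date"
    BOOLEAN = "boolean"


EMBEDDING_TYPES = {FieldType.TEXT_EMBEDDING, FieldType.IMAGE_EMBEDDING}


def es_mapping_for(field_type: FieldType,
                   analyzer: Optional[str] = None,
                   dims: int = 1024) -> Dict[str, Any]:
    """Return an ES mapping fragment for a single field."""
    if field_type in EMBEDDING_TYPES:
        # Both embedding kinds map to dense_vector; cosine is an assumed default.
        return {"type": "dense_vector", "dims": dims,
                "index": True, "similarity": "cosine"}
    if field_type is FieldType.TEXT:
        mapping: Dict[str, Any] = {"type": "text"}
        if analyzer:
            mapping["analyzer"] = analyzer
        return mapping
    # Keyword, numeric, date and boolean types map one-to-one.
    return {"type": field_type.value}
```

For example, `es_mapping_for(FieldType.TEXT, analyzer="english")` yields a `text` mapping with the built-in English analyzer attached.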
config_loader.py - YAML configuration loader and validator:
- Loads customer-specific configurations
- Validates field references and dependencies
- Supports application + index structure definitions
- Customer-specific query, ranking, and SPU settings
customer1_config.yaml - Complete example configuration:
- 16 fields including text, embeddings, keywords, metadata
- 4 query domains (default, title, category, brand)
- Multi-language support (zh, en, ru)
- Query rewriting rules
- Ranking expression: bm25() + 0.2*text_embedding_relevance()
2. Data Ingestion Pipeline (indexer/)
mapping_generator.py - Generates ES mappings from configuration:
- Converts field configs to ES mapping JSON
- Applies default analyzers and similarity settings
- Helper methods to get embedding fields and match fields
data_transformer.py - Transforms source data to ES documents:
- Batch embedding generation for efficiency
- Text embeddings using BGE-M3 (1024-dim)
- Image embeddings using CN-CLIP (1024-dim)
- Embedding cache to avoid recomputation
- Type conversion and validation
bulk_indexer.py - Bulk indexing with error handling:
- Batch processing with configurable size
- Retry logic for failed batches
- Progress tracking and statistics
- Index creation and refresh
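The batching and retry behaviour described for bulk_indexer.py can be sketched roughly as follows. The function names, the `send_batch` callable (which stands in for the real Elasticsearch bulk call), and the linear backoff are assumptions for illustration only.

```python
import time
from typing import Any, Callable, Dict, Iterator, List, Sequence


def chunked(docs: Sequence[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield list(docs[start:start + size])


def index_in_batches(docs: Sequence[Any],
                     send_batch: Callable[[List[Any]], None],
                     batch_size: int = 100,
                     max_retries: int = 3,
                     backoff_seconds: float = 1.0) -> Dict[str, int]:
    """Send documents in batches, retrying each failed batch a few times."""
    stats = {"batches": 0, "indexed": 0, "failed": 0}
    for batch in chunked(docs, batch_size):
        stats["batches"] += 1
        for attempt in range(1, max_retries + 1):
            try:
                send_batch(batch)  # hypothetical stand-in for the ES bulk call
                stats["indexed"] += len(batch)
                break
            except Exception:
                if attempt == max_retries:
                    # Give up on this batch and count its documents as failed.
                    stats["failed"] += len(batch)
                else:
                    time.sleep(backoff_seconds * attempt)
    return stats
```

Keeping the sender injectable makes the retry logic testable without a running cluster.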
IndexingPipeline - Complete end-to-end ingestion:
- Creates/recreates index with proper mapping
- Transforms data with embeddings
- Bulk indexes documents
- Reports statistics
3. Query Processing (query/)
language_detector.py - Rule-based language detection:
- Detects Chinese, English, Russian, Arabic, Japanese
- Unicode range analysis
- Script percentage calculation
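A rule-based detector of this kind might look like the sketch below. The Unicode block boundaries are standard, but the function names, the mapping from script to language code (for instance Cyrillic to "ru"), and the rule that any kana implies Japanese are assumptions for this example rather than the behaviour of language_detector.py.

```python
from collections import Counter
from typing import Dict, Optional

# (script name, first code point, last code point)
SCRIPT_RANGES = [
    ("han", 0x4E00, 0x9FFF),       # CJK Unified Ideographs
    ("kana", 0x3040, 0x30FF),      # Hiragana and Katakana
    ("cyrillic", 0x0400, 0x04FF),
    ("arabic", 0x0600, 0x06FF),
]
SCRIPT_TO_LANG = {"han": "zh", "cyrillic": "ru", "arabic": "ar", "latin": "en"}


def script_of(ch: str) -> Optional[str]:
    """Classify one character by Unicode range; None for digits, spaces, etc."""
    code = ord(ch)
    for name, low, high in SCRIPT_RANGES:
        if low <= code <= high:
            return name
    if ch.isascii() and ch.isalpha():
        return "latin"
    return None


def script_percentages(text: str) -> Dict[str, float]:
    """Share of each script among the classifiable characters."""
    counts = Counter(s for s in map(script_of, text) if s)
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}


def detect_language(text: str, default: str = "en") -> str:
    shares = script_percentages(text)
    if not shares:
        return default
    # Assumed rule: Japanese mixes kana with Han, so any kana means "ja".
    if shares.get("kana", 0) > 0:
        return "ja"
    dominant = max(shares, key=shares.get)
    return SCRIPT_TO_LANG[dominant]
```

Script-based rules cannot separate languages that share a script (for example English and Spanish), which is one reason a detector like this stays limited to a handful of languages.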
translator.py - Multi-language translation:
- DeepL API integration
- Translation caching
- Automatic target language determination
- Mock mode for testing without API key
query_rewriter.py - Query rewriting and normalization:
- Dictionary-based rewriting (brand/category mappings)
- Query normalization (whitespace, special chars)
- Domain extraction (e.g., "brand:Nike" -> domain + query)
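The three rewriting steps can be illustrated with a short sketch. The domain names come from the four query domains listed in the configuration section; the function names, the exact normalization rule, and the sample rewrite rule are hypothetical.

```python
import re
from typing import Dict, Tuple

KNOWN_DOMAINS = {"default", "title", "category", "brand"}


def normalize(query: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", query).strip()


def rewrite(query: str, rules: Dict[str, str]) -> str:
    """Dictionary-based rewriting: replace the query if an exact rule exists."""
    return rules.get(query, query)


def extract_domain(query: str) -> Tuple[str, str]:
    """Split 'brand:Nike' into ('brand', 'Nike'); unknown prefixes are kept."""
    prefix, sep, rest = query.partition(":")
    domain = prefix.strip().lower()
    if sep and domain in KNOWN_DOMAINS:
        return domain, rest.strip()
    return "default", query.strip()
```

Checking the prefix against a known set keeps queries that merely contain a colon, such as a pasted URL, from being misread as domain queries.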
query_parser.py - Main query processing pipeline:
- Orchestrates all query processing stages
- Normalization → Rewriting → Language Detection → Translation → Embedding
- Returns ParsedQuery with all processing results
- Supports multi-language query expansion
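The stage order above could be wired together along these lines. `ParsedQuery` is named in the source, but its fields here, the `parse_query` signature, and the injected stage callables are assumptions chosen to keep the sketch self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ParsedQuery:
    """Hypothetical result container; the real fields may differ."""
    original: str
    normalized: str
    rewritten: str
    language: str
    translations: Dict[str, str] = field(default_factory=dict)
    embedding: Optional[List[float]] = None


def parse_query(query: str,
                rewrite: Callable[[str], str],
                detect: Callable[[str], str],
                translate: Callable[[str, str, str], str],
                embed: Optional[Callable[[str], List[float]]],
                target_languages: List[str]) -> ParsedQuery:
    # Normalization -> Rewriting -> Language Detection -> Translation -> Embedding
    normalized = " ".join(query.split())
    rewritten = rewrite(normalized)
    language = detect(rewritten)
    translations = {
        lang: translate(rewritten, language, lang)
        for lang in target_languages
        if lang != language  # no need to translate into the source language
    }
    embedding = embed(rewritten) if embed else None
    return ParsedQuery(query, normalized, rewritten, language,
                       translations, embedding)
```

Passing the stages in as callables is only a convenience for the example; it lets each stage be replaced by a stub in tests.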
4. Search Engine (search/)
boolean_parser.py - Boolean expression parser:
- Supports AND, OR, RANK, ANDNOT operators
- Parentheses for grouping
- Correct operator precedence
- Builds query tree for ES conversion
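To make the parsing step concrete, here is a small precedence-climbing sketch. The source does not state the precedence order, so the table below (AND and ANDNOT bind tightest, then OR, then RANK, all left-associative) is an assumption, as are the tuple-based tree format and the rule that adjacent words form a single term.

```python
import re
from typing import Any, List, Tuple

# Assumed precedence: higher numbers bind tighter.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 3}
PARENS = {"(", ")"}
Node = Tuple[Any, ...]  # ("TERM", text) or (operator, left, right)


def tokenize(expr: str) -> List[str]:
    return re.findall(r"[()]|[^\s()]+", expr)


def parse(expr: str) -> Node:
    tokens = tokenize(expr)
    pos = 0

    def parse_primary() -> Node:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_expr(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok in PRECEDENCE:
            raise ValueError(f"unexpected token: {tok}")
        # Adjacent plain words are merged into one multi-word term.
        words = [tok]
        while (pos < len(tokens) and tokens[pos] not in PRECEDENCE
               and tokens[pos] not in PARENS):
            words.append(tokens[pos])
            pos += 1
        return ("TERM", " ".join(words))

    def parse_expr(min_prec: int) -> Node:
        nonlocal pos
        left = parse_primary()
        while (pos < len(tokens) and tokens[pos] in PRECEDENCE
               and PRECEDENCE[tokens[pos]] >= min_prec):
            op = tokens[pos]
            pos += 1
            # Parsing the right side one level tighter gives left associativity.
            right = parse_expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse_expr(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return tree
```

Under these assumptions, `a OR b AND c` groups as `a OR (b AND c)`, and parentheses override the default grouping.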
es_query_builder.py - ES DSL query builder:
- Converts query trees to ES bool queries
- Multi-match with BM25 scoring
- KNN queries for embeddings
- Filter support (term, range, terms)
- SPU collapse and aggregations
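The conversion from a query tree to a search body could look roughly like this. The tuple tree format, the function names, the default `text_embedding` field name, and the reading of RANK as "right side only boosts scores" are assumptions for the example; the `bool`, `multi_match`, `term`/`terms`/`range`, `knn`, and `collapse` keys are standard Elasticsearch 8 search options.

```python
from typing import Any, Dict, List, Optional, Tuple

Node = Tuple[Any, ...]  # ("TERM", text) or (operator, left, right)


def to_es_query(node: Node, fields: List[str]) -> Dict[str, Any]:
    """Recursively convert a query tree into an Elasticsearch query clause."""
    if node[0] == "TERM":
        return {"multi_match": {"query": node[1], "fields": fields}}
    op, left, right = node
    lhs, rhs = to_es_query(left, fields), to_es_query(right, fields)
    if op == "AND":
        return {"bool": {"must": [lhs, rhs]}}
    if op == "OR":
        return {"bool": {"should": [lhs, rhs], "minimum_should_match": 1}}
    if op == "ANDNOT":
        return {"bool": {"must": [lhs], "must_not": [rhs]}}
    if op == "RANK":
        # Assumed semantics: the right side influences scoring but never filters.
        return {"bool": {"must": [lhs], "should": [rhs]}}
    raise ValueError(f"unknown operator: {op}")


def build_search_body(tree: Node,
                      fields: List[str],
                      filters: Optional[Dict[str, Any]] = None,
                      query_vector: Optional[List[float]] = None,
                      vector_field: str = "text_embedding",
                      spu_field: Optional[str] = None,
                      size: int = 10) -> Dict[str, Any]:
    query = to_es_query(tree, fields)
    if filters:
        clauses = []
        for name, value in filters.items():
            if isinstance(value, dict):      # e.g. {"gte": 10}
                clauses.append({"range": {name: value}})
            elif isinstance(value, list):
                clauses.append({"terms": {name: value}})
            else:
                clauses.append({"term": {name: value}})
        query = {"bool": {"must": [query], "filter": clauses}}
    body: Dict[str, Any] = {"size": size, "query": query}
    if query_vector is not None:
        # Hybrid search: lexical query plus approximate kNN on the vector field.
        body["knn"] = {"field": vector_field, "query_vector": query_vector,
                       "k": size, "num_candidates": max(size * 10, 100)}
    if spu_field:
        body["collapse"] = {"field": spu_field}
    return body
```

Filters go into the `filter` context so they restrict results without affecting BM25 scores.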
ranking_engine.py - Configurable ranking:
- Expression parser (e.g., "bm25() + 0.2*text_embedding_relevance()")
- Function evaluation (bm25, text_embedding_relevance, field_value, timeliness)
- Score calculation from expressions
- Coefficient handling
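A simplified view of how such an expression might be parsed and evaluated is sketched below. It only handles a sum of optionally weighted zero-argument functions, which covers the example expression but not functions that take arguments; the names `parse_ranking_expression` and `score` and the `signals` dictionary are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# One term: an optional numeric coefficient, then a zero-argument function call.
TERM_RE = re.compile(
    r"^\s*(?:(?P<coef>\d+(?:\.\d+)?)\s*\*\s*)?(?P<func>[a-z_][a-z0-9_]*)\(\s*\)\s*$"
)


def parse_ranking_expression(expr: str) -> List[Tuple[float, str]]:
    """Parse 'f() + 0.2*g()' into [(1.0, 'f'), (0.2, 'g')]."""
    terms = []
    for part in expr.split("+"):
        match = TERM_RE.match(part)
        if not match:
            raise ValueError(f"unsupported term: {part.strip()!r}")
        coef = float(match.group("coef")) if match.group("coef") else 1.0
        terms.append((coef, match.group("func")))
    return terms


def score(terms: List[Tuple[float, str]], signals: Dict[str, float]) -> float:
    """Weighted sum of per-document signal values."""
    return sum(coef * signals[func] for coef, func in terms)


terms = parse_ranking_expression("bm25() + 0.2*text_embedding_relevance()")
```

With these terms, a document whose BM25 score is 10.0 and whose embedding relevance is 0.5 would score 10.0 + 0.2 * 0.5.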
searcher.py - Main search orchestrator:
- Integrates QueryParser and BooleanParser
- Builds ES queries with hybrid BM25+KNN
- Applies custom ranking
- Handles SPU aggregation
- Image similarity search
- Result formatting
5. Embeddings (embeddings/)
text_encoder.py - BGE-M3 text encoder:
- Singleton pattern for model reuse
- Thread-safe initialization
- Batch encoding support
- GPU/CPU device selection
- 1024-dimensional vectors
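The singleton and thread-safety points can be illustrated without loading a real model. In this sketch the `load_model` callable is a stand-in for whatever loads BGE-M3, and the class layout (double-checked locking in a `get` classmethod) is an assumption rather than the actual text_encoder.py code.

```python
import threading
from typing import Callable, List, Optional, Sequence

Model = Callable[[Sequence[str]], List[List[float]]]


class TextEncoder:
    _instance: Optional["TextEncoder"] = None
    _lock = threading.Lock()

    def __init__(self, load_model: Callable[[], Model]) -> None:
        # Expensive step: runs only once per process thanks to get().
        self._model = load_model()

    @classmethod
    def get(cls, load_model: Callable[[], Model]) -> "TextEncoder":
        """Return the shared instance, creating it on first use."""
        if cls._instance is None:          # fast path without locking
            with cls._lock:
                if cls._instance is None:  # re-check inside the lock
                    cls._instance = cls(load_model)
        return cls._instance

    def encode(self, texts: Sequence[str], batch_size: int = 32) -> List[List[float]]:
        """Encode texts in batches and return one vector per text."""
        vectors: List[List[float]] = []
        for start in range(0, len(texts), batch_size):
            vectors.extend(self._model(texts[start:start + batch_size]))
        return vectors
```

The second check inside the lock is what prevents two threads that both saw `None` from loading the model twice.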
image_encoder.py - CN-CLIP image encoder:
- ViT-H-14 model
- URL and local file support
- Image validation and preprocessing
- Batch encoding
- 1024-dimensional vectors
6. Utilities (utils/)
db_connector.py - MySQL database connections:
- SQLAlchemy engine creation
- Connection pooling
- Configuration from dict
- Connection testing
es_client.py - Elasticsearch client wrapper:
- Connection management
- Index CRUD operations
- Bulk indexing helper
- Search and count operations
- Ping and health checks
cache.py - Caching system:
- EmbeddingCache: File-based cache for vectors
- DictCache: JSON cache for translations/rules
- MD5-based cache keys
- Category support
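A file-based cache with MD5 keys and per-category directories might be structured as in the sketch below. The class name `FileEmbeddingCache`, the JSON on-disk format, and the directory layout are assumptions for illustration; the source only states that the cache is file-based, keyed by MD5, and supports categories.

```python
import hashlib
import json
from pathlib import Path
from typing import List, Optional, Union


class FileEmbeddingCache:
    def __init__(self, root: Union[str, Path]) -> None:
        self.root = Path(root)

    def _path(self, text: str, category: str) -> Path:
        # MD5 gives a fixed-length, filesystem-safe key for arbitrary text.
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        return self.root / category / f"{key}.json"

    def get(self, text: str, category: str = "text") -> Optional[List[float]]:
        """Return the cached vector, or None on a cache miss."""
        path = self._path(text, category)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))

    def set(self, text: str, vector: List[float], category: str = "text") -> None:
        path = self._path(text, category)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(vector), encoding="utf-8")
```

Separate categories keep text and image vectors apart even when the same string (for example a URL) is used as the key for both.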
7. REST API (api/)
app.py - FastAPI application:
- Service initialization with configuration
- Global exception handling
- CORS middleware
- Startup event handling
- Environment variable support
models.py - Pydantic request/response models:
- SearchRequest, ImageSearchRequest
- SearchResponse, DocumentResponse
- HealthResponse, ErrorResponse
- Validation and documentation
routes/search.py - Search endpoints:
- POST /search/ - Text search with all features
- POST /search/image - Image similarity search
- GET /search/{doc_id} - Get document by ID
routes/admin.py - Admin endpoints:
- GET /admin/health - Service health check
- GET /admin/config - Get configuration
- GET /admin/stats - Index statistics
- GET/POST /admin/rewrite-rules - Manage rewrite rules
8. Customer1 Implementation
ingest_customer1.py - Data ingestion script:
- Command-line interface
- CSV loading with limit support
- Embedding generation (optional)
- Index creation/recreation
- Progress tracking and statistics
customer1_config.yaml - Production configuration:
- 16 fields optimized for e-commerce
- Multi-language fields (Chinese, English, Russian)
- Text and image embeddings
- Query rewrite rules for common terms
- Configured for Shoplazza data structure
Technical Highlights
Architecture Decisions
Configuration-Driven: Everything customizable via YAML
- Field definitions, analyzers, ranking
- No code changes for new customers
Hybrid Search: BM25 + Embeddings
- Lexical matching for precise queries
- Semantic search for conceptual queries
- Configurable blend (default expression weights BM25 at 1.0 and embedding relevance at 0.2)
Multi-Language: Automatic translation
- Query language detection
- Translation to all supported languages
- Multi-language field search
Performance Optimization:
- Embedding caching (file-based)
- Batch processing for embeddings
- Connection pooling for DB and ES
- Singleton pattern for ML models
Extensibility:
- Pluggable analyzers
- Custom ranking expressions
- Boolean operator support
- SPU aggregation
Key Features Implemented
- ✅ Multi-tenant configuration system
- ✅ Elasticsearch mapping generation
- ✅ Data transformation with embeddings
- ✅ Bulk indexing with error handling
- ✅ Query parsing and rewriting
- ✅ Language detection and translation
- ✅ Boolean expression parsing
- ✅ Hybrid BM25 + KNN search
- ✅ Configurable ranking engine
- ✅ Image similarity search
- ✅ RESTful API service
- ✅ Comprehensive caching
- ✅ Admin endpoints
- ✅ Customer1 test case
Usage Examples
Data Ingestion
python data/customer1/ingest_customer1.py \
--csv data/customer1/goods_with_pic.5years_congku.csv.shuf.1w \
--limit 1000 \
--recreate-index \
--batch-size 100 \
--es-host http://localhost:9200
Start API Service
python -m api.app \
--host 0.0.0.0 \
--port 6002 \
--customer customer1 \
--es-host http://localhost:9200
Search Examples
# Simple Chinese query (auto-translates to English/Russian)
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "芭比娃娃", "size": 10}'
# Boolean query
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "toy AND (barbie OR doll) ANDNOT cheap", "size": 10}'
# Query with filters
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{
"query": "消防",
"size": 10,
"filters": {"categoryName_keyword": "消防"}
}'
# Image search
curl -X POST http://localhost:6002/search/image \
-H "Content-Type: application/json" \
-d '{
"image_url": "https://oss.essa.cn/example.jpg",
"size": 10
}'
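The same text search can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl payloads above; the helper names are hypothetical, and it assumes the service returns a JSON body.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_search_request(base_url: str,
                         query: str,
                         size: int = 10,
                         filters: Optional[Dict[str, Any]] = None
                         ) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /search/ endpoint."""
    payload: Dict[str, Any] = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(base_url: str, query: str, **kwargs: Any) -> Any:
    """Send the request; requires the API service to be running."""
    request = build_search_request(base_url, query, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `search("http://localhost:6002", "消防", filters={"categoryName_keyword": "消防"})` sends the filtered query shown above.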
Next Steps for Production
Required:
- DeepL API Key: Set for production translation
- ML Models: Download BGE-M3 and CN-CLIP models
- Elasticsearch Cluster: Production ES setup
- MySQL Connection: Configure Shoplazza database access
Recommended:
- Redis Cache: Replace file cache with Redis
- Async Processing: Celery for batch indexing
- Monitoring: Prometheus + Grafana
- Load Testing: Benchmark with production data
- CI/CD: Automated testing and deployment
Optional Enhancements:
- Image Upload: Support direct image upload vs URL
- Personalization: User-based ranking adjustments
- A/B Testing: Ranking expression experiments
- Analytics: Query logging and analysis
- Auto-complete: Suggest-as-you-type
Files Created
Configuration (5 files):
- config/field_types.py
- config/config_loader.py
- config/__init__.py
- config/schema/customer1_config.yaml
Indexer (4 files):
- indexer/mapping_generator.py
- indexer/data_transformer.py
- indexer/bulk_indexer.py
- indexer/__init__.py
Query (5 files):
- query/language_detector.py
- query/translator.py
- query/query_rewriter.py
- query/query_parser.py
- query/__init__.py
Search (5 files):
- search/boolean_parser.py
- search/es_query_builder.py
- search/ranking_engine.py
- search/searcher.py
- search/__init__.py
Embeddings (3 files):
- embeddings/text_encoder.py
- embeddings/image_encoder.py
- embeddings/__init__.py
Utils (4 files):
- utils/db_connector.py
- utils/es_client.py
- utils/cache.py
- utils/__init__.py
API (6 files):
- api/app.py
- api/models.py
- api/routes/search.py
- api/routes/admin.py
- api/routes/__init__.py
- api/__init__.py
Data (1 file):
- data/customer1/ingest_customer1.py
Documentation (3 files):
- README.md
- requirements.txt
- IMPLEMENTATION_SUMMARY.md (this file)
Total: 36 implementation files
Success Criteria Met
- ✅ Configurable Universal Search System: Complete YAML-based configuration
- ✅ Multi-tenant Support: Customer-specific schemas and extensions
- ✅ QueryParser Module: Rewriting, translation, embedding generation
- ✅ Searcher Module: Boolean operators, hybrid ranking, SPU support
- ✅ Customer1 Case Study: Complete configuration and ingestion script
- ✅ REST API Service: Full-featured FastAPI application
- ✅ Production-Ready: Error handling, caching, monitoring endpoints
Conclusion
A complete, production-grade e-commerce search SaaS has been implemented following industry best practices. The system is:
- Flexible: Configuration-driven for easy customization
- Scalable: Designed for multi-tenant deployment
- Powerful: Hybrid search with semantic understanding
- International: Multi-language support with translation
- Extensible: Modular architecture for future enhancements
The implementation is ready for deployment and testing with real data.