CLAUDE.md
This file provides comprehensive guidance for Claude Code (claude.ai/code) when working with this Search Engine codebase.
Project Overview
This is a production-ready multi-tenant e-commerce search SaaS platform built for Shoplazza (店匠) independent sites. It combines keyword-based search (BM25) with embedding-based semantic and image search, and serves all tenants from one shared infrastructure.
Core Architecture Philosophy:
- Unified Multi-Tenant Design: Single Elasticsearch index with tenant isolation via tenant_id
- Hybrid Search Engine: BM25 text relevance + dense vector similarity (BGE-M3)
- SPU-Centric Indexing: Product-level indexing with nested SKU structures
- Production-Grade: Comprehensive error handling, monitoring, and operational features
Tech Stack:
- Search Backend: Elasticsearch 8.x with custom BM25 similarity (b=0.0, k1=0.0)
- Data Source: MySQL (Shoplazza database) with custom data transformers
- Backend Framework: FastAPI with async support and comprehensive middleware
- ML/AI Models: BGE-M3 for text embeddings (1024-dim), CN-CLIP for image embeddings (1024-dim)
- Language Processing: Multi-language support (Chinese, English, Russian) with DeepL API
- API Layer: RESTful FastAPI service on port 6002 with auto-generated documentation
- Frontend: Debugging UI on port 6003 with real-time search capabilities
Development Environment
Required Environment Setup:
source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine
Database Configuration:
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R
Service Endpoints:
- Backend API: http://localhost:6002 (FastAPI)
- Frontend UI: http://localhost:6003 (Debug interface)
- Elasticsearch: http://localhost:9200
- API Documentation: http://localhost:6002/docs
Common Development Commands
Environment Setup
# Complete environment setup
./setup.sh
# Install Python dependencies
pip install -r requirements.txt
Data Management
# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh
# Ingest data to Elasticsearch
./scripts/ingest.sh <tenant_id> [recreate] # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50
Running Services
# Start all services (production)
./run.sh
# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload
# Start frontend debugging UI
./scripts/start_frontend.sh
Testing
# Run all tests
python -m pytest tests/
# Run specific test types
python -m pytest tests/unit/ # Unit tests
python -m pytest tests/integration/ # Integration tests
python -m pytest -m "api" # API tests only
# Test search from command line
python main.py search "query" --tenant-id 1 --size 10
Development Utilities
# Stop all services
./scripts/stop.sh
# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh
# Install server dependencies
./scripts/install_server_deps.sh
Architecture Overview
Core Components
/data/tw/SearchEngine/
├── api/ # FastAPI REST API service (port 6002)
├── config/ # Configuration management system
├── indexer/ # MySQL → Elasticsearch data pipeline
├── search/ # Search engine and ranking logic
├── query/ # Query parsing, translation, rewriting
├── embeddings/ # ML models (BGE-M3, CN-CLIP)
├── scripts/ # Automation and utility scripts
├── utils/ # Shared utilities (ES client, etc.)
├── frontend/ # Simple debugging UI
├── mappings/ # Elasticsearch index mappings
└── tests/ # Unit and integration tests
Data Flow Architecture
Pipeline: MySQL → Indexer → Elasticsearch → API → Frontend
Data Source Layer:
- Shoplazza MySQL database with shoplazza_product_sku and shoplazza_product_spu tables
- Tenant-specific extension tables for custom attributes and multi-language fields
Indexing Layer (indexer/):
- Reads from MySQL, applies transformations with embeddings
- Uses DataTransformer and IndexingPipeline for batch processing
- Supports both full and incremental indexing with embedding caching
Query Processing Layer (query/):
- QueryParser: Handles query rewriting, translation, and text embedding conversion
- Multi-language support with automatic detection and translation
- Boolean logic parsing with operator precedence: () > ANDNOT > AND > OR > RANK
Search Engine Layer (search/):
- Searcher: Executes hybrid searches combining BM25 and dense vectors
- Configurable ranking expressions with function_score support
- Multi-tenant isolation via tenant_id field
API Layer (api/):
- FastAPI service on port 6002 with multi-tenant support
- Text search: POST /search/
- Image search: POST /image-search/
- Tenant identification via X-Tenant-ID header
Multi-Tenant Configuration System
The system uses centralized configuration through config/config.yaml:
Field Configuration (config/field_types.py):
- Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
- Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
- Required fields and preprocessing rules
Index Configuration (mappings/search_products.json):
- Unified index structure shared by all tenants
- Elasticsearch field mappings and analyzer configurations
- BM25 similarity with modified parameters (b=0.0, k1=0.0)
Query Configuration (search/query_config.py):
- Query domain definitions (default, category_name, title, brand_name, etc.)
- Ranking expressions and function_score configurations
- Translation and embedding settings
Embedding Models
Text Embedding (embeddings/bge_encoder.py):
- Uses BGE-M3 model (Xorbits/bge-m3)
- Singleton pattern with thread-safe initialization
- Generates 1024-dimensional vectors with GPU/CUDA support
- Configurable caching to avoid recomputation
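The singleton-plus-cache pattern described above can be sketched as follows. This is an illustration only, not the contents of embeddings/bge_encoder.py: the class name TextEncoder, the _load_model helper, and the in-memory dict cache are all hypothetical, and the "model" is a stand-in that returns a fake one-element vector so the sketch runs without downloading BGE-M3.

```python
import threading


class TextEncoder:
    """Hypothetical singleton encoder with double-checked locking."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Fast path: no locking once the instance exists.
        if cls._instance is None:
            with cls._lock:
                # Re-check inside the lock so only one thread loads the model.
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._model = cls._load_model()
                    instance._cache = {}
                    cls._instance = instance
        return cls._instance

    @staticmethod
    def _load_model():
        # Stand-in for the expensive model load (assumption: the real
        # encoder would load BGE-M3 here and return 1024-dim vectors).
        return lambda text: [float(len(text))]

    def encode(self, text):
        # Return the cached vector when the same text was encoded before.
        if text not in self._cache:
            self._cache[text] = self._model(text)
        return self._cache[text]


encoder_a = TextEncoder()
encoder_b = TextEncoder()
print(encoder_a is encoder_b)  # both names refer to the same instance
```

The second check inside the lock is what makes initialization thread-safe: two threads may both pass the outer check, but only the first to acquire the lock performs the load.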
Image Embedding (embeddings/clip_encoder.py):
- Uses CN-CLIP model (ViT-H-14)
- Downloads and preprocesses images from URLs
- Supports both local and remote image processing
- Generates 1024-dimensional vectors
Search and Ranking
Hybrid Search Approach:
- Combines traditional BM25 text relevance with dense vector similarity
- Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
- Configurable ranking expressions like:
static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
Boolean Search Support:
- Full boolean logic with AND, OR, ANDNOT, RANK operators
- Parentheses for complex query structures
- Configurable operator precedence
Faceted Search:
- Terms and range faceting support
- Multi-dimensional filtering capabilities
- Configurable facet fields and aggregations
Testing Infrastructure
Test Framework: pytest with async support
Test Structure:
- tests/conftest.py: Comprehensive test fixtures and configuration
- tests/unit/: Unit tests for individual components
- tests/integration/: Integration tests for system workflows
- Test markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.api
Test Data:
- Tenant1: Mock data with 10,000 product records
- Tenant2: CSV-based test dataset
- Automated test data generation via scripts/mock_data.sh
Key Test Fixtures (from conftest.py):
- sample_search_config: Complete configuration for testing
- mock_es_client: Mocked Elasticsearch client
- test_searcher: Searcher instance with mock dependencies
- temp_config_file: Temporary YAML configuration for tests
API Endpoints
Main API (FastAPI on port 6002):
- POST /search/ - Text search with multi-language support
- POST /image-search/ - Image search using CN-CLIP embeddings
- Health check and management endpoints
- Multi-tenant support via X-Tenant-ID header
API Features:
- Hybrid search combining text and vector similarity
- Configurable ranking and filtering
- Faceted search with aggregations
- Multi-language query processing and translation
- Real-time search with configurable result sizes
Core System Architecture & Design
Unified Multi-Tenant Index Structure
Index Design Philosophy: Single search_products index shared by all tenants with data isolation via tenant_id field.
Key Benefits:
- Resource efficiency and cost optimization
- Simplified maintenance and operations
- Better query performance through optimized sharding
- Easier cross-tenant analytics and monitoring
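Because all tenants share one index, every query has to be restricted to the caller's tenant_id. A minimal sketch of that wrapping step is shown here. The helper name with_tenant_filter is hypothetical; the only details taken from this document are the tenant_id field and its keyword type.

```python
def with_tenant_filter(query, tenant_id):
    """Wrap any Elasticsearch query so it only matches one tenant's documents."""
    return {
        "bool": {
            # The original query still drives relevance scoring.
            "must": [query],
            # Filter context: no effect on the score, and cacheable by ES.
            "filter": [{"term": {"tenant_id": str(tenant_id)}}],
        }
    }


wrapped = with_tenant_filter({"match": {"title_en": "toy"}}, tenant_id=1)
print(wrapped)
```

Placing the tenant restriction in filter context rather than must keeps it out of the relevance score.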
Index Structure (from mappings/search_products.json)
Core Document Structure (SPU-level):
{
"tenant_id": "keyword", // Multi-tenant isolation
"spu_id": "keyword", // Product identifier
"title_zh/en": "text", // Multi-language titles
"brief_zh/en": "text", // Short descriptions
"description_zh/en": "text", // Detailed descriptions
"vendor_zh/en": "text", // Supplier/brand with keyword subfield
"category_path_zh/en": "text", // Hierarchical category paths
"category_name_zh/en": "text", // Category names for search
"category1/2/3_name": "keyword", // Multi-level category filtering
"tags": "keyword", // Product tags
"specifications": "nested", // Product variants (color, size, etc.)
"skus": "nested", // Detailed SKU information
"min_price/max_price": "float", // Price range calculations
"title_embedding": "dense_vector", // 1024-dim semantic vectors
"image_embedding": "nested", // Image vectors for visual search
"total_inventory": "long" // Aggregate inventory
}
Analyzers Configuration:
- Chinese fields: hanlp_index (indexing) / hanlp_standard (searching)
- English fields: english analyzer
- BM25 Similarity: Custom parameters (b=0.0, k1=0.0) for optimized scoring
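To see what b=0.0 and k1=0.0 do to scoring, the sketch below evaluates a Lucene-style BM25 term score. The function is illustrative, not project code, and it follows the form idf * tf / (tf + k1 * (1 - b + b * dl / avgdl)), with the IDF value passed in directly as an assumption.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Score contribution of one query term in one field (Lucene-style form)."""
    if tf == 0:
        return 0.0
    # b controls length normalization; k1 controls term-frequency saturation.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + norm)


# With k1 = 0 the normalization term vanishes, so any matching term scores its IDF.
short_doc = bm25_term_score(tf=1, doc_len=10, avg_doc_len=50, idf=2.0, k1=0.0, b=0.0)
long_doc = bm25_term_score(tf=7, doc_len=500, avg_doc_len=50, idf=2.0, k1=0.0, b=0.0)
print(short_doc == long_doc)
```

Under these parameters, repeated terms and document length no longer influence the score. Only which terms match, and how rare they are, matters.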
Data Source Architecture (from 索引字段说明v2-参考表结构.md)
Primary MySQL Tables:
shoplazza_product_spu (Product Level):
- id, shop_id, shoplazza_id, handle
- title, brief, description, vendor
- category, category_id, category_level, category_path
- image_src, image_width, image_height
- tags, fake_sales, published
- inventory_policy, inventory_quantity
- seo_title, seo_description, seo_keywords
- tenant_id, create_time, update_time
shoplazza_product_sku (Variant Level):
- id, spu_id, shop_id, shoplazza_id
- title, sku, barcode
- price, compare_at_price, cost_price
- option1, option2, option3 (variant values)
- inventory_quantity, weight, weight_unit
- image_src, wholesale_price, extend
- tenant_id, create_time, update_time
Data Transformation Pipeline:
- SPU-Centric Aggregation: Group SKUs under parent SPU
- Multi-Language Field Mapping: MySQL → ES bilingual fields
- Category Path Parsing: Extract hierarchical categories
- Specifications Building: Create nested variant structures
- Price Range Calculation: min/max across all SKUs
- Vector Generation: BGE-M3 embeddings for titles
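The SPU-centric aggregation and price range steps can be sketched as below. This is an assumption-laden illustration, not the DataTransformer implementation: the function name is hypothetical, rows are plain dicts keyed by the column names listed above, and only a few output fields are built.

```python
from collections import defaultdict


def aggregate_spu_documents(spu_rows, sku_rows):
    """Group SKU rows under their parent SPU and derive aggregate fields."""
    skus_by_spu = defaultdict(list)
    for sku in sku_rows:
        skus_by_spu[sku["spu_id"]].append(sku)

    docs = []
    for spu in spu_rows:
        skus = skus_by_spu.get(spu["id"], [])
        prices = [float(s["price"]) for s in skus if s.get("price") is not None]
        docs.append({
            "tenant_id": str(spu["tenant_id"]),
            "spu_id": str(spu["id"]),
            # Price range across all SKUs of this product.
            "min_price": min(prices) if prices else None,
            "max_price": max(prices) if prices else None,
            # Missing inventory values are treated as zero (assumption).
            "total_inventory": sum(int(s.get("inventory_quantity") or 0) for s in skus),
            "skus": [
                {"sku_id": str(s["id"]), "option1": s.get("option1")}
                for s in skus
            ],
        })
    return docs


spus = [{"id": 1, "tenant_id": 1}]
skus = [
    {"id": 10, "spu_id": 1, "price": "99.5", "inventory_quantity": 3, "option1": "white"},
    {"id": 11, "spu_id": 1, "price": "120", "inventory_quantity": None, "option1": "black"},
]
print(aggregate_spu_documents(spus, skus))
```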
Advanced Search Configuration (from config/config.yaml)
Field Boost Configuration:
field_boosts:
title_zh/en: 3.0 # Highest priority
brief_zh/en: 1.5 # Medium priority
description_zh/en: 1.0 # Lower priority
vendor_zh/en: 1.5 # Brand emphasis
category_path_zh/en: 1.5 # Category relevance
tags: 1.0 # Tag matching
Search Domains:
- default: Comprehensive search across all text fields
- title: Title-focused search (boost: 2.0)
- vendor: Brand-specific search (boost: 1.5)
- category: Category-focused search (boost: 1.5)
- tags: Tag-based search (boost: 1.0)
Query Processing Features:
query_config:
supported_languages: ["zh", "en", "ru"]
enable_translation: true # DeepL API integration
enable_text_embedding: true # BGE-M3 vector search
enable_query_rewrite: true # Dictionary-based expansion
embedding_disable_thresholds:
chinese_char_limit: 4 # Short query optimization
english_word_limit: 3
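One way to read embedding_disable_thresholds is that vector search is skipped for queries at or below the configured length. The sketch below implements that reading. The function name, the at-or-below comparison, and the CJK character range used for counting are assumptions rather than confirmed behavior.

```python
import re

# Basic CJK Unified Ideographs block; used only to count Chinese characters.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def should_use_embedding(query, chinese_char_limit=4, english_word_limit=3):
    """Return False for short queries, where BM25 alone is assumed sufficient."""
    cjk_count = len(CJK_CHAR.findall(query))
    if cjk_count > 0:
        return cjk_count > chinese_char_limit
    return len(query.split()) > english_word_limit


print(should_use_embedding("玩具"))
print(should_use_embedding("wireless noise cancelling headphones"))
```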
Ranking Formula:
bm25() + 0.2*text_embedding_relevance()
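One possible mapping of this formula onto an Elasticsearch 8.x request is a text query combined with a top-level knn clause whose boost is 0.2. The sketch below only builds the request body. It is not the Searcher implementation: the function name is hypothetical, and the field list simply expands the zh/en boosts listed above.

```python
FIELD_BOOSTS = {
    "title_zh": 3.0, "title_en": 3.0,
    "brief_zh": 1.5, "brief_en": 1.5,
    "description_zh": 1.0, "description_en": 1.0,
    "vendor_zh": 1.5, "vendor_en": 1.5,
    "category_path_zh": 1.5, "category_path_en": 1.5,
    "tags": 1.0,
}


def build_hybrid_body(text, query_vector, tenant_id, size=10, vector_weight=0.2):
    """Build a search body that sums BM25 and weighted vector similarity."""
    tenant_filter = {"term": {"tenant_id": str(tenant_id)}}
    fields = [f"{name}^{boost}" for name, boost in FIELD_BOOSTS.items()]
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"multi_match": {"query": text, "fields": fields}}],
                "filter": [tenant_filter],
            }
        },
        "knn": {
            "field": "title_embedding",
            "query_vector": query_vector,
            "k": size,
            # num_candidates must be at least k and at most 10000.
            "num_candidates": min(max(100, size * 10), 10000),
            "boost": vector_weight,
            # The tenant filter is applied to the vector search as well.
            "filter": tenant_filter,
        },
    }


body = build_hybrid_body("toy", [0.0] * 1024, tenant_id=1)
print(body["knn"]["boost"], len(body["knn"]["query_vector"]))
```

When a request carries both query and knn, Elasticsearch adds the two scores for documents matched by both, which is why the knn boost plays the role of the 0.2 coefficient.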
Sophisticated Query Processing Pipeline
Multi-Language Search Architecture:
- Query Normalization: Clean and standardize input
- Language Detection: Automatic identification (zh/en/ru)
- Query Rewriting: Dictionary-based expansion and synonyms
- Translation Service: DeepL API for cross-language search
- Vector Generation: BGE-M3 embeddings for semantic search
- Boolean Parsing: Complex expression evaluation
Boolean Expression Support:
- Operators: AND, OR, ANDNOT, RANK, parentheses
- Precedence: () > ANDNOT > AND > OR > RANK
- Example: 玩具 AND (乐高 OR 芭比) ANDNOT 电动 (in English: toys AND (Lego OR Barbie) ANDNOT electric)
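The precedence order above can be implemented with a small precedence-climbing parser. The sketch below is illustrative rather than the QueryParser code: it assumes left-associative binary operators, whitespace-separated single-word terms, and a nested-tuple output format, none of which are confirmed by this document.

```python
import re

# Higher number binds tighter: ANDNOT > AND > OR > RANK.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def parse_boolean_query(query):
    """Parse a query into nested (operator, left, right) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def parse_primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = parse_expr(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if token == ")" or token in PRECEDENCE:
            raise ValueError(f"unexpected token: {token}")
        return token

    def parse_expr(min_prec):
        nonlocal pos
        left = parse_primary()
        while (pos < len(tokens) and tokens[pos] in PRECEDENCE
               and PRECEDENCE[tokens[pos]] >= min_prec):
            op = tokens[pos]
            pos += 1
            # Parsing the right side one level tighter gives left associativity.
            right = parse_expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse_expr(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return tree


print(parse_boolean_query("玩具 AND (乐高 OR 芭比) ANDNOT 电动"))
```

Because ANDNOT binds tighter than AND, the example groups as 玩具 AND ((乐高 OR 芭比) ANDNOT 电动). Adjacent terms with no operator between them are rejected in this sketch.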
E-Commerce Specialized Features
Specifications System (Product Variants):
"specifications": [
{
"sku_id": "sku_123",
"name": "color",
"value": "white"
},
{
"sku_id": "sku_123",
"name": "size",
"value": "256GB"
}
]
Advanced Filtering Logic:
- Different dimensions (different name): AND relationship
- Same dimension (same name): OR relationship
- Example: (color=white OR color=black) AND size=256GB
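The AND-across-dimensions, OR-within-a-dimension rule can be expressed as one nested clause per specification name. The sketch below builds such a filter. The function name is hypothetical, and it assumes specifications.name and specifications.value are keyword sub-fields of the nested specifications field.

```python
from collections import defaultdict


def build_spec_filters(specs):
    """Turn [{"name": ..., "value": ...}, ...] into an Elasticsearch bool filter."""
    values_by_name = defaultdict(list)
    for spec in specs:
        values_by_name[spec["name"]].append(spec["value"])

    clauses = []
    for name, values in values_by_name.items():
        clauses.append({
            "nested": {
                "path": "specifications",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"specifications.name": name}},
                            # terms matches any listed value: OR within a dimension.
                            {"terms": {"specifications.value": values}},
                        ]
                    }
                },
            }
        })
    # Every nested clause must match: AND across dimensions.
    return {"bool": {"filter": clauses}}


print(build_spec_filters([
    {"name": "color", "value": "white"},
    {"name": "color", "value": "black"},
    {"name": "size", "value": "256GB"},
]))
```

Each dimension gets its own nested clause so that name and value are matched within the same specification entry rather than across different entries.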
Faceted Search Capabilities:
- Category Faceting: Multi-level category aggregations
- Specifications Faceting: Nested aggregations by variant name
- Range Faceting: Price ranges, date ranges
- Multi-Select Support: Disjunctive faceting for filters
SKU Filtering System:
- Dimension-based Grouping: Filter by option1/2/3 or specification names
- Application Layer: Performance-optimized filtering outside ES
- Use Case: Display one SKU per variant combination (e.g., one per color)
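Application-layer SKU filtering of this kind amounts to keeping the first SKU seen for each distinct combination of the chosen dimensions. A minimal sketch follows, assuming SKUs are dicts and dimensions are keys such as option1. The function name and the keep-first policy are hypothetical.

```python
def pick_one_sku_per_combination(skus, dimensions):
    """Keep the first SKU for each distinct tuple of dimension values."""
    seen = set()
    kept = []
    for sku in skus:
        key = tuple(sku.get(dim) for dim in dimensions)
        if key not in seen:
            seen.add(key)
            kept.append(sku)
    return kept


skus = [
    {"sku_id": "a", "option1": "white", "option2": "128GB"},
    {"sku_id": "b", "option1": "white", "option2": "256GB"},
    {"sku_id": "c", "option1": "black", "option2": "128GB"},
]
# One SKU per color: keeps "a" (white) and "c" (black).
print(pick_one_sku_per_combination(skus, ["option1"]))
```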
API Architecture & Usage (from 搜索API对接指南.md)
Core API Endpoints:
POST /search/ # Main text search
POST /search/image # Image search (CN-CLIP)
GET /search/{doc_id} # Document retrieval
GET /admin/health # Health check
GET /admin/config # Configuration info
GET /admin/stats # Index statistics
Request Structure:
{
"query": "string (required)",
"size": 10, "from": 0,
"language": "zh|en",
"filters": {}, "range_filters": {},
"facets": [],
"sort_by": "price|sales|create_time",
"sort_order": "asc|desc",
"sku_filter_dimension": ["color", "size"],
"min_score": 0.0,
"debug": false
}
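A request with this structure can be assembled using only the Python standard library, as sketched below. The helper name and default values are hypothetical; the URL, the POST /search/ path, and the X-Tenant-ID header come from this document. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_search_request(query, tenant_id, base_url="http://localhost:6002", **options):
    """Build (but do not send) a POST /search/ request for one tenant."""
    payload = {"query": query, "size": 10, "from": 0, **options}
    return urllib.request.Request(
        url=f"{base_url}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Tenant-ID": str(tenant_id),
        },
        method="POST",
    )


request = build_search_request(
    "toy", tenant_id=1, language="en",
    range_filters={"min_price": {"gte": 50, "lte": 200}},
)
print(request.full_url, request.get_method())
# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```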
Advanced Filter Examples:
Specifications Filtering:
{
"filters": {
"specifications": {
"name": "color",
"value": "white"
}
}
}
Multi-Dimension Specifications:
{
"filters": {
"specifications": [
{"name": "color", "value": "white"},
{"name": "size", "value": "256GB"}
]
}
}
Range Filtering:
{
"range_filters": {
"min_price": {"gte": 50, "lte": 200},
"create_time": {"gte": "2024-01-01T00:00:00Z"}
}
}
Faceted Search Configuration:
{
"facets": [
{"field": "category1_name", "size": 15, "type": "terms"},
{"field": "specifications.color", "size": 20, "type": "terms"},
{"field": "min_price", "type": "range", "ranges": [...]}
]
}
Multi-Select Faceting (NEW FEATURE)
Standard Mode (multi_select: false):
- Behavior: Selected value becomes the only option shown
- Use Case: Hierarchical category navigation
- Example: Toys → Dolls → Barbie
Multi-Select Mode (multi_select: true) ⭐:
- Behavior: All options remain visible after selection
- Use Case: Colors, brands, sizes (switchable attributes)
- Example: Select "red" but still see "blue", "green", etc.
Recommended Configuration:
| Facet Type | Multi-Select | Reason |
|-----------|-------------|---------|
| Color | true | Users need to switch colors |
| Brand | true | Users need to compare brands |
| Size | true | Users need to check other sizes |
| Category | false | Hierarchical navigation |
| Price Range | false | Mutually exclusive ranges |
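Multi-select (disjunctive) faceting is commonly built in Elasticsearch by moving the selected filters into post_filter and computing each facet under every selected filter except, for multi-select facets, its own. The sketch below follows that approach. The function name, input shapes, and flat keyword facet fields are assumptions, not the project's implementation.

```python
def build_facet_request(selected, facets):
    """selected: {field: [values]}; facets: [{"field": str, "multi_select": bool}]."""

    def filters_except(skip_field=None):
        return [
            {"terms": {field: values}}
            for field, values in selected.items()
            if values and field != skip_field
        ]

    aggs = {}
    for facet in facets:
        field = facet["field"]
        # A multi-select facet ignores its own selection so other options stay visible.
        skip = field if facet.get("multi_select") else None
        aggs[field] = {
            "filter": {"bool": {"filter": filters_except(skip)}},
            "aggs": {"values": {"terms": {"field": field, "size": facet.get("size", 10)}}},
        }

    # post_filter narrows the hits without affecting the aggregations above.
    return {"post_filter": {"bool": {"filter": filters_except()}}, "aggs": aggs}


print(build_facet_request(
    {"tags": ["sale"], "category1_name": ["Toys"]},
    [{"field": "tags", "multi_select": True},
     {"field": "category1_name", "multi_select": False}],
))
```

In this example the tags facet is counted over all Toys products, so unselected tags still appear. The category facet is counted under both filters, which narrows it to the selected path.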
Production Features & Operations
Comprehensive Error Handling:
- Graceful degradation for model failures
- Fallback mechanisms for service unavailability
- Detailed error logging and context tracking
Performance Optimizations:
- Embedding caching to avoid redundant computations
- Adaptive vector search (disabled for short queries)
- Batch processing for data indexing
- Connection pooling for database operations
Security & Multi-Tenancy:
- Strict tenant isolation via tenant_id filtering
- Rate limiting with SlowAPI integration
- CORS and security headers middleware
- Request context logging for auditing
Monitoring & Observability:
- Structured logging with request tracing
- Health check endpoints for all dependencies
- Performance metrics and timing information
- Index statistics and document counts
Data Model Insights
Key Design Decisions:
SPU over SKU Indexing: Each ES document represents a product (SPU) with nested SKUs
- Reduces index size and improves search performance
- Maintains variant information through nested structures
Bilingual Field Strategy: Separate *_zh and *_en fields
- Enables language-specific analyzer configuration
- Provides fallback mechanisms for missing translations
Nested vs Flat Design: Strategic use of nested vs flattened fields
- specifications and skus: Nested for complex queries
- min_price, total_inventory: Flattened for filtering/sorting
Vector Field Isolation: Embedding fields only used for search
- Not returned in API responses (index: false where appropriate)
- Reduces network payload and improves response times
AI/ML Integration Details
Text Embedding Pipeline:
- Model: BGE-M3 (Xorbits/bge-m3)
- Dimensions: 1024-dimensional vectors
- Hardware: GPU/CUDA acceleration with CPU fallback
- Caching: Redis-based caching for common queries
- Usage: Semantic search combined with BM25 relevance
Image Search Pipeline:
- Model: CN-CLIP (ViT-H-14)
- Processing: URL download → preprocessing → vectorization
- Storage: Nested structure with vector + original URL
- Application: Visual similarity search for products
Translation Integration:
- Service: DeepL API with configurable auth
- Languages: Chinese ↔ English ↔ Russian support
- Caching: Translation result caching
- Fallback: Mock mode returns original text if API unavailable
Development & Deployment
Environment Configuration:
# Core Services
./run.sh # Start all services
./scripts/start_backend.sh # Backend only (port 6002)
./scripts/start_frontend.sh # Frontend UI (port 6003)
# Data Operations
./scripts/ingest.sh <tenant_id> [recreate] # Index data
./scripts/mock_data.sh # Generate test data
# Testing
python -m pytest tests/ # Full test suite
python main.py search "query" --tenant-id 1 # Quick search test
Key Files for Configuration:
- config/config.yaml: Search behavior configuration
- mappings/search_products.json: ES index structure
- .env: Environment variables and secrets
- api/models.py: Pydantic request/response models
Common Development Tasks:
- Modifying Search Behavior: Edit config/config.yaml
- Changing Index Structure: Update mappings/search_products.json
- Adding New Filters: Extend api/models.py with new Pydantic models
- Updating Ranking: Modify ranking.expression in config
- Testing Queries: Use frontend UI at http://localhost:6003
Key Implementation Details
- Environment Variables: All sensitive configuration in .env (template: .env.example)
- Configuration Management: Centralized YAML config with validation
- Error Handling: Comprehensive exception handling with proper HTTP status codes
- Performance: Batch processing, embedding caching, connection pooling
- Logging: Structured logging with request tracing and context
- Security: Tenant isolation, rate limiting, CORS, security headers
- API Documentation: Auto-generated FastAPI docs at /docs
- Multi-tenant Architecture: Single index with tenant_id isolation
- Hybrid Search: BM25 + vector similarity with configurable weighting
- Production Ready: Health checks, monitoring, graceful degradation