# CLAUDE.md

This file provides comprehensive guidance for Claude Code (claude.ai/code) when working with this Search Engine codebase.

## Project Overview

This is a **production-ready Multi-Tenant E-Commerce Search SaaS** platform specifically designed for Shoplazza (店匠) independent sites. It's a sophisticated search system that combines traditional keyword-based search with modern AI/ML capabilities, serving multiple tenants from a unified infrastructure.

**Core Architecture Philosophy:**
- **Unified Multi-Tenant Design**: Single Elasticsearch index with tenant isolation via `tenant_id`
- **Hybrid Search Engine**: BM25 text relevance + Dense vector similarity (BGE-M3)
- **SPU-Centric Indexing**: Product-level indexing with nested SKU structures
- **Production-Grade**: Comprehensive error handling, monitoring, and operational features

**Tech Stack:**
- **Search Backend**: Elasticsearch 8.x with custom BM25 similarity (b=0.0, k1=0.0)
- **Data Source**: MySQL (Shoplazza database) with custom data transformers
- **Backend Framework**: FastAPI with async support and comprehensive middleware
- **ML/AI Models**: BGE-M3 for text embeddings (1024-dim), CN-CLIP for image embeddings (1024-dim)
- **Language Processing**: Multi-language support (Chinese, English, Russian) with DeepL API
- **API Layer**: RESTful FastAPI service on port 6002 with auto-generated documentation
- **Frontend**: Debugging UI on port 6003 with real-time search capabilities

## Development Environment

**Required Environment Setup:**
```bash
source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine
```

**Database Configuration:**
```yaml
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R
```

**Service Endpoints:**
- **Backend API**: http://localhost:6002 (FastAPI)
- **Frontend UI**: http://localhost:6003 (Debug interface)
- **Elasticsearch**: http://localhost:9200
- **API Documentation**: http://localhost:6002/docs

## Common Development Commands

### Environment Setup
```bash
# Complete environment setup
./setup.sh

# Install Python dependencies
pip install -r requirements.txt
```

### Data Management
```bash
# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh

# Ingest data to Elasticsearch
./scripts/ingest.sh [recreate]  # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50
```

### Running Services
```bash
# Start all services (production)
./run.sh

# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload

# Start frontend debugging UI
./scripts/start_frontend.sh
```

### Testing
```bash
# Run all tests
python -m pytest tests/

# Run specific test types
python -m pytest tests/unit/          # Unit tests
python -m pytest tests/integration/   # Integration tests
python -m pytest -m "api"             # API tests only

# Test search from command line
python main.py search "query" --tenant-id 1 --size 10
```

### Development Utilities
```bash
# Stop all services
./scripts/stop.sh

# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh

# Install server dependencies
./scripts/install_server_deps.sh
```

## Architecture Overview

### Core Components

```
/data/tw/SearchEngine/
├── api/          # FastAPI REST API service (port 6002)
├── config/       # Configuration management system
├── indexer/      # MySQL → Elasticsearch data pipeline
├── search/       # Search engine and ranking logic
├── query/        # Query parsing, translation, rewriting
├── embeddings/   # ML models (BGE-M3, CN-CLIP)
├── scripts/      # Automation and utility scripts
├── utils/        # Shared utilities (ES client, etc.)
├── frontend/     # Simple debugging UI
├── mappings/     # Elasticsearch index mappings
└── tests/        # Unit and integration tests
```

### Data Flow Architecture

**Pipeline**: MySQL → Indexer → Elasticsearch → API → Frontend

1. **Data Source Layer**:
   - Shoplazza MySQL database with `shoplazza_product_sku` and `shoplazza_product_spu` tables
   - Tenant-specific extension tables for custom attributes and multi-language fields

2. **Indexing Layer** (`indexer/`):
   - Reads from MySQL, applies transformations with embeddings
   - Uses `DataTransformer` and `IndexingPipeline` for batch processing
   - Supports both full and incremental indexing with embedding caching

3. **Query Processing Layer** (`query/`):
   - `QueryParser`: Handles query rewriting, translation, and text embedding conversion
   - Multi-language support with automatic detection and translation
   - Boolean logic parsing with operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`

4. **Search Engine Layer** (`search/`):
   - `Searcher`: Executes hybrid searches combining BM25 and dense vectors
   - Configurable ranking expressions with function_score support
   - Multi-tenant isolation via `tenant_id` field

5. **API Layer** (`api/`):
   - FastAPI service on port 6002 with multi-tenant support
   - Text search: `POST /search/`
   - Image search: `POST /image-search/`
   - Tenant identification via `X-Tenant-ID` header

### Multi-Tenant Configuration System

The system uses centralized configuration through `config/config.yaml`:

1. **Field Configuration** (`config/field_types.py`):
   - Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
   - Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
   - Required fields and preprocessing rules

2. **Index Configuration** (`mappings/search_products.json`):
   - Unified index structure shared by all tenants
   - Elasticsearch field mappings and analyzer configurations
   - BM25 similarity with modified parameters (`b=0.0, k1=0.0`)

3. **Query Configuration** (`search/query_config.py`):
   - Query domain definitions (default, category_name, title, brand_name, etc.)
   - Ranking expressions and function_score configurations
   - Translation and embedding settings

### Embedding Models

**Text Embedding** (`embeddings/bge_encoder.py`):
- Uses BGE-M3 model (`Xorbits/bge-m3`)
- Singleton pattern with thread-safe initialization
- Generates 1024-dimensional vectors with GPU/CUDA support
- Configurable caching to avoid recomputation

**Image Embedding** (`embeddings/clip_encoder.py`):
- Uses CN-CLIP model (ViT-H-14)
- Downloads and preprocesses images from URLs
- Supports both local and remote image processing
- Generates 1024-dimensional vectors

### Search and Ranking

**Hybrid Search Approach**:
- Combines traditional BM25 text relevance with dense vector similarity
- Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
- Configurable ranking expressions like: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`

**Boolean Search Support**:
- Full boolean logic with AND, OR, ANDNOT, RANK operators
- Parentheses for complex query structures
- Configurable operator precedence

**Faceted Search**:
- Terms and range faceting support
- Multi-dimensional filtering capabilities
- Configurable facet fields and aggregations

## Testing Infrastructure

**Test Framework**: pytest with async support

**Test Structure**:
- `tests/conftest.py`: Comprehensive test fixtures and configuration
- `tests/unit/`: Unit tests for individual components
- `tests/integration/`: Integration tests for system workflows
- Test markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.api`

**Test Data**:
- Tenant1: Mock data with 10,000 product records
- Tenant2: CSV-based test dataset
- Automated test data generation via `scripts/mock_data.sh`

**Key Test Fixtures** (from `conftest.py`):
- `sample_search_config`: Complete configuration for testing
- `mock_es_client`: Mocked Elasticsearch client
- `test_searcher`: Searcher instance with mock dependencies
- `temp_config_file`: Temporary YAML configuration for tests

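A test built on a mocked Elasticsearch client might look roughly like the sketch below. It is illustrative only: `make_mock_es_client` and `TinySearcher` are hypothetical stand-ins written for this example, not the project's `mock_es_client` fixture or `Searcher` class, and the query body and the `search(index=..., body=...)` call shape are assumptions. The point it shows is how a unit test can assert that every query carries a `tenant_id` filter without a running cluster.

```python
from unittest.mock import MagicMock


def make_mock_es_client(hits):
    """Build a stand-in ES client whose search() returns the given hits.

    Hypothetical helper; under pytest this would typically be a fixture.
    """
    client = MagicMock()
    client.search.return_value = {
        "hits": {"total": {"value": len(hits)}, "hits": hits}
    }
    return client


class TinySearcher:
    """Hypothetical, heavily simplified stand-in for the project's Searcher."""

    def __init__(self, es_client, index="search_products"):
        self.es = es_client
        self.index = index

    def search(self, tenant_id, text, size=10):
        # Assumed query shape: a text match plus a mandatory tenant filter.
        body = {
            "size": size,
            "query": {
                "bool": {
                    "must": [{"match": {"title_zh": text}}],
                    "filter": [{"term": {"tenant_id": tenant_id}}],
                }
            },
        }
        response = self.es.search(index=self.index, body=body)
        return [hit["_source"] for hit in response["hits"]["hits"]]


def test_search_always_filters_by_tenant():
    es = make_mock_es_client([{"_source": {"spu_id": "1", "tenant_id": "1"}}])
    searcher = TinySearcher(es)

    results = searcher.search(tenant_id="1", text="玩具")

    # Inspect the request that would have been sent to Elasticsearch.
    _, kwargs = es.search.call_args
    tenant_filter = {"term": {"tenant_id": "1"}}
    assert tenant_filter in kwargs["body"]["query"]["bool"]["filter"]
    assert results == [{"spu_id": "1", "tenant_id": "1"}]
```

Asserting on the outgoing request body, rather than only on the returned hits, is what catches a missing tenant filter, because a mock returns the same hits regardless of the query.
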
## API Endpoints

**Main API** (FastAPI on port 6002):
- `POST /search/` - Text search with multi-language support
- `POST /image-search/` - Image search using CN-CLIP embeddings
- Health check and management endpoints
- Multi-tenant support via `X-Tenant-ID` header

**API Features**:
- Hybrid search combining text and vector similarity
- Configurable ranking and filtering
- Faceted search with aggregations
- Multi-language query processing and translation
- Real-time search with configurable result sizes

## Core System Architecture & Design

### Unified Multi-Tenant Index Structure

**Index Design Philosophy**: Single `search_products` index shared by all tenants with data isolation via `tenant_id` field.

**Key Benefits**:
- Resource efficiency and cost optimization
- Simplified maintenance and operations
- Better query performance through optimized sharding
- Easier cross-tenant analytics and monitoring

### Index Structure (from mappings/search_products.json)

**Core Document Structure** (SPU-level):
```json
{
  "tenant_id": "keyword",              // Multi-tenant isolation
  "spu_id": "keyword",                 // Product identifier
  "title_zh/en": "text",               // Multi-language titles
  "brief_zh/en": "text",               // Short descriptions
  "description_zh/en": "text",         // Detailed descriptions
  "vendor_zh/en": "text",              // Supplier/brand with keyword subfield
  "category_path_zh/en": "text",       // Hierarchical category paths
  "category_name_zh/en": "text",       // Category names for search
  "category1/2/3_name": "keyword",     // Multi-level category filtering
  "tags": "keyword",                   // Product tags
  "specifications": "nested",          // Product variants (color, size, etc.)
  "skus": "nested",                    // Detailed SKU information
  "min_price/max_price": "float",      // Price range calculations
  "title_embedding": "dense_vector",   // 1024-dim semantic vectors
  "image_embedding": "nested",         // Image vectors for visual search
  "total_inventory": "long"            // Aggregate inventory
}
```

**Analyzers Configuration**:
- **Chinese fields**: `hanlp_index` (indexing) / `hanlp_standard` (searching)
- **English fields**: `english` analyzer
- **BM25 Similarity**: Custom parameters (b=0.0, k1=0.0), which turn off document-length normalization and term-frequency scaling so that each matching term contributes only its IDF weight

### Data Source Architecture (from 索引字段说明v2-参考表结构.md)

**Primary MySQL Tables**:

**shoplazza_product_spu** (Product Level):
```sql
- id, shop_id, shoplazza_id, handle
- title, brief, description, vendor
- category, category_id, category_level, category_path
- image_src, image_width, image_height
- tags, fake_sales, published
- inventory_policy, inventory_quantity
- seo_title, seo_description, seo_keywords
- tenant_id, create_time, update_time
```

**shoplazza_product_sku** (Variant Level):
```sql
- id, spu_id, shop_id, shoplazza_id
- title, sku, barcode
- price, compare_at_price, cost_price
- option1, option2, option3 (variant values)
- inventory_quantity, weight, weight_unit
- image_src, wholesale_price, extend
- tenant_id, create_time, update_time
```

**Data Transformation Pipeline**:
1. **SPU-Centric Aggregation**: Group SKUs under parent SPU
2. **Multi-Language Field Mapping**: MySQL → ES bilingual fields
3. **Category Path Parsing**: Extract hierarchical categories
4. **Specifications Building**: Create nested variant structures
5. **Price Range Calculation**: min/max across all SKUs
6. **Vector Generation**: BGE-M3 embeddings for titles

### Advanced Search Configuration (from config/config.yaml)

**Field Boost Configuration**:
```yaml
field_boosts:
  title_zh/en: 3.0          # Highest priority
  brief_zh/en: 1.5          # Medium priority
  description_zh/en: 1.0    # Lower priority
  vendor_zh/en: 1.5         # Brand emphasis
  category_path_zh/en: 1.5  # Category relevance
  tags: 1.0                 # Tag matching
```

**Search Domains**:
- **default**: Comprehensive search across all text fields
- **title**: Title-focused search (boost: 2.0)
- **vendor**: Brand-specific search (boost: 1.5)
- **category**: Category-focused search (boost: 1.5)
- **tags**: Tag-based search (boost: 1.0)

**Query Processing Features**:
```yaml
query_config:
  supported_languages: ["zh", "en", "ru"]
  enable_translation: true       # DeepL API integration
  enable_text_embedding: true    # BGE-M3 vector search
  enable_query_rewrite: true     # Dictionary-based expansion
  embedding_disable_thresholds:
    chinese_char_limit: 4        # Short query optimization
    english_word_limit: 3
```

**Ranking Formula**:
```
bm25() + 0.2*text_embedding_relevance()
```

### Sophisticated Query Processing Pipeline

**Multi-Language Search Architecture**:
1. **Query Normalization**: Clean and standardize input
2. **Language Detection**: Automatic identification (zh/en/ru)
3. **Query Rewriting**: Dictionary-based expansion and synonyms
4. **Translation Service**: DeepL API for cross-language search
5. **Vector Generation**: BGE-M3 embeddings for semantic search
6. **Boolean Parsing**: Complex expression evaluation

**Boolean Expression Support**:
- **Operators**: AND, OR, ANDNOT, RANK, parentheses
- **Precedence**: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
- **Example**: `玩具 AND (乐高 OR 芭比) ANDNOT 电动`

### E-Commerce Specialized Features

**Specifications System** (Product Variants):
```json
"specifications": [
  { "sku_id": "sku_123", "name": "color", "value": "white" },
  { "sku_id": "sku_123", "name": "size", "value": "256GB" }
]
```

**Advanced Filtering Logic**:
- **Different dimensions** (different `name`): AND relationship
- **Same dimension** (same `name`): OR relationship
- **Example**: `(color=white OR color=black) AND size=256GB`

**Faceted Search Capabilities**:
- **Category Faceting**: Multi-level category aggregations
- **Specifications Faceting**: Nested aggregations by variant name
- **Range Faceting**: Price ranges, date ranges
- **Multi-Select Support**: Disjunctive faceting for filters

**SKU Filtering System**:
- **Dimension-based Grouping**: Filter by `option1/2/3` or specification names
- **Application Layer**: Performance-optimized filtering outside ES
- **Use Case**: Display one SKU per variant combination (e.g., one per color)

### API Architecture & Usage (from 搜索API对接指南.md)

**Core API Endpoints**:
```
POST /search/           # Main text search
POST /search/image      # Image search (CN-CLIP)
GET  /search/{doc_id}   # Document retrieval
GET  /admin/health      # Health check
GET  /admin/config      # Configuration info
GET  /admin/stats       # Index statistics
```

**Request Structure**:
```json
{
  "query": "string (required)",
  "size": 10,
  "from": 0,
  "language": "zh|en",
  "filters": {},
  "range_filters": {},
  "facets": [],
  "sort_by": "price|sales|create_time",
  "sort_order": "asc|desc",
  "sku_filter_dimension": ["color", "size"],
  "min_score": 0.0,
  "debug": false
}
```

**Advanced Filter Examples**:

**Specifications Filtering**:
```json
{
  "filters": {
    "specifications": { "name": "color", "value": "white" }
  }
}
```

**Multi-Dimension Specifications**:
```json
{
  "filters": {
    "specifications": [
      {"name": "color", "value": "white"},
      {"name": "size", "value": "256GB"}
    ]
  }
}
```

**Range Filtering**:
```json
{
  "range_filters": {
    "min_price": {"gte": 50, "lte": 200},
    "create_time": {"gte": "2024-01-01T00:00:00Z"}
  }
}
```

**Faceted Search Configuration**:
```json
{
  "facets": [
    {"field": "category1_name", "size": 15, "type": "terms"},
    {"field": "specifications.color", "size": 20, "type": "terms"},
    {"field": "min_price", "type": "range", "ranges": [...]}
  ]
}
```

### Multi-Select Faceting (NEW FEATURE)

**Standard Mode** (`multi_select: false`):
- Behavior: Selected value becomes the only option shown
- Use Case: Hierarchical category navigation
- Example: Toys → Dolls → Barbie

**Multi-Select Mode** (`multi_select: true`) ⭐:
- Behavior: All options remain visible after selection
- Use Case: Colors, brands, sizes (switchable attributes)
- Example: Select "red" but still see "blue", "green", etc.

**Recommended Configuration**:

| Facet Type | Multi-Select | Reason |
|-----------|-------------|---------|
| Color | `true` | Users need to switch colors |
| Brand | `true` | Users need to compare brands |
| Size | `true` | Users need to check other sizes |
| Category | `false` | Hierarchical navigation |
| Price Range | `false` | Mutually exclusive ranges |

### Production Features & Operations

**Comprehensive Error Handling**:
- Graceful degradation for model failures
- Fallback mechanisms for service unavailability
- Detailed error logging and context tracking

**Performance Optimizations**:
- Embedding caching to avoid redundant computations
- Adaptive vector search (disabled for short queries)
- Batch processing for data indexing
- Connection pooling for database operations

**Security & Multi-Tenancy**:
- Strict tenant isolation via `tenant_id` filtering
- Rate limiting with SlowAPI integration
- CORS and security headers middleware
- Request context logging for auditing

**Monitoring & Observability**:
- Structured logging with request tracing
- Health check endpoints for all dependencies
- Performance metrics and timing information
- Index statistics and document counts

### Data Model Insights

**Key Design Decisions**:

1. **SPU over SKU Indexing**: Each ES document represents a product (SPU) with nested SKUs
   - Reduces index size and improves search performance
   - Maintains variant information through nested structures

2. **Bilingual Field Strategy**: Separate `*_zh` and `*_en` fields
   - Enables language-specific analyzer configuration
   - Provides fallback mechanisms for missing translations

3. **Nested vs Flat Design**: Strategic use of nested vs flattened fields
   - `specifications` and `skus`: Nested for complex queries
   - `min_price`, `total_inventory`: Flattened for filtering/sorting

4. **Vector Field Isolation**: Embedding fields only used for search
   - Not returned in API responses (index: false where appropriate)
   - Reduces network payload and improves response times

### AI/ML Integration Details

**Text Embedding Pipeline**:
- **Model**: BGE-M3 (`Xorbits/bge-m3`)
- **Dimensions**: 1024-dimensional vectors
- **Hardware**: GPU/CUDA acceleration with CPU fallback
- **Caching**: Redis-based caching for common queries
- **Usage**: Semantic search combined with BM25 relevance

**Image Search Pipeline**:
- **Model**: CN-CLIP (ViT-H-14)
- **Processing**: URL download → preprocessing → vectorization
- **Storage**: Nested structure with vector + original URL
- **Application**: Visual similarity search for products

**Translation Integration**:
- **Service**: DeepL API with configurable auth
- **Languages**: Chinese ↔ English ↔ Russian support
- **Caching**: Translation result caching
- **Fallback**: Mock mode returns original text if API unavailable

## Development & Deployment

**Environment Configuration**:
```bash
# Core Services
./run.sh                         # Start all services
./scripts/start_backend.sh       # Backend only (port 6002)
./scripts/start_frontend.sh      # Frontend UI (port 6003)

# Data Operations
./scripts/ingest.sh [recreate]   # Index data
./scripts/mock_data.sh           # Generate test data

# Testing
python -m pytest tests/                       # Full test suite
python main.py search "query" --tenant-id 1   # Quick search test
```

**Key Files for Configuration**:
- `config/config.yaml`: Search behavior configuration
- `mappings/search_products.json`: ES index structure
- `.env`: Environment variables and secrets
- `api/models.py`: Pydantic request/response models

**Common Development Tasks**:
1. **Modifying Search Behavior**: Edit `config/config.yaml`
2. **Changing Index Structure**: Update `mappings/search_products.json`
3. **Adding New Filters**: Extend `api/models.py` with new Pydantic models
4. **Updating Ranking**: Modify `ranking.expression` in config
5. **Testing Queries**: Use frontend UI at http://localhost:6003

## Key Implementation Details

1. **Environment Variables**: All sensitive configuration in `.env` (template: `.env.example`)
2. **Configuration Management**: Centralized YAML config with validation
3. **Error Handling**: Comprehensive exception handling with proper HTTP status codes
4. **Performance**: Batch processing, embedding caching, connection pooling
5. **Logging**: Structured logging with request tracing and context
6. **Security**: Tenant isolation, rate limiting, CORS, security headers
7. **API Documentation**: Auto-generated FastAPI docs at `/docs`
8. **Multi-tenant Architecture**: Single index with `tenant_id` isolation
9. **Hybrid Search**: BM25 + vector similarity with configurable weighting
10. **Production Ready**: Health checks, monitoring, graceful degradation
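
The hybrid weighting in item 9 and the short-query thresholds from `query_config` can be pictured with the minimal sketch below. It is not the project's implementation: the function names are hypothetical, the CJK detection is a simple regex, and the assumption that a query at or below the limit skips the embedding step (rather than strictly below) is a guess at how `embedding_disable_thresholds` is interpreted. The constants mirror the documented values (`chinese_char_limit: 4`, `english_word_limit: 3`, weight `0.2`).

```python
import re

# Values copied from the documented config; comparison semantics are assumed.
CHINESE_CHAR_LIMIT = 4
ENGLISH_WORD_LIMIT = 3
EMBEDDING_WEIGHT = 0.2

# Basic CJK Unified Ideographs block; enough for an illustration.
_CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def should_use_embedding(query: str) -> bool:
    """Decide whether a query is long enough to justify vector search."""
    cjk_count = len(_CJK_CHAR.findall(query))
    if cjk_count > 0:
        # Treat the query as Chinese and count characters.
        return cjk_count > CHINESE_CHAR_LIMIT
    # Otherwise count whitespace-separated words.
    return len(query.split()) > ENGLISH_WORD_LIMIT


def hybrid_score(
    bm25: float,
    embedding_relevance: float,
    use_embedding: bool,
    weight: float = EMBEDDING_WEIGHT,
) -> float:
    """Combine scores as bm25 + weight * embedding_relevance.

    Falls back to the BM25 score alone when the embedding step was skipped.
    """
    if not use_embedding:
        return bm25
    return bm25 + weight * embedding_relevance


if __name__ == "__main__":
    for q in ["乐高", "red dress", "waterproof hiking boots for women"]:
        print(q, should_use_embedding(q))
```

Skipping the embedding for very short queries saves an encoder call where exact keyword matching usually dominates anyway, which is the trade-off the "Adaptive vector search" bullet points at.
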