# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a **Search Engine SaaS** platform for e-commerce product search, specifically designed for Shoplazza (店匠) independent sites. It's a multi-tenant configurable search system built on Elasticsearch with AI-powered search capabilities.

**Tech Stack:**
- Elasticsearch 8.x as the search engine backend
- MySQL (Shoplazza database) as the primary data source
- Python 3.10 with PyTorch/CUDA support
- BGE-M3 model for text embeddings (1024-dim vectors)
- CN-CLIP (ViT-H-14) for image embeddings
- FastAPI for REST API layer

## Development Environment

**Required Environment Setup:**
```bash
source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine
```

**Database Configuration:**
```
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R
```

## Common Development Commands

### Environment Setup

```bash
# Complete environment setup
./setup.sh

# Install Python dependencies
pip install -r requirements.txt
```

### Data Management

```bash
# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh

# Ingest data to Elasticsearch
./scripts/ingest.sh [recreate]  # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50
```

### Running Services

```bash
# Start all services (production)
./run.sh

# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload

# Start frontend debugging UI
./scripts/start_frontend.sh
```

### Testing

```bash
# Run all tests
python -m pytest tests/

# Run specific test types
python -m pytest tests/unit/         # Unit tests
python -m pytest tests/integration/  # Integration tests
python -m pytest -m "api"            # API tests only

# Test search from command line
python main.py search "query" --tenant-id 1 --size 10
```

### Development Utilities

```bash
# Stop all services
./scripts/stop.sh

# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh

# Install server dependencies
./scripts/install_server_deps.sh
```

## Architecture Overview

### Core Components

```
/data/tw/SearchEngine/
├── api/           # FastAPI REST API service (port 6002)
├── config/        # Configuration management system
├── indexer/       # MySQL → Elasticsearch data pipeline
├── search/        # Search engine and ranking logic
├── query/         # Query parsing, translation, rewriting
├── embeddings/    # ML models (BGE-M3, CN-CLIP)
├── scripts/       # Automation and utility scripts
├── utils/         # Shared utilities (ES client, etc.)
├── frontend/      # Simple debugging UI
├── mappings/      # Elasticsearch index mappings
└── tests/         # Unit and integration tests
```

### Data Flow Architecture

**Pipeline**: MySQL → Indexer → Elasticsearch → API → Frontend

1. **Data Source Layer**:
   - Shoplazza MySQL database with `shoplazza_product_sku` and `shoplazza_product_spu` tables
   - Tenant-specific extension tables for custom attributes and multi-language fields

2. **Indexing Layer** (`indexer/`):
   - Reads from MySQL, applies transformations with embeddings
   - Uses `DataTransformer` and `IndexingPipeline` for batch processing
   - Supports both full and incremental indexing with embedding caching

3. **Query Processing Layer** (`query/`):
   - `QueryParser`: Handles query rewriting, translation, and text embedding conversion
   - Multi-language support with automatic detection and translation
   - Boolean logic parsing with operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`

4. **Search Engine Layer** (`search/`):
   - `Searcher`: Executes hybrid searches combining BM25 and dense vectors
   - Configurable ranking expressions with function_score support
   - Multi-tenant isolation via `tenant_id` field

5. 
**API Layer** (`api/`):
   - FastAPI service on port 6002 with multi-tenant support
   - Text search: `POST /search/`
   - Image search: `POST /image-search/`
   - Tenant identification via `X-Tenant-ID` header

### Multi-Tenant Configuration System

The system uses centralized configuration through `config/config.yaml`:

1. **Field Configuration** (`config/field_types.py`):
   - Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
   - Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
   - Required fields and preprocessing rules

2. **Index Configuration** (`mappings/search_products.json`):
   - Unified index structure shared by all tenants
   - Elasticsearch field mappings and analyzer configurations
   - BM25 similarity with modified parameters (`b=0.0, k1=0.0`)

3. **Query Configuration** (`search/query_config.py`):
   - Query domain definitions (default, category_name, title, brand_name, etc.)
   - Ranking expressions and function_score configurations
   - Translation and embedding settings

### Embedding Models

**Text Embedding** (`embeddings/bge_encoder.py`):
- Uses BGE-M3 model (`Xorbits/bge-m3`)
- Singleton pattern with thread-safe initialization
- Generates 1024-dimensional vectors with GPU/CUDA support
- Configurable caching to avoid recomputation

**Image Embedding** (`embeddings/clip_encoder.py`):
- Uses CN-CLIP model (ViT-H-14)
- Downloads and preprocesses images from URLs
- Supports both local and remote image processing
- Generates 1024-dimensional vectors

### Search and Ranking

**Hybrid Search Approach**:
- Combines traditional BM25 text relevance with dense vector similarity
- Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
- Configurable ranking expressions like: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`

**Boolean Search Support**:
- Full boolean logic with AND, OR, ANDNOT, RANK operators
- Parentheses for complex query structures
- Configurable operator precedence

**Faceted Search**:
- Terms and range faceting support
- Multi-dimensional filtering capabilities
- Configurable facet fields and aggregations

## Testing Infrastructure

**Test Framework**: pytest with async support

**Test Structure**:
- `tests/conftest.py`: Comprehensive test fixtures and configuration
- `tests/unit/`: Unit tests for individual components
- `tests/integration/`: Integration tests for system workflows
- Test markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.api`

**Test Data**:
- Tenant1: Mock data with 10,000 product records
- Tenant2: CSV-based test dataset
- Automated test data generation via `scripts/mock_data.sh`

**Key Test Fixtures** (from `conftest.py`):
- `sample_search_config`: Complete configuration for testing
- `mock_es_client`: Mocked Elasticsearch client
- `test_searcher`: Searcher instance with mock dependencies
- `temp_config_file`: Temporary YAML configuration for tests

## API Endpoints

**Main API** (FastAPI on port 6002):
- `POST /search/` - Text search with multi-language support
- `POST /image-search/` - Image search using CN-CLIP embeddings
- Health check and management endpoints
- Multi-tenant support via `X-Tenant-ID` header

**API Features**:
- Hybrid search combining text and vector similarity
- Configurable ranking and filtering
- Faceted search with aggregations
- Multi-language query processing and translation
- Real-time search with configurable result sizes

## Key Implementation Details

1. **Environment Variables**: All sensitive configuration stored in `.env` (template: `.env.example`)
2. **Configuration Management**: Dynamic config loading through `config/config_loader.py`
3. **Error Handling**: Comprehensive error handling with proper HTTP status codes
4. **Performance**: Batch processing for indexing, embedding caching, and connection pooling
5. **Logging**: Structured logging with request tracing for debugging
6. 
**Security**: Tenant isolation at the index level with proper access controls

## Database Tables

**Main Tables**:
- `shoplazza_product_sku` - SKU level product data with pricing and inventory
- `shoplazza_product_spu` - SPU level product data with categories and attributes
- Tenant extension tables for custom fields and multi-language content

**Data Processing**:
- Full data sync handled by separate Java project (not in this repo)
- This repository includes test implementations for development and debugging
- Extension tables joined with main tables during indexing process
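
The join of extension tables with the main SPU and SKU tables can be pictured with a small in-memory sketch. This is an illustration only, not the repository's `DataTransformer` or `IndexingPipeline` code: the function name `build_documents` and every column name (`id`, `spu_id`, `tenant_id`, `title`, `price`, `title_zh`) are hypothetical placeholders rather than the real Shoplazza schema, and it assumes one output document per SPU with at most one extension row per SPU.

```python
from collections import defaultdict
from typing import Any

Row = dict[str, Any]


def build_documents(spu_rows: list[Row], sku_rows: list[Row], ext_rows: list[Row]) -> list[Row]:
    """Join SPU, SKU, and extension rows into one document per SPU.

    Column names are hypothetical placeholders, not the real schema.
    """
    # Group SKU rows by (tenant_id, spu_id) so each SPU lookup is a dict access.
    skus_by_spu: dict[tuple, list[Row]] = defaultdict(list)
    for sku in sku_rows:
        skus_by_spu[(sku["tenant_id"], sku["spu_id"])].append(sku)

    # Index extension rows by the same key (assumes at most one row per SPU).
    ext_by_spu = {(ext["tenant_id"], ext["spu_id"]): ext for ext in ext_rows}

    docs = []
    for spu in spu_rows:
        key = (spu["tenant_id"], spu["id"])
        skus = skus_by_spu.get(key, [])
        doc = {
            "tenant_id": spu["tenant_id"],  # kept on every document for tenant filtering
            "spu_id": spu["id"],
            "title": spu["title"],
            # Lowest SKU price, or None when the SPU has no SKUs.
            "min_price": min((s["price"] for s in skus), default=None),
            "skus": [{"sku_id": s["id"], "price": s["price"]} for s in skus],
        }
        # Copy extension columns onto the document, skipping the join keys.
        ext = ext_by_spu.get(key, {})
        doc.update({k: v for k, v in ext.items() if k not in ("tenant_id", "spu_id")})
        docs.append(doc)
    return docs


if __name__ == "__main__":
    spus = [{"id": 10, "tenant_id": 1, "title": "Red shirt"}]
    skus = [{"id": 100, "spu_id": 10, "tenant_id": 1, "price": 19.9}]
    exts = [{"spu_id": 10, "tenant_id": 1, "title_zh": "红色衬衫"}]
    for document in build_documents(spus, skus, exts):
        print(document)
```

Keying both lookups on `(tenant_id, spu_id)` keeps rows from different tenants apart even if their SPU ids collide, which matches the `tenant_id`-based isolation described above.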