# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a **Search Engine SaaS** project for e-commerce product search, designed for Shoplazza (店匠) independent sites. The system provides Elasticsearch-based product search with multi-language support, text/image embeddings, and configurable ranking.

**Tech Stack:**
- Elasticsearch as the search engine backend
- MySQL (Shoplazza database) as the primary data source
- Python for data processing and ingestion
- BGE-M3 model for text embeddings (1024-dim vectors)
- CN-CLIP (ViT-H-14) for image embeddings

## Database Configuration

**Shoplazza Production Database:**
```
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R
```

**Main Tables:**
- `shoplazza_product_sku` - SKU-level product data
- `shoplazza_product_spu` - SPU-level product data

## Architecture

### Data Flow

1. **Data Source (MySQL)** → Main tables (`shoplazza_product_sku`, `shoplazza_product_spu`) + tenant extension tables
2. **Indexer** → Reads from MySQL, applies transformations (embeddings, etc.), writes to Elasticsearch
3. **Query Parser** → Query rewriting, translation, text embedding conversion
4. **Searcher** → Executes searches against Elasticsearch with configurable ranking

### Multi-Tenant Design

Each tenant has its own extension table that stores custom attributes, multi-language fields (titles, brand names, tags, categories), and business-specific metadata. The main SKU table is joined with the tenant extension table during indexing.

### Configuration System

The system uses two types of configuration per tenant:

1. **Application Structure Config** (`IndexerConfig`) - Defines:
   - Input field mappings from MySQL to Elasticsearch
   - Field types (TEXT, EMBEDDING, LITERAL, INT, DOUBLE, etc.)
   - Which fields require preprocessing (embeddings, transformations)

2. **Index Structure Config** - Defines:
   - Elasticsearch field mappings and analyzers
   - Supported analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
   - Query domain definitions (default, category_name, title, brand_name, etc.)
   - BM25 parameters and similarity configurations

Reference files:
- `商品数据源入ES配置规范.md` - ES mapping and analyzer configuration standards
- `阿里opensearch电商行业.md` - Application and index structure examples from Alibaba OpenSearch

### Query Processing

The `queryParser` performs:

1. **Query Rewriting** - Dictionary-based rewriting (brand terms, category terms, synonyms, corrections)
2. **Translation** - Language detection and translation to support multi-language search (e.g., zh↔en)
3. **Text Embedding** - Converts query text to vectors when vector search is enabled

### Search and Ranking

The `searcher` supports:
- Boolean operators: AND, OR, RANK, ANDNOT, with parentheses for grouping
- Operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
- Configurable ranking expressions for the `default` domain:
  - Example: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`
  - Combines BM25 text relevance, embedding similarity, product scores, and time decay

### Embedding Modules

**Text Embedding** - Uses the BGE-M3 model (`Xorbits/bge-m3`):
- Singleton pattern with thread-safe initialization
- Generates 1024-dimensional vectors
- Configured for GPU/CUDA acceleration

**Image Embedding** - Uses the CN-CLIP model (ViT-H-14):
- Downloads and validates images from URLs
- Preprocesses images (resize, RGB conversion)
- Generates 1024-dimensional vectors
- Supports both local and remote images

## Test Data

**Tenant1 Test Dataset:**
- Location: `data/tenant1/goods_with_pic.5years_congku.csv.shuf.1w`
- Contains 10,000 shuffled product records with images
- Processing script: `data/tenant1/task2_process_goods.py`
  - Extracts product data from MySQL
  - Maps images from the filebank database
  - Creates an inverted index (URL → SKU list)

## Key Implementation Notes

1. **Data Sync:** Full data sync from MySQL to Elasticsearch is handled by a separate Java project (not in this repo). This repo may include a simple full-load implementation for testing purposes.

2. **Extension Tables:** When designing tenant configurations, determine which fields already exist in the main SKU table and which need to be added to tenant-specific extension tables.

3. **Embedding Caching:** For periodic full indexing, embedding results should be cached to avoid recomputation.

4. **ES Similarity Configuration:** All text fields use a modified BM25 with `b=0.0, k1=0.0` as the default similarity. With `k1=0.0`, term frequency and field length no longer affect the score, so each matching term contributes only its IDF weight.

5. **Multi-Language Support:** The system is designed for cross-border e-commerce. It supports at least Chinese and English and is extensible to other languages (Arabic, Spanish, Russian, Japanese).

- Remember that this project's environment is activated with `source /home/tw/miniconda3/etc/profile.d/conda.sh && conda activate searchengine`.
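
The embedding caching note above could be approached along the following lines. This is only a minimal sketch, not the repository's implementation: the `EmbeddingCache` class, the JSON file used for storage, and the `fake_embed` stand-in for the real BGE-M3 call are all hypothetical names chosen for illustration. The idea is to key each vector by a hash of its input text so that a repeated full index only recomputes embeddings for text that has changed.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List


class EmbeddingCache:
    """Hypothetical disk-backed cache: SHA-256 of the input text -> vector."""

    def __init__(self, path: Path, embed_fn: Callable[[str], List[float]]):
        self.path = path
        self.embed_fn = embed_fn
        self.calls = 0  # how many times embed_fn was actually invoked
        self._store: Dict[str, List[float]] = {}
        if path.exists():
            # Reload vectors computed by a previous indexing run.
            self._store = json.loads(path.read_text(encoding="utf-8"))

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so the key changes whenever the content changes.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            # Cache miss: compute once and remember the result.
            self._store[key] = self.embed_fn(text)
            self.calls += 1
        return self._store[key]

    def save(self) -> None:
        # Persist the cache so the next full index can reuse it.
        self.path.write_text(json.dumps(self._store), encoding="utf-8")


def fake_embed(text: str) -> List[float]:
    # Assumption: stand-in for the real model call; returns a tiny
    # deterministic vector instead of a 1024-dimensional embedding.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]
```

In use, the indexer would call `cache.get(title)` for each record and `cache.save()` at the end of the run. A JSON file keeps the sketch short; for 1024-dimensional vectors at catalog scale, a binary format or a key-value store would likely be a better fit.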