CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This is a Search Engine SaaS project for e-commerce product search, designed for Shoplazza (店匠) independent sites. The system provides Elasticsearch-based product search capabilities with multi-language support, text/image embeddings, and configurable ranking.
Tech Stack:
- Elasticsearch as the search engine backend
- MySQL (Shoplazza database) as the primary data source
- Python for data processing and ingestion
- BGE-M3 model for text embeddings (1024-dim vectors)
- CN-CLIP (ViT-H-14) for image embeddings
Database Configuration
Shoplazza Production Database:
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R
Main Tables:
- shoplazza_product_sku - SKU-level product data
- shoplazza_product_spu - SPU-level product data
Architecture
Data Flow
- Data Source (MySQL) → Main tables (shoplazza_product_sku, shoplazza_product_spu) + tenant extension tables
- Indexer → Reads from MySQL, applies transformations (embeddings, etc.), writes to Elasticsearch
- Query Parser → Query rewriting, translation, text embedding conversion
- Searcher → Executes searches against Elasticsearch with configurable ranking
Multi-Tenant Design
Each tenant has their own extension table to store custom attributes, multi-language fields (titles, brand names, tags, categories), and business-specific metadata. The main SKU table is joined with tenant extension tables during indexing.
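The join described above can be sketched in plain Python. This is a minimal illustration, not the repository's indexer: the function name, the sku_id join column, and the rule that main-table fields win on name collisions are all assumptions.

```python
from typing import Any


def merge_sku_with_extension(
    sku_rows: list[dict[str, Any]],
    ext_rows: list[dict[str, Any]],
    key: str = "sku_id",  # hypothetical join column, not confirmed by the schema
) -> list[dict[str, Any]]:
    """Left-join main SKU rows with tenant extension rows on `key`."""
    ext_by_key = {row[key]: row for row in ext_rows}
    merged = []
    for sku in sku_rows:
        doc = dict(sku)
        # Add extension fields without overwriting main-table fields
        # (an assumed collision policy).
        for name, value in ext_by_key.get(sku[key], {}).items():
            doc.setdefault(name, value)
        merged.append(doc)
    return merged
```

SKUs without an extension row pass through unchanged, which mirrors a SQL LEFT JOIN.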
Configuration System
The system uses two types of configurations per tenant:
Application Structure Config (IndexerConfig) - Defines:
- Input field mappings from MySQL to Elasticsearch
- Field types (TEXT, EMBEDDING, LITERAL, INT, DOUBLE, etc.)
- Which fields require preprocessing (embeddings, transformations)
Index Structure Config - Defines:
- Elasticsearch field mappings and analyzers
- Supported analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
- Query domain definitions (default, category_name, title, brand_name, etc.)
- BM25 parameters and similarity configurations
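One possible shape for a per-tenant application structure config is sketched below. The class names (FieldConfig, IndexerConfigSketch) and attribute names are hypothetical and only illustrate the ideas listed above; the real IndexerConfig may look quite different.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class FieldType(Enum):
    # Field types named in this document; the real list may be longer.
    TEXT = "TEXT"
    EMBEDDING = "EMBEDDING"
    LITERAL = "LITERAL"
    INT = "INT"
    DOUBLE = "DOUBLE"


@dataclass
class FieldConfig:
    source_column: str               # MySQL column (hypothetical attribute name)
    es_field: str                    # Elasticsearch field (hypothetical attribute name)
    field_type: FieldType
    analyzer: Optional[str] = None   # e.g. "english"; only meaningful for TEXT


@dataclass
class IndexerConfigSketch:
    tenant_id: str
    fields: list[FieldConfig] = field(default_factory=list)

    def fields_needing_preprocessing(self) -> list[FieldConfig]:
        """Return fields that must be computed before indexing (embeddings here)."""
        return [f for f in self.fields if f.field_type is FieldType.EMBEDDING]
```

A tenant config would then list one FieldConfig per indexed field, and the indexer could ask the config which fields need an embedding pass.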
Query Processing
The queryParser performs:
- Query Rewriting - Dictionary-based rewriting (brand terms, category terms, synonyms, corrections)
- Translation - Language detection and translation to support multi-language search (e.g., zh↔en)
- Text Embedding - Converts query text to vectors when vector search is enabled
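Dictionary-based rewriting, the first step above, might look like the following. This is a simplified sketch: the function name and the whitespace tokenization are assumptions, and whitespace splitting would not work for unsegmented Chinese queries.

```python
def rewrite_query(query: str, rewrite_dict: dict[str, str]) -> str:
    """Replace whole tokens found in rewrite_dict; leave other tokens unchanged."""
    tokens = query.lower().split()  # assumed tokenization: lowercase + whitespace
    return " ".join(rewrite_dict.get(tok, tok) for tok in tokens)
```

For example, with a correction dictionary {"sneekers": "sneakers"}, the query "Nike sneekers" would be rewritten to "nike sneakers".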
Search and Ranking
The searcher supports:
- Boolean operators: AND, OR, RANK, ANDNOT with parentheses
- Operator precedence: () > ANDNOT > AND > OR > RANK
- Configurable ranking expressions for the default domain:
  - Example: static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
  - Combines BM25 text relevance, embedding similarity, product scores, and time decay
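The operator precedence above can be illustrated with a small precedence-climbing parser. This is a sketch under assumptions, not the searcher's actual parser: operators are treated as left-associative, terms are single whitespace-delimited tokens, and the output is a nested tuple of the form (operator, left, right).

```python
import re

# Binding strength from the documented precedence: ANDNOT > AND > OR > RANK.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def tokenize(query: str) -> list[str]:
    """Split into parentheses and runs of non-space, non-parenthesis characters."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def parse(query: str):
    """Parse a boolean query into nested (op, left, right) tuples."""
    tokens = tokenize(query)
    pos = 0

    def parse_primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_expr(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok in PRECEDENCE:
            raise ValueError(f"unexpected token: {tok}")
        return tok

    def parse_expr(min_prec: int):
        nonlocal pos
        left = parse_primary()
        while (
            pos < len(tokens)
            and tokens[pos] in PRECEDENCE
            and PRECEDENCE[tokens[pos]] >= min_prec
        ):
            op = tokens[pos]
            pos += 1
            # Parsing the right side at a higher minimum makes the operator
            # left-associative (an assumption of this sketch).
            right = parse_expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse_expr(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return tree
```

With this sketch, parse("a OR b AND c") groups the AND first, and parentheses override that grouping. Adjacent terms with no operator between them are rejected rather than given an implicit operator.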
Embedding Modules
Text Embedding - Uses BGE-M3 model (Xorbits/bge-m3):
- Singleton pattern with thread-safe initialization
- Generates 1024-dimensional vectors
- Configured for GPU/CUDA acceleration
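A thread-safe singleton of the kind mentioned above is commonly written with double-checked locking. The sketch below uses a stand-in for model loading; the class name TextEncoder and the load_count counter are hypothetical, and no real BGE-M3 API is called.

```python
import threading


class TextEncoder:
    _instance = None
    _lock = threading.Lock()
    load_count = 0  # how many times the stand-in model was loaded

    def __new__(cls):
        if cls._instance is None:            # fast path without taking the lock
            with cls._lock:
                if cls._instance is None:    # re-check while holding the lock
                    instance = super().__new__(cls)
                    instance._load_model()
                    cls._instance = instance
        return cls._instance

    def _load_model(self) -> None:
        # Stand-in for the expensive step (loading BGE-M3 onto the GPU).
        type(self).load_count += 1
        self.dim = 1024  # vector size stated in this document
```

Every call to TextEncoder() returns the same object, so the expensive load happens once even when several worker threads request the encoder at the same time.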
Image Embedding - Uses CN-CLIP model (ViT-H-14):
- Downloads and validates images from URLs
- Preprocesses images (resize, RGB conversion)
- Generates 1024-dimensional vectors
- Supports both local and remote images
Test Data
Tenant1 Test Dataset:
- Location: data/tenant1/goods_with_pic.5years_congku.csv.shuf.1w
- Contains 10,000 shuffled product records with images
- Processing script: data/tenant1/task2_process_goods.py
  - Extracts product data from MySQL
  - Maps images from filebank database
  - Creates inverted index (URL → SKU list)
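The URL → SKU list inverted index mentioned above could be built roughly as follows. This is an illustrative sketch rather than the contents of task2_process_goods.py; the function name and the (sku_id, image_url) input shape are assumptions.

```python
from collections import defaultdict
from typing import Iterable


def build_url_to_skus(records: Iterable[tuple[int, str]]) -> dict[str, list[int]]:
    """Map each image URL to the SKUs that use it, preserving first-seen order."""
    index: dict[str, list[int]] = defaultdict(list)
    for sku_id, url in records:
        # Skip rows without an image URL and avoid duplicate SKU entries.
        if url and sku_id not in index[url]:
            index[url].append(sku_id)
    return dict(index)
```

An index of this shape lets each distinct image be embedded once, with the result shared by every SKU that references it.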
Key Implementation Notes
Data Sync: Full data sync from MySQL to Elasticsearch is handled by a separate Java project (not in this repo). This repo may include a simple full-load implementation for testing purposes.
Extension Tables: When designing tenant configurations, determine which fields exist in the main SKU table vs. which need to be added to tenant-specific extension tables.
Embedding Caching: For periodic full indexing, embedding results should be cached to avoid recomputation.
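One simple way to cache embeddings is to key them by a hash of the input text. The sketch below keeps the cache in memory and takes the embedding function as a parameter; the class name, the misses counter, and the choice of SHA-256 are assumptions. A periodic full-indexing job would more likely persist the cache to disk or a key-value store.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times embed_fn was actually called

    @staticmethod
    def _key(text: str) -> str:
        # Hash the text so keys have a fixed size regardless of input length.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Repeated calls to get() with the same text call the model only once, which is the saving that matters when most products are unchanged between full index runs.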
ES Similarity Configuration: All text fields use modified BM25 with b=0.0, k1=0.0 as the default similarity.
Multi-Language Support: The system is designed for cross-border e-commerce, supporting at minimum Chinese and English, and extensible to other languages (Arabic, Spanish, Russian, Japanese).
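To see what the b=0.0, k1=0.0 similarity setting noted above implies, the sketch below implements the textbook BM25 term score. It is not Elasticsearch's code, and the function name and parameter names are chosen for illustration. With k1=0 the term-frequency factor collapses to 1 for any matching term, so the score reduces to the IDF and no longer depends on term frequency or document length.

```python
import math


def bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    n_docs: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Textbook BM25 score of a single term in a single document."""
    if tf <= 0:
        return 0.0
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)
```

Under this setting a short title that mentions a term once and a long description that repeats it many times receive the same text score for that term, which leaves the other components of the ranking expression to separate them.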
Remember: activate this project's environment with: source /home/tw/miniconda3/etc/profile.d/conda.sh && conda activate searchengine