CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This is a Search Engine SaaS platform for e-commerce product search, specifically designed for Shoplazza (店匠) independent sites. It's a multi-tenant configurable search system built on Elasticsearch with AI-powered search capabilities.
Tech Stack:
- Elasticsearch 8.x as the search engine backend
- MySQL (Shoplazza database) as the primary data source
- Python 3.10 with PyTorch/CUDA support
- BGE-M3 model for text embeddings (1024-dim vectors)
- CN-CLIP (ViT-H-14) for image embeddings
- FastAPI for REST API layer
Development Environment
Required Environment Setup:
source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine
Database Configuration:
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R
Common Development Commands
Environment Setup
# Complete environment setup
./setup.sh
# Install Python dependencies
pip install -r requirements.txt
Data Management
# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh
# Ingest data to Elasticsearch
./scripts/ingest.sh <tenant_id> [recreate] # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50
Running Services
# Start all services (production)
./run.sh
# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload
# Start frontend debugging UI
./scripts/start_frontend.sh
Testing
# Run all tests
python -m pytest tests/
# Run specific test types
python -m pytest tests/unit/ # Unit tests
python -m pytest tests/integration/ # Integration tests
python -m pytest -m "api" # API tests only
# Test search from command line
python main.py search "query" --tenant-id 1 --size 10
Development Utilities
# Stop all services
./scripts/stop.sh
# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh
# Install server dependencies
./scripts/install_server_deps.sh
Architecture Overview
Core Components
/data/tw/SearchEngine/
├── api/ # FastAPI REST API service (port 6002)
├── config/ # Configuration management system
├── indexer/ # MySQL → Elasticsearch data pipeline
├── search/ # Search engine and ranking logic
├── query/ # Query parsing, translation, rewriting
├── embeddings/ # ML models (BGE-M3, CN-CLIP)
├── scripts/ # Automation and utility scripts
├── utils/ # Shared utilities (ES client, etc.)
├── frontend/ # Simple debugging UI
├── mappings/ # Elasticsearch index mappings
└── tests/ # Unit and integration tests
Data Flow Architecture
Pipeline: MySQL → Indexer → Elasticsearch → API → Frontend
Data Source Layer:
- Shoplazza MySQL database with shoplazza_product_sku and shoplazza_product_spu tables
- Tenant-specific extension tables for custom attributes and multi-language fields
Indexing Layer (indexer/):
- Reads from MySQL, applies transformations with embeddings
- Uses DataTransformer and IndexingPipeline for batch processing
- Supports both full and incremental indexing with embedding caching
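The batch processing step can be pictured with a minimal sketch. It is not the repository's DataTransformer or IndexingPipeline; the helper names, the id column, the document _id scheme, and the index name are all assumptions made for illustration. The action dictionaries follow the _index/_id/_source shape accepted by the Elasticsearch bulk helpers.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(rows: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield successive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def to_bulk_actions(batch: list[dict], index_name: str, tenant_id: str) -> list[dict]:
    """Turn rows into bulk actions, stamping each document with its tenant.

    The "<tenant_id>_<id>" document id is a hypothetical convention.
    """
    return [
        {
            "_index": index_name,
            "_id": f"{tenant_id}_{row['id']}",
            "_source": {**row, "tenant_id": tenant_id},
        }
        for row in batch
    ]
```

Each list returned by to_bulk_actions could then be handed to a bulk indexing call, one batch at a time, so memory use stays bounded by the batch size.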
Query Processing Layer (query/):
- QueryParser: Handles query rewriting, translation, and text embedding conversion
- Multi-language support with automatic detection and translation
- Boolean logic parsing with operator precedence: () > ANDNOT > AND > OR > RANK
Search Engine Layer (search/):
- Searcher: Executes hybrid searches combining BM25 and dense vectors
- Configurable ranking expressions with function_score support
- Multi-tenant isolation via tenant_id field
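As a rough illustration of how a hybrid request with tenant isolation might be assembled, the sketch below builds an Elasticsearch 8.x search body that pairs a BM25 match query with the knn search option and applies the same tenant_id filter to both. It is not the Searcher implementation; the title and text_embedding field names and the k/num_candidates defaults are assumptions.

```python
def build_hybrid_query(
    tenant_id: str,
    query_text: str,
    query_vector: list[float],
    size: int = 10,
    k: int = 50,
    num_candidates: int = 200,
) -> dict:
    """Build a search body combining BM25 text matching and kNN retrieval."""
    # The same filter is applied on both sides so neither the lexical
    # nor the vector branch can return another tenant's documents.
    tenant_filter = {"term": {"tenant_id": tenant_id}}
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"title": query_text}}],  # hypothetical field
                "filter": [tenant_filter],
            }
        },
        "knn": {
            "field": "text_embedding",  # hypothetical dense_vector field
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
            "filter": tenant_filter,
        },
    }
```

The resulting dictionary would be passed as the body of a search call against the shared index.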
API Layer (api/):
- FastAPI service on port 6002 with multi-tenant support
- Text search: POST /search/
- Image search: POST /image-search/
- Tenant identification via X-Tenant-ID header
Multi-Tenant Configuration System
The system uses centralized configuration through config/config.yaml:
Field Configuration (config/field_types.py):
- Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
- Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
- Required fields and preprocessing rules
Index Configuration (mappings/search_products.json):
- Unified index structure shared by all tenants
- Elasticsearch field mappings and analyzer configurations
- BM25 similarity with modified parameters (b=0.0, k1=0.0)
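To see what these parameters do, here is a small sketch of the textbook per-term BM25 score. With k1 = 0 the term-frequency and length-normalization parts cancel, so every matching term contributes only its IDF, however often it occurs and however long the field is. The function is illustrative and is not taken from the repository; its defaults are the Elasticsearch defaults (k1 = 1.2, b = 0.75).

```python
def bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    idf: float,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Textbook BM25 contribution of one query term to one document."""
    if tf <= 0:
        return 0.0
    # Length normalization; it vanishes entirely when k1 == 0.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

Calling bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=0.0, b=0.0) returns idf for any positive tf, which shows that under this configuration BM25 acts as a presence-based, IDF-weighted match score.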
Query Configuration (search/query_config.py):
- Query domain definitions (default, category_name, title, brand_name, etc.)
- Ranking expressions and function_score configurations
- Translation and embedding settings
Embedding Models
Text Embedding (embeddings/bge_encoder.py):
- Uses BGE-M3 model (Xorbits/bge-m3)
- Singleton pattern with thread-safe initialization
- Generates 1024-dimensional vectors with GPU/CUDA support
- Configurable caching to avoid recomputation
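The singleton and caching behaviour described above could look roughly like the following sketch. It uses double-checked locking and an in-memory dictionary cache, and it takes the encoding function as a parameter so no model is loaded here. The class name CachedEncoder and its methods are hypothetical and do not reflect the actual contents of embeddings/bge_encoder.py.

```python
import threading
from typing import Callable, Optional


class CachedEncoder:
    """Process-wide encoder wrapper with a simple text -> vector cache."""

    _instance: Optional["CachedEncoder"] = None
    _lock = threading.Lock()

    def __init__(self, encode_fn: Callable[[str], list[float]]):
        self._encode_fn = encode_fn
        self._cache: dict[str, list[float]] = {}
        self._cache_lock = threading.Lock()

    @classmethod
    def get_instance(cls, encode_fn: Callable[[str], list[float]]) -> "CachedEncoder":
        # Double-checked locking: the lock is only taken on first use.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(encode_fn)
        return cls._instance

    def encode(self, text: str) -> list[float]:
        with self._cache_lock:
            cached = self._cache.get(text)
        if cached is not None:
            return cached
        # Computed outside the lock so slow model calls do not block readers.
        vector = self._encode_fn(text)
        with self._cache_lock:
            return self._cache.setdefault(text, vector)
```

Computing outside the lock means two threads may occasionally encode the same text at once; setdefault keeps whichever result landed first, which trades a little duplicate work for less contention.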
Image Embedding (embeddings/clip_encoder.py):
- Uses CN-CLIP model (ViT-H-14)
- Downloads and preprocesses images from URLs
- Supports both local and remote image processing
- Generates 1024-dimensional vectors
Search and Ranking
Hybrid Search Approach:
- Combines traditional BM25 text relevance with dense vector similarity
- Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
- Configurable ranking expressions like:
static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
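A worked version of the expression above may help when tuning weights. The sketch assumes each component has already been computed as a plain number per document; how static_bm25(), text_embedding_relevance(), and timeliness(end_time) are actually evaluated is not specified here, so the parameters are placeholders.

```python
def ranking_score(
    static_bm25: float,
    text_embedding_relevance: float,
    general_score: float,
    timeliness: float,
) -> float:
    """Combine precomputed components using the weights from the expression."""
    return (
        static_bm25
        + 0.2 * text_embedding_relevance
        + general_score * 2
        + timeliness
    )
```

For example, components of 1.0, 0.5, 0.25, and 0.1 sum as 1.0 + 0.1 + 0.5 + 0.1, so the embedding term contributes a fifth of its raw value while general_score counts double.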
Boolean Search Support:
- Full boolean logic with AND, OR, ANDNOT, RANK operators
- Parentheses for complex query structures
- Configurable operator precedence
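One way to honour the precedence () > ANDNOT > AND > OR > RANK is a small precedence-climbing parser. The sketch below only builds a nested-tuple syntax tree and attaches no search semantics to the operators. It is not the repository's QueryParser, it treats all operators as left-associative, and it rejects adjacent terms without an explicit operator, all of which are simplifying assumptions.

```python
import re

# Higher number binds tighter, following the documented precedence.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def tokenize(query: str) -> list[str]:
    # Parentheses are separate tokens; everything else splits on whitespace.
    return re.findall(r"\(|\)|[^\s()]+", query)


def parse(query: str):
    tokens = tokenize(query)
    pos = 0

    def parse_primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_expr(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok in PRECEDENCE:
            raise ValueError(f"unexpected token: {tok}")
        return tok

    def parse_expr(min_prec: int):
        nonlocal pos
        left = parse_primary()
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            # Left-associative: the right operand must bind strictly tighter.
            right = parse_expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse_expr(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return tree
```

With this grammar, "a OR b AND c ANDNOT d" groups as a OR (b AND (c ANDNOT d)), and parentheses override that grouping.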
Faceted Search:
- Terms and range faceting support
- Multi-dimensional filtering capabilities
- Configurable facet fields and aggregations
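Terms and range facets map naturally onto Elasticsearch terms and range aggregations. The helper below is a sketch of how such an aggs section might be generated from a facet configuration; the function name, the aggregation naming scheme, and the example field names are assumptions.

```python
def build_facet_aggs(
    terms_fields: list[str],
    range_facets: dict[str, list[dict]],
    size: int = 10,
) -> dict:
    """Build the "aggs" section for terms and range facets."""
    aggs: dict[str, dict] = {}
    for field in terms_fields:
        # One bucket per distinct value, limited to the top `size` values.
        aggs[f"{field}_terms"] = {"terms": {"field": field, "size": size}}
    for field, ranges in range_facets.items():
        # Each range is a dict with optional "from" and "to" keys.
        aggs[f"{field}_ranges"] = {"range": {"field": field, "ranges": ranges}}
    return aggs
```

A call such as build_facet_aggs(["brand_name"], {"price": [{"to": 50}, {"from": 50}]}) yields one terms aggregation and one two-bucket range aggregation that can be attached to a search body under the aggs key.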
Testing Infrastructure
Test Framework: pytest with async support
Test Structure:
- tests/conftest.py: Comprehensive test fixtures and configuration
- tests/unit/: Unit tests for individual components
- tests/integration/: Integration tests for system workflows
- Test markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.api
Test Data:
- Tenant1: Mock data with 10,000 product records
- Tenant2: CSV-based test dataset
- Automated test data generation via scripts/mock_data.sh
Key Test Fixtures (from conftest.py):
- sample_search_config: Complete configuration for testing
- mock_es_client: Mocked Elasticsearch client
- test_searcher: Searcher instance with mock dependencies
- temp_config_file: Temporary YAML configuration for tests
API Endpoints
Main API (FastAPI on port 6002):
- POST /search/ - Text search with multi-language support
- POST /image-search/ - Image search using CN-CLIP embeddings
- Health check and management endpoints
- Multi-tenant support via X-Tenant-ID header
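A client call to the text search endpoint might be prepared as in the sketch below, using only the standard library. The endpoint path, port, and X-Tenant-ID header come from the list above; the JSON body fields query and size are assumptions about the request schema and should be checked against the API models.

```python
import json
import urllib.request


def build_search_request(
    base_url: str, tenant_id: str, query: str, size: int = 10
) -> urllib.request.Request:
    """Prepare a POST /search/ request for one tenant (does not send it)."""
    # Body field names are hypothetical; adjust to the real request model.
    payload = json.dumps({"query": query, "size": size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search/",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "X-Tenant-ID": tenant_id,
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it to a running server, for example one started on http://localhost:6002.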
API Features:
- Hybrid search combining text and vector similarity
- Configurable ranking and filtering
- Faceted search with aggregations
- Multi-language query processing and translation
- Real-time search with configurable result sizes
Key Implementation Details
- Environment Variables: All sensitive configuration stored in .env (template: .env.example)
- Configuration Management: Dynamic config loading through config/config_loader.py
- Error Handling: Comprehensive error handling with proper HTTP status codes
- Performance: Batch processing for indexing, embedding caching, and connection pooling
- Logging: Structured logging with request tracing for debugging
- Security: Tenant isolation enforced by filtering on the tenant_id field within the shared index, with proper access controls
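Keeping credentials in the environment rather than in source or documentation could be handled along the lines of this sketch. The variable names (DB_HOST, DB_PORT, and so on) and the DbSettings class are hypothetical; the real keys are whatever .env.example defines, and loading the .env file into the process environment is left out.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DbSettings:
    host: str
    port: int
    database: str
    username: str
    password: str


REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_DATABASE", "DB_USERNAME", "DB_PASSWORD")


def load_db_settings(env: Optional[Mapping[str, str]] = None) -> DbSettings:
    """Read database settings from the environment, failing fast if any are missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return DbSettings(
        host=env["DB_HOST"],
        port=int(env["DB_PORT"]),
        database=env["DB_DATABASE"],
        username=env["DB_USERNAME"],
        password=env["DB_PASSWORD"],
    )
```

Accepting an optional mapping makes the loader easy to exercise in tests without touching the real process environment.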
Database Tables
Main Tables:
- shoplazza_product_sku - SKU level product data with pricing and inventory
- shoplazza_product_spu - SPU level product data with categories and attributes
- Tenant extension tables for custom fields and multi-language content
Data Processing:
- Full data sync handled by separate Java project (not in this repo)
- This repository includes test implementations for development and debugging
- Extension tables joined with main tables during indexing process