

CLAUDE.md
This file provides comprehensive guidance for Claude Code (claude.ai/code) when working with this Search Engine codebase.
Project Overview
This is a production-ready multi-tenant e-commerce search SaaS platform built for Shoplazza (店匠) independent sites. It combines traditional keyword-based search with AI/ML capabilities and serves multiple tenants from a unified infrastructure.

Core Architecture Philosophy:


Unified Multi-Tenant Design: Single Elasticsearch index with tenant isolation via tenant_id
Hybrid Search Engine: BM25 text relevance + Dense vector similarity (BGE-M3)
SPU-Centric Indexing: Product-level indexing with nested SKU structures
Production-Grade: Comprehensive error handling, monitoring, and operational features


Tech Stack:


Search Backend: Elasticsearch 8.x with custom BM25 similarity (b=0.0, k1=0.0)
Data Source: MySQL (Shoplazza database) with custom data transformers
Backend Framework: FastAPI with async support and comprehensive middleware
ML/AI Models: BGE-M3 for text embeddings (1024-dim), CN-CLIP for image embeddings (1024-dim)
Language Processing: Multi-language support (Chinese, English, Russian) with DeepL API
API Layer: RESTful FastAPI service on port 6002 with auto-generated documentation
Frontend: Debugging UI on port 6003 with real-time search capabilities

Development Environment
Required Environment Setup:
source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine


Database Configuration:
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R


Service Endpoints:


Backend API: http://localhost:6002 (FastAPI)
Frontend UI: http://localhost:6003 (Debug interface)
Elasticsearch: http://localhost:9200
API Documentation: http://localhost:6002/docs

Common Development Commands
Environment Setup
# Complete environment setup
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

Data Management
# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh

# Ingest data to Elasticsearch
./scripts/ingest.sh <tenant_id> [recreate]  # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50

Running Services
# Start all services (production)
./run.sh

# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload

# Start frontend debugging UI
./scripts/start_frontend.sh

Testing
# Run all tests
python -m pytest tests/

# Run specific test types
python -m pytest tests/unit/          # Unit tests
python -m pytest tests/integration/   # Integration tests
python -m pytest -m "api"             # API tests only

# Test search from command line
python main.py search "query" --tenant-id 1 --size 10

Development Utilities
# Stop all services
./scripts/stop.sh

# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh

# Install server dependencies
./scripts/install_server_deps.sh

Architecture Overview
Core Components
/data/tw/SearchEngine/
├── api/              # FastAPI REST API service (port 6002)
├── config/           # Configuration management system
├── indexer/          # MySQL → Elasticsearch data pipeline
├── search/           # Search engine and ranking logic
├── query/            # Query parsing, translation, rewriting
├── embeddings/       # ML models (BGE-M3, CN-CLIP)
├── scripts/          # Automation and utility scripts
├── utils/            # Shared utilities (ES client, etc.)
├── frontend/         # Simple debugging UI
├── mappings/         # Elasticsearch index mappings
└── tests/            # Unit and integration tests

Data Flow Architecture
Pipeline: MySQL → Indexer → Elasticsearch → API → Frontend


Data Source Layer:


Shoplazza MySQL database with shoplazza_product_sku and shoplazza_product_spu tables
Tenant-specific extension tables for custom attributes and multi-language fields

Indexing Layer (indexer/):


Reads from MySQL, applies transformations with embeddings
Uses DataTransformer and IndexingPipeline for batch processing
Supports both full and incremental indexing with embedding caching

Query Processing Layer (query/):


QueryParser: Handles query rewriting, translation, and text embedding conversion
Multi-language support with automatic detection and translation
Boolean logic parsing with operator precedence: () > ANDNOT > AND > OR > RANK

Search Engine Layer (search/):


Searcher: Executes hybrid searches combining BM25 and dense vectors
Configurable ranking expressions with function_score support
Multi-tenant isolation via tenant_id field

API Layer (api/):


FastAPI service on port 6002 with multi-tenant support
Text search: POST /search/
Image search: POST /image-search/
Tenant identification via X-Tenant-ID header
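
A minimal sketch of how the Search Engine Layer's hybrid request with tenant isolation might be assembled for Elasticsearch 8.x. The function name build_hybrid_query, the chosen text fields, and the num_candidates heuristic are illustrative assumptions, not the project's actual Searcher code; tenant_id and title_embedding come from the index mapping, and the 0.2 weight mirrors the ranking formula in config.

```python
from typing import Any, Dict, List, Optional


def build_hybrid_query(
    tenant_id: str,
    query_text: str,
    query_vector: Optional[List[float]] = None,
    size: int = 10,
) -> Dict[str, Any]:
    """Build a search body: BM25 text match plus optional kNN, both tenant-filtered."""
    tenant_filter = {"term": {"tenant_id": tenant_id}}
    body: Dict[str, Any] = {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": query_text,
                            # Assumed field list; boosts follow field_boosts in config.
                            "fields": ["title_zh^3.0", "title_en^3.0"],
                        }
                    }
                ],
                # The filter clause restricts results without affecting scores.
                "filter": [tenant_filter],
            }
        },
    }
    if query_vector is not None:
        body["knn"] = {
            "field": "title_embedding",
            "query_vector": query_vector,
            "k": size,
            "num_candidates": max(50, size * 5),  # assumed heuristic
            "filter": tenant_filter,  # the vector side must be tenant-scoped too
            "boost": 0.2,
        }
    return body
```

The tenant filter is applied to both the text query and the kNN clause; filtering only one of them would let the other return documents from other tenants.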


Multi-Tenant Configuration System
The system uses centralized configuration through config/config.yaml:


Field Configuration (config/field_types.py):


Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
Required fields and preprocessing rules

Index Configuration (mappings/search_products.json):


Unified index structure shared by all tenants
Elasticsearch field mappings and analyzer configurations
BM25 similarity with modified parameters (b=0.0, k1=0.0)

Query Configuration (search/query_config.py):


Query domain definitions (default, category_name, title, brand_name, etc.)
Ranking expressions and function_score configurations
Translation and embedding settings


Embedding Models
Text Embedding (embeddings/bge_encoder.py):


Uses BGE-M3 model (Xorbits/bge-m3)
Singleton pattern with thread-safe initialization
Generates 1024-dimensional vectors with GPU/CUDA support
Configurable caching to avoid recomputation


Image Embedding (embeddings/clip_encoder.py):


Uses CN-CLIP model (ViT-H-14)
Downloads and preprocesses images from URLs
Supports both local and remote image processing
Generates 1024-dimensional vectors
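
The singleton and caching behaviour described for the text encoder can be sketched as follows. This is an illustration only: EncoderSingleton and the injected encode_fn are hypothetical names, and the real embeddings/bge_encoder.py loads the BGE-M3 model instead of taking a callable.

```python
import threading
from typing import Callable, Dict, List, Optional


class EncoderSingleton:
    """One shared encoder per process, created lazily and safely across threads."""

    _instance: Optional["EncoderSingleton"] = None
    _lock = threading.Lock()

    def __init__(self, encode_fn: Callable[[str], List[float]]) -> None:
        self._encode_fn = encode_fn  # stand-in for the expensive model call
        self._cache: Dict[str, List[float]] = {}
        self._cache_lock = threading.Lock()

    @classmethod
    def get_instance(cls, encode_fn: Callable[[str], List[float]]) -> "EncoderSingleton":
        if cls._instance is None:  # fast path without locking
            with cls._lock:
                if cls._instance is None:  # re-check once the lock is held
                    cls._instance = cls(encode_fn)
        return cls._instance

    def encode(self, text: str) -> List[float]:
        with self._cache_lock:
            cached = self._cache.get(text)
        if cached is not None:
            return cached
        vector = self._encode_fn(text)  # computed outside the lock
        with self._cache_lock:
            self._cache[text] = vector
        return vector
```

Double-checked locking keeps the common path lock-free once the instance exists, which matters because model loading is slow and should happen only once.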

Search and Ranking
Hybrid Search Approach:


Combines traditional BM25 text relevance with dense vector similarity
Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
Configurable ranking expressions like: static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)


Boolean Search Support:


Full boolean logic with AND, OR, ANDNOT, RANK operators
Parentheses for complex query structures
Configurable operator precedence


Faceted Search:


Terms and range faceting support
Multi-dimensional filtering capabilities
Configurable facet fields and aggregations

Testing Infrastructure
Test Framework: pytest with async support

Test Structure:


tests/conftest.py: Comprehensive test fixtures and configuration
tests/unit/: Unit tests for individual components
tests/integration/: Integration tests for system workflows
Test markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.api


Test Data:


Tenant1: Mock data with 10,000 product records
Tenant2: CSV-based test dataset
Automated test data generation via scripts/mock_data.sh


Key Test Fixtures (from conftest.py):


sample_search_config: Complete configuration for testing
mock_es_client: Mocked Elasticsearch client
test_searcher: Searcher instance with mock dependencies
temp_config_file: Temporary YAML configuration for tests

API Endpoints
Main API (FastAPI on port 6002):


POST /search/ - Text search with multi-language support
POST /image-search/ - Image search using CN-CLIP embeddings
Health check and management endpoints
Multi-tenant support via X-Tenant-ID header


API Features:


Hybrid search combining text and vector similarity
Configurable ranking and filtering
Faceted search with aggregations
Multi-language query processing and translation
Real-time search with configurable result sizes

Core System Architecture & Design
Unified Multi-Tenant Index Structure
Index Design Philosophy: Single search_products index shared by all tenants with data isolation via tenant_id field.

Key Benefits:


Resource efficiency and cost optimization
Simplified maintenance and operations
Better query performance through optimized sharding
Easier cross-tenant analytics and monitoring

Index Structure (from mappings/search_products.json)
Core Document Structure (SPU-level):
{
  "tenant_id": "keyword",           // Multi-tenant isolation
  "spu_id": "keyword",              // Product identifier
  "title_zh/en": "text",            // Multi-language titles
  "brief_zh/en": "text",            // Short descriptions
  "description_zh/en": "text",      // Detailed descriptions
  "vendor_zh/en": "text",           // Supplier/brand with keyword subfield
  "category_path_zh/en": "text",    // Hierarchical category paths
  "category_name_zh/en": "text",    // Category names for search
  "category1/2/3_name": "keyword",  // Multi-level category filtering
  "tags": "keyword",                // Product tags
  "specifications": "nested",       // Product variants (color, size, etc.)
  "skus": "nested",                 // Detailed SKU information
  "min_price/max_price": "float",   // Price range calculations
  "title_embedding": "dense_vector", // 1024-dim semantic vectors
  "image_embedding": "nested",      // Image vectors for visual search
  "total_inventory": "long"         // Aggregate inventory
}


Analyzers Configuration:


Chinese fields: hanlp_index (indexing) / hanlp_standard (searching)
English fields: english analyzer
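BM25 Similarity: Custom parameters (b=0.0, k1=0.0). Setting b=0 turns off document-length normalization and k1=0 removes term-frequency weighting, so each matching term contributes only its IDF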

Data Source Architecture (from 索引字段说明v2-参考表结构.md)
Primary MySQL Tables:

shoplazza_product_spu (Product Level):
- id, shop_id, shoplazza_id, handle
- title, brief, description, vendor
- category, category_id, category_level, category_path
- image_src, image_width, image_height
- tags, fake_sales, published
- inventory_policy, inventory_quantity
- seo_title, seo_description, seo_keywords
- tenant_id, create_time, update_time


shoplazza_product_sku (Variant Level):
- id, spu_id, shop_id, shoplazza_id
- title, sku, barcode
- price, compare_at_price, cost_price
- option1, option2, option3 (variant values)
- inventory_quantity, weight, weight_unit
- image_src, wholesale_price, extend
- tenant_id, create_time, update_time


Data Transformation Pipeline:


SPU-Centric Aggregation: Group SKUs under parent SPU
Multi-Language Field Mapping: MySQL → ES bilingual fields
Category Path Parsing: Extract hierarchical categories
Specifications Building: Create nested variant structures
Price Range Calculation: min/max across all SKUs
Vector Generation: BGE-M3 embeddings for titles
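
The aggregation and price-range steps above can be sketched in a few lines. This is a simplified illustration, not the project's DataTransformer: the function names are hypothetical, rows are assumed to be plain dicts keyed by the MySQL column names listed earlier, and bilingual mapping, category parsing and embeddings are omitted.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_skus_by_spu(sku_rows: Iterable[Dict[str, Any]]) -> Dict[Any, List[Dict[str, Any]]]:
    """Group flat SKU rows under their parent SPU id."""
    grouped: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for row in sku_rows:
        grouped[row["spu_id"]].append(row)
    return grouped


def build_spu_document(spu_row: Dict[str, Any], sku_rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build one SPU-level document with nested SKUs and flattened aggregates."""
    prices = [float(s["price"]) for s in sku_rows if s.get("price") is not None]
    return {
        "tenant_id": str(spu_row["tenant_id"]),
        "spu_id": str(spu_row["id"]),
        "skus": [
            {
                "sku_id": str(s["id"]),
                "price": s.get("price"),
                "option1": s.get("option1"),
                "option2": s.get("option2"),
                "option3": s.get("option3"),
            }
            for s in sku_rows
        ],
        # Flattened fields so filtering and sorting avoid nested queries.
        "min_price": min(prices) if prices else None,
        "max_price": max(prices) if prices else None,
        "total_inventory": sum(int(s.get("inventory_quantity") or 0) for s in sku_rows),
    }
```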

Advanced Search Configuration (from config/config.yaml)
Field Boost Configuration:
field_boosts:
  title_zh/en: 3.0              # Highest priority
  brief_zh/en: 1.5              # Medium priority
  description_zh/en: 1.0        # Lower priority
  vendor_zh/en: 1.5             # Brand emphasis
  category_path_zh/en: 1.5      # Category relevance
  tags: 1.0                     # Tag matching


Search Domains:


default: Comprehensive search across all text fields
title: Title-focused search (boost: 2.0)
vendor: Brand-specific search (boost: 1.5)
category: Category-focused search (boost: 1.5)
tags: Tag-based search (boost: 1.0)


Query Processing Features:
query_config:
  supported_languages: ["zh", "en", "ru"]
  enable_translation: true        # DeepL API integration
  enable_text_embedding: true     # BGE-M3 vector search
  enable_query_rewrite: true      # Dictionary-based expansion
  embedding_disable_thresholds:
    chinese_char_limit: 4        # Short query optimization
    english_word_limit: 3
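
One plausible reading of embedding_disable_thresholds, sketched below: vector search is skipped when a query has at most chinese_char_limit CJK characters, or at most english_word_limit whitespace-separated words. Whether the real comparison is inclusive, and how mixed-language queries are handled, is an assumption here; should_use_embedding is a hypothetical helper name.

```python
import re

# Basic CJK Unified Ideographs block; enough for a rough character count.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def should_use_embedding(
    query: str,
    chinese_char_limit: int = 4,
    english_word_limit: int = 3,
) -> bool:
    """Return False for short queries, where plain BM25 is assumed to suffice."""
    text = query.strip()
    if not text:
        return False
    cjk_count = len(CJK_CHAR.findall(text))
    if cjk_count > 0:
        # Treat any query containing CJK characters as Chinese (assumption).
        return cjk_count > chinese_char_limit
    return len(text.split()) > english_word_limit
```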


Ranking Formula:
bm25() + 0.2*text_embedding_relevance()
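
As a worked illustration of the formula, the sketch below combines a BM25 score with a weighted embedding relevance. It assumes text_embedding_relevance() behaves like cosine similarity between query and document vectors, which the source does not confirm; both function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_score(
    bm25_score: float,
    query_vec: Sequence[float],
    doc_vec: Sequence[float],
    embedding_weight: float = 0.2,
) -> float:
    """bm25() + 0.2 * text_embedding_relevance(), with cosine as the relevance."""
    return bm25_score + embedding_weight * cosine_similarity(query_vec, doc_vec)
```

With a BM25 score of 3.0 and identical query and document vectors, the embedding term adds at most 0.2, so under this assumption the vector signal mainly reorders documents with similar text scores.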

Sophisticated Query Processing Pipeline
Multi-Language Search Architecture:


Query Normalization: Clean and standardize input
Language Detection: Automatic identification (zh/en/ru)
Query Rewriting: Dictionary-based expansion and synonyms
Translation Service: DeepL API for cross-language search
Vector Generation: BGE-M3 embeddings for semantic search
Boolean Parsing: Complex expression evaluation


Boolean Expression Support:


Operators: AND, OR, ANDNOT, RANK, parentheses
Precedence: () > ANDNOT > AND > OR > RANK
Example: 玩具 AND (乐高 OR 芭比) ANDNOT 电动 (in English: toys AND (LEGO OR Barbie) ANDNOT electric)
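
The precedence rules above can be illustrated with a small precedence-climbing parser. This is a sketch, not the project's QueryParser: it assumes single-token terms, left-associative operators, and a nested-tuple output format chosen only for the example.

```python
import re
from typing import List, Tuple, Union

Node = Union[str, Tuple[str, "Node", "Node"]]

# Higher number binds tighter: ANDNOT > AND > OR > RANK.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def tokenize(expr: str) -> List[str]:
    """Split on whitespace, keeping parentheses as separate tokens."""
    return re.findall(r"\(|\)|[^\s()]+", expr)


def parse(expr: str) -> Node:
    """Parse a boolean expression into nested (operator, left, right) tuples."""
    tokens = tokenize(expr)
    pos = 0

    def parse_primary() -> Node:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_expr(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok in PRECEDENCE:
            raise ValueError(f"unexpected token: {tok}")
        return tok

    def parse_expr(min_prec: int) -> Node:
        nonlocal pos
        left = parse_primary()
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            # Left-associative: the right operand must bind strictly tighter.
            right = parse_expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = parse_expr(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return tree
```

Under these assumptions, parsing the example query groups ANDNOT before AND, giving ("AND", "玩具", ("ANDNOT", ("OR", "乐高", "芭比"), "电动")).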

E-Commerce Specialized Features
Specifications System (Product Variants):
"specifications": [
  {
    "sku_id": "sku_123",
    "name": "color",
    "value": "white"
  },
  {
    "sku_id": "sku_123",
    "name": "size",
    "value": "256GB"
  }
]


Advanced Filtering Logic:


Different dimensions (different name): AND relationship
Same dimension (same name): OR relationship
Example: (color=white OR color=black) AND size=256GB
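
A sketch of how this AND/OR rule might translate into Elasticsearch nested filters. It assumes specifications.name and specifications.value are keyword sub-fields of the nested specifications field; build_spec_filters is a hypothetical helper, not the project's implementation.

```python
from collections import defaultdict
from typing import Any, Dict, List


def build_spec_filters(specs: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Return one nested clause per dimension; values within a dimension are ORed."""
    values_by_name: Dict[str, List[str]] = defaultdict(list)
    for spec in specs:
        if spec["value"] not in values_by_name[spec["name"]]:
            values_by_name[spec["name"]].append(spec["value"])

    return [
        {
            "nested": {
                "path": "specifications",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"specifications.name": name}},
                            # "terms" matches any listed value (OR semantics).
                            {"terms": {"specifications.value": values}},
                        ]
                    }
                },
            }
        }
        for name, values in values_by_name.items()
    ]
```

Placing the returned clauses in a bool filter list ANDs the dimensions together, so the example above yields two clauses: one for color with two values, one for size.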


Faceted Search Capabilities:


Category Faceting: Multi-level category aggregations
Specifications Faceting: Nested aggregations by variant name
Range Faceting: Price ranges, date ranges
Multi-Select Support: Disjunctive faceting for filters


SKU Filtering System:


Dimension-based Grouping: Filter by option1/2/3 or specification names
Application Layer: Filtering runs in the application layer after ES returns results, rather than inside ES, for performance
Use Case: Display one SKU per variant combination (e.g., one per color)
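
The "one SKU per variant combination" behaviour can be sketched as a simple application-side de-duplication. The SKU shape (a flat dict with one key per dimension) and the function name are assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence


def filter_skus_by_dimension(
    skus: List[Dict[str, Any]],
    dimensions: Sequence[str],
) -> List[Dict[str, Any]]:
    """Keep the first SKU seen for each distinct combination of dimension values."""
    seen = set()
    result: List[Dict[str, Any]] = []
    for sku in skus:
        key = tuple(sku.get(dim) for dim in dimensions)
        if key not in seen:
            seen.add(key)
            result.append(sku)
    return result
```

For example, filtering on ["color"] keeps one SKU per color regardless of size, while ["color", "size"] keeps one per color and size pair.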

API Architecture & Usage (from 搜索API对接指南.md)
Core API Endpoints:
POST /search/                    # Main text search
POST /search/image              # Image search (CN-CLIP)
GET /search/{doc_id}            # Document retrieval
GET /admin/health              # Health check
GET /admin/config              # Configuration info
GET /admin/stats               # Index statistics


Request Structure:
{
  "query": "string (required)",
  "size": 10, "from": 0,
  "language": "zh|en",
  "filters": {}, "range_filters": {},
  "facets": [],
  "sort_by": "price|sales|create_time",
  "sort_order": "asc|desc",
  "sku_filter_dimension": ["color", "size"],
  "min_score": 0.0,
  "debug": false
}
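
A minimal client-side sketch of the request above, using only the standard library. build_search_request is a hypothetical helper; the base URL, the /search/ path and the X-Tenant-ID header come from this document, and the default size and from values mirror the request structure.

```python
import json
import urllib.request
from typing import Any


def build_search_request(
    tenant_id: Any,
    query: str,
    base_url: str = "http://localhost:6002",
    **options: Any,
) -> urllib.request.Request:
    """Prepare (but do not send) a POST /search/ request for one tenant."""
    payload = {"query": query, "size": 10, "from": 0, **options}
    return urllib.request.Request(
        url=f"{base_url}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Tenant-ID": str(tenant_id),
        },
        method="POST",
    )
```

With the backend running, the request can be sent with urllib.request.urlopen(build_search_request(1, "玩具", language="zh")) and the JSON response read from the returned object.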


Advanced Filter Examples:

Specifications Filtering:
{
  "filters": {
    "specifications": {
      "name": "color",
      "value": "white"
    }
  }
}


Multi-Dimension Specifications:
{
  "filters": {
    "specifications": [
      {"name": "color", "value": "white"},
      {"name": "size", "value": "256GB"}
    ]
  }
}


Range Filtering:
{
  "range_filters": {
    "min_price": {"gte": 50, "lte": 200},
    "create_time": {"gte": "2024-01-01T00:00:00Z"}
  }
}


Faceted Search Configuration:
{
  "facets": [
    {"field": "category1_name", "size": 15, "type": "terms"},
    {"field": "specifications.color", "size": 20, "type": "terms"},
    {"field": "min_price", "type": "range", "ranges": [...]}
  ]
}

Multi-Select Faceting (NEW FEATURE)
Standard Mode (multi_select: false):


Behavior: Selected value becomes the only option shown
Use Case: Hierarchical category navigation
Example: Toys → Dolls → Barbie


Multi-Select Mode (multi_select: true) ⭐:


Behavior: All options remain visible after selection
Use Case: Colors, brands, sizes (switchable attributes)
Example: Select "red" but still see "blue", "green", etc.


Recommended Configuration:
| Facet Type | Multi-Select | Reason |
|-----------|-------------|---------|
| Color | true | Users need to switch colors |
| Brand | true | Users need to compare brands |
| Size | true | Users need to check other sizes |
| Category | false | Hierarchical navigation |
| Price Range | false | Mutually exclusive ranges |
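
Multi-select (disjunctive) faceting is commonly built in Elasticsearch by moving the selected filters into post_filter and computing each facet under every selected filter except its own. The sketch below follows that pattern; it is an assumption about how such a feature could be implemented, and build_faceted_body and the "values" sub-aggregation name are hypothetical.

```python
from typing import Any, Dict, List


def build_faceted_body(
    facets: List[Dict[str, Any]],
    selected: Dict[str, List[str]],
) -> Dict[str, Any]:
    """Build post_filter and per-facet aggregations for standard and multi-select facets."""
    aggs: Dict[str, Any] = {}
    for facet in facets:
        field = facet["field"]
        # A multi-select facet ignores its own selection so other options stay visible.
        other_filters = [
            {"terms": {name: values}}
            for name, values in selected.items()
            if not (facet.get("multi_select") and name == field)
        ]
        aggs[field] = {
            "filter": {"bool": {"filter": other_filters}},
            "aggs": {
                "values": {"terms": {"field": field, "size": facet.get("size", 10)}}
            },
        }
    all_filters = [{"terms": {name: values}} for name, values in selected.items()]
    # post_filter narrows the hits without affecting the aggregations.
    return {"post_filter": {"bool": {"filter": all_filters}}, "aggs": aggs}
```

With multi_select set to false, the facet is computed under its own selection as well, which is why only the selected value remains in standard mode.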
Production Features & Operations
Comprehensive Error Handling:


Graceful degradation for model failures
Fallback mechanisms for service unavailability
Detailed error logging and context tracking


Performance Optimizations:


Embedding caching to avoid redundant computations
Adaptive vector search (disabled for short queries)
Batch processing for data indexing
Connection pooling for database operations


Security & Multi-Tenancy:


Strict tenant isolation via tenant_id filtering
Rate limiting with SlowAPI integration
CORS and security headers middleware
Request context logging for auditing


Monitoring & Observability:


Structured logging with request tracing
Health check endpoints for all dependencies
Performance metrics and timing information
Index statistics and document counts

Data Model Insights
Key Design Decisions:


SPU over SKU Indexing: Each ES document represents a product (SPU) with nested SKUs


Reduces index size and improves search performance
Maintains variant information through nested structures

Bilingual Field Strategy: Separate *_zh and *_en fields


Enables language-specific analyzer configuration
Provides fallback mechanisms for missing translations

Nested vs Flat Design: Strategic use of nested vs flattened fields


specifications and skus: Nested for complex queries
min_price, total_inventory: Flattened for filtering/sorting

Vector Field Isolation: Embedding fields only used for search


Not returned in API responses (index: false where appropriate)
Reduces network payload and improves response times


AI/ML Integration Details
Text Embedding Pipeline:


Model: BGE-M3 (Xorbits/bge-m3)
Dimensions: 1024-dimensional vectors
Hardware: GPU/CUDA acceleration with CPU fallback
Caching: Redis-based caching for common queries
Usage: Semantic search combined with BM25 relevance


Image Search Pipeline:


Model: CN-CLIP (ViT-H-14)
Processing: URL download → preprocessing → vectorization
Storage: Nested structure with vector + original URL
Application: Visual similarity search for products


Translation Integration:


Service: DeepL API with configurable auth
Languages: Chinese ↔ English ↔ Russian support
Caching: Translation result caching
Fallback: Mock mode returns original text if API unavailable
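
The caching and fallback behaviour described above might look like the following sketch. CachedTranslator and the injected translate_fn are hypothetical; the real service calls the DeepL API, which is deliberately left out here, and the in-memory dict stands in for whatever cache the project uses.

```python
from typing import Callable, Dict, Optional, Tuple


class CachedTranslator:
    """Translate with a result cache; fall back to the original text on failure."""

    def __init__(self, translate_fn: Optional[Callable[[str, str], str]] = None) -> None:
        # translate_fn=None models mock mode (no API available).
        self._translate_fn = translate_fn
        self._cache: Dict[Tuple[str, str], str] = {}

    def translate(self, text: str, target_lang: str) -> str:
        key = (text, target_lang)
        if key in self._cache:
            return self._cache[key]
        if self._translate_fn is None:
            return text  # mock mode: return the input unchanged
        try:
            result = self._translate_fn(text, target_lang)
        except Exception:
            return text  # degrade gracefully; failures are not cached
        self._cache[key] = result
        return result
```

Failures are intentionally not cached in this sketch, so a later call can succeed once the service recovers.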

Development & Deployment
Service, Data, and Test Commands:
# Core Services
./run.sh                    # Start all services
./scripts/start_backend.sh  # Backend only (port 6002)
./scripts/start_frontend.sh # Frontend UI (port 6003)

# Data Operations
./scripts/ingest.sh <tenant_id> [recreate]  # Index data
./scripts/mock_data.sh                    # Generate test data

# Testing
python -m pytest tests/    # Full test suite
python main.py search "query" --tenant-id 1  # Quick search test


Key Files for Configuration:


config/config.yaml: Search behavior configuration
mappings/search_products.json: ES index structure
.env: Environment variables and secrets
api/models.py: Pydantic request/response models


Common Development Tasks:


Modifying Search Behavior: Edit config/config.yaml
Changing Index Structure: Update mappings/search_products.json
Adding New Filters: Extend api/models.py with new Pydantic models
Updating Ranking: Modify ranking.expression in config
Testing Queries: Use frontend UI at http://localhost:6003

Key Implementation Details

Environment Variables: All sensitive configuration in .env (template: .env.example)
Configuration Management: Centralized YAML config with validation
Error Handling: Comprehensive exception handling with proper HTTP status codes
Performance: Batch processing, embedding caching, connection pooling
Logging: Structured logging with request tracing and context
Security: Tenant isolation, rate limiting, CORS, security headers
API Documentation: Auto-generated FastAPI docs at /docs
Multi-tenant Architecture: Single index with tenant_id isolation
Hybrid Search: BM25 + vector similarity with configurable weighting
Production Ready: Health checks, monitoring, graceful degradation