CLAUDE.md

This file provides comprehensive guidance for Claude Code (claude.ai/code) when working with this Search Engine codebase.

Project Overview

This is a production-ready multi-tenant e-commerce search SaaS platform built for Shoplazza (店匠) independent sites. It combines traditional keyword-based search with embedding-based semantic search and serves multiple tenants from a unified infrastructure.

Core Architecture Philosophy:

  • Unified Multi-Tenant Design: Single Elasticsearch index with tenant isolation via tenant_id
  • Hybrid Search Engine: BM25 text relevance + Dense vector similarity (BGE-M3)
  • SPU-Centric Indexing: Product-level indexing with nested SKU structures
  • Production-Grade: Comprehensive error handling, monitoring, and operational features

Tech Stack:

  • Search Backend: Elasticsearch 8.x with custom BM25 similarity (b=0.0, k1=0.0)
  • Data Source: MySQL (Shoplazza database) with custom data transformers
  • Backend Framework: FastAPI with async support and comprehensive middleware
  • ML/AI Models: BGE-M3 for text embeddings (1024-dim), CN-CLIP for image embeddings (1024-dim)
  • Language Processing: Multi-language support (Chinese, English, Russian) with DeepL API
  • API Layer: RESTful FastAPI service on port 6002 with auto-generated documentation
  • Frontend: Debugging UI on port 6003 with real-time search capabilities

Development Environment

Required Environment Setup:

source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine

Database Configuration:

host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R

Service Endpoints:

  • Backend API: port 6002 (auto-generated docs at /docs)
  • Frontend debugging UI: port 6003

Common Development Commands

Environment Setup

# Complete environment setup
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

Data Management

# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh

# Ingest data to Elasticsearch
./scripts/ingest.sh <tenant_id> [recreate]  # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50

Running Services

# Start all services (production)
./run.sh

# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload

# Start frontend debugging UI
./scripts/start_frontend.sh

Testing

# Run all tests
python -m pytest tests/

# Run specific test types
python -m pytest tests/unit/          # Unit tests
python -m pytest tests/integration/   # Integration tests
python -m pytest -m "api"             # API tests only

# Test search from command line
python main.py search "query" --tenant-id 1 --size 10

Development Utilities

# Stop all services
./scripts/stop.sh

# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh

# Install server dependencies
./scripts/install_server_deps.sh

Architecture Overview

Core Components

/data/tw/SearchEngine/
├── api/              # FastAPI REST API service (port 6002)
├── config/           # Configuration management system
├── indexer/          # MySQL → Elasticsearch data pipeline
├── search/           # Search engine and ranking logic
├── query/            # Query parsing, translation, rewriting
├── embeddings/       # ML models (BGE-M3, CN-CLIP)
├── scripts/          # Automation and utility scripts
├── utils/            # Shared utilities (ES client, etc.)
├── frontend/         # Simple debugging UI
├── mappings/         # Elasticsearch index mappings
└── tests/            # Unit and integration tests

Data Flow Architecture

Pipeline: MySQL → Indexer → Elasticsearch → API → Frontend

  1. Data Source Layer:

    • Shoplazza MySQL database with shoplazza_product_sku and shoplazza_product_spu tables
    • Tenant-specific extension tables for custom attributes and multi-language fields
  2. Indexing Layer (indexer/):

    • Reads from MySQL, applies transformations with embeddings
    • Uses DataTransformer and IndexingPipeline for batch processing
    • Supports both full and incremental indexing with embedding caching
  3. Query Processing Layer (query/):

    • QueryParser: Handles query rewriting, translation, and text embedding conversion
    • Multi-language support with automatic detection and translation
    • Boolean logic parsing with operator precedence: () > ANDNOT > AND > OR > RANK
  4. Search Engine Layer (search/):

    • Searcher: Executes hybrid searches combining BM25 and dense vectors
    • Configurable ranking expressions with function_score support
    • Multi-tenant isolation via tenant_id field
  5. API Layer (api/):

    • FastAPI service on port 6002 with multi-tenant support
    • Text search: POST /search/
    • Image search: POST /image-search/
    • Tenant identification via X-Tenant-ID header
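As a minimal sketch of how tenant isolation could look at the query level, the function below builds an Elasticsearch request body that always carries a tenant_id filter. The function name, the chosen fields, and the boosts are illustrative assumptions, not the project's actual Searcher code.

```python
from typing import Any, Dict, List

# Illustrative field list; the real fields and boosts come from config/config.yaml.
DEFAULT_FIELDS: List[str] = ["title_zh^3.0", "brief_zh^1.5", "description_zh^1.0"]


def build_tenant_query(tenant_id: str, text: str, size: int = 10) -> Dict[str, Any]:
    """Return an Elasticsearch request body restricted to a single tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required for every search")
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored clause: text relevance over the configured fields.
                "must": [{"multi_match": {"query": text, "fields": DEFAULT_FIELDS}}],
                # Non-scoring clause: hard isolation on tenant_id.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
    }


body = build_tenant_query("1", "toy")
```

Placing the tenant clause under filter keeps it out of relevance scoring, so it only restricts which documents can match.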

Multi-Tenant Configuration System

The system uses centralized configuration through config/config.yaml:

  1. Field Configuration (config/field_types.py):

    • Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
    • Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
    • Required fields and preprocessing rules
  2. Index Configuration (mappings/search_products.json):

    • Unified index structure shared by all tenants
    • Elasticsearch field mappings and analyzer configurations
    • BM25 similarity with modified parameters (b=0.0, k1=0.0)
  3. Query Configuration (search/query_config.py):

    • Query domain definitions (default, category_name, title, brand_name, etc.)
    • Ranking expressions and function_score configurations
    • Translation and embedding settings

Embedding Models

Text Embedding (embeddings/bge_encoder.py):

  • Uses BGE-M3 model (Xorbits/bge-m3)
  • Singleton pattern with thread-safe initialization
  • Generates 1024-dimensional vectors with GPU/CUDA support
  • Configurable caching to avoid recomputation
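The singleton and caching behaviour described above can be sketched as follows. This is not the code in embeddings/bge_encoder.py: the class name is hypothetical, and a stand-in encode function replaces the real BGE-M3 model so the example runs without any ML dependency.

```python
import threading
from typing import Callable, Dict, List, Optional


class EncoderSingleton:
    """Process-wide encoder wrapper with lazy, thread-safe construction."""

    _instance: Optional["EncoderSingleton"] = None
    _lock = threading.Lock()

    def __init__(self, encode_fn: Callable[[str], List[float]]) -> None:
        self._encode_fn = encode_fn
        self._cache: Dict[str, List[float]] = {}
        self.calls = 0  # number of real encode calls, for illustration

    @classmethod
    def get(cls, encode_fn: Callable[[str], List[float]]) -> "EncoderSingleton":
        # Double-checked locking: only the first caller pays for construction.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls(encode_fn)
        return cls._instance

    def encode(self, text: str) -> List[float]:
        # Cache by input text so repeated queries skip recomputation.
        if text not in self._cache:
            self.calls += 1
            self._cache[text] = self._encode_fn(text)
        return self._cache[text]


def fake_encode(text: str) -> List[float]:
    """Stand-in for a real model; returns a tiny deterministic vector."""
    return [float(len(text))] * 4


first = EncoderSingleton.get(fake_encode)
second = EncoderSingleton.get(fake_encode)
first.encode("red dress")
first.encode("red dress")
```

Both get() calls return the same object, and the second encode() call is served from the cache.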

Image Embedding (embeddings/clip_encoder.py):

  • Uses CN-CLIP model (ViT-H-14)
  • Downloads and preprocesses images from URLs
  • Supports both local and remote image processing
  • Generates 1024-dimensional vectors

Search and Ranking

Hybrid Search Approach:

  • Combines traditional BM25 text relevance with dense vector similarity
  • Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
  • Configurable ranking expressions like: static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)

Boolean Search Support:

  • Full boolean logic with AND, OR, ANDNOT, RANK operators
  • Parentheses for complex query structures
  • Configurable operator precedence

Faceted Search:

  • Terms and range faceting support
  • Multi-dimensional filtering capabilities
  • Configurable facet fields and aggregations

Testing Infrastructure

Test Framework: pytest with async support

Test Structure:

  • tests/conftest.py: Comprehensive test fixtures and configuration
  • tests/unit/: Unit tests for individual components
  • tests/integration/: Integration tests for system workflows
  • Test markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.api

Test Data:

  • Tenant1: Mock data with 10,000 product records
  • Tenant2: CSV-based test dataset
  • Automated test data generation via scripts/mock_data.sh

Key Test Fixtures (from conftest.py):

  • sample_search_config: Complete configuration for testing
  • mock_es_client: Mocked Elasticsearch client
  • test_searcher: Searcher instance with mock dependencies
  • temp_config_file: Temporary YAML configuration for tests

API Endpoints

Main API (FastAPI on port 6002):

  • POST /search/ - Text search with multi-language support
  • POST /image-search/ - Image search using CN-CLIP embeddings
  • Health check and management endpoints
  • Multi-tenant support via X-Tenant-ID header

API Features:

  • Hybrid search combining text and vector similarity
  • Configurable ranking and filtering
  • Faceted search with aggregations
  • Multi-language query processing and translation
  • Real-time search with configurable result sizes

Core System Architecture & Design

Unified Multi-Tenant Index Structure

Index Design Philosophy: Single search_products index shared by all tenants with data isolation via tenant_id field.

Key Benefits:

  • Resource efficiency and cost optimization
  • Simplified maintenance and operations
  • Better query performance through optimized sharding
  • Easier cross-tenant analytics and monitoring

Index Structure (from mappings/search_products.json)

Core Document Structure (SPU-level):

{
  "tenant_id": "keyword",           // Multi-tenant isolation
  "spu_id": "keyword",              // Product identifier
  "title_zh/en": "text",            // Multi-language titles
  "brief_zh/en": "text",            // Short descriptions
  "description_zh/en": "text",      // Detailed descriptions
  "vendor_zh/en": "text",           // Supplier/brand with keyword subfield
  "category_path_zh/en": "text",    // Hierarchical category paths
  "category_name_zh/en": "text",    // Category names for search
  "category1/2/3_name": "keyword",  // Multi-level category filtering
  "tags": "keyword",                // Product tags
  "specifications": "nested",       // Product variants (color, size, etc.)
  "skus": "nested",                 // Detailed SKU information
  "min_price/max_price": "float",   // Price range calculations
  "title_embedding": "dense_vector", // 1024-dim semantic vectors
  "image_embedding": "nested",      // Image vectors for visual search
  "total_inventory": "long"         // Aggregate inventory
}

Analyzers Configuration:

  • Chinese fields: hanlp_index (indexing) / hanlp_standard (searching)
  • English fields: english analyzer
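  • BM25 Similarity: Custom parameters (b=0.0, k1=0.0). b=0.0 disables document-length normalization and k1=0.0 removes term-frequency weighting, so a matching term contributes its IDF weight no matter how often it occurs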
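To see what b=0.0 and k1=0.0 do, here is a small worked example using the classic textbook BM25 term-frequency factor. Elasticsearch's internal implementation may differ by constant factors; the function name is an assumption made for illustration.

```python
def bm25_tf_factor(tf: float, doc_len: float, avg_doc_len: float, k1: float, b: float) -> float:
    """Term-frequency factor of the classic BM25 formula (IDF is multiplied in separately)."""
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# With k1 = 0 and b = 0 the factor is 1.0 for any term frequency and document length.
flat_short = bm25_tf_factor(tf=1, doc_len=5, avg_doc_len=50, k1=0.0, b=0.0)
flat_long = bm25_tf_factor(tf=8, doc_len=500, avg_doc_len=50, k1=0.0, b=0.0)

# With commonly used defaults the same inputs give a length- and tf-dependent value.
default_long = bm25_tf_factor(tf=8, doc_len=500, avg_doc_len=50, k1=1.2, b=0.75)
```

Under these settings, repeating a keyword in a title or description does not raise a product's text score.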

Data Source Architecture (from 索引字段说明v2-参考表结构.md)

Primary MySQL Tables:

shoplazza_product_spu (Product Level):

- id, shop_id, shoplazza_id, handle
- title, brief, description, vendor
- category, category_id, category_level, category_path
- image_src, image_width, image_height
- tags, fake_sales, published
- inventory_policy, inventory_quantity
- seo_title, seo_description, seo_keywords
- tenant_id, create_time, update_time

shoplazza_product_sku (Variant Level):

- id, spu_id, shop_id, shoplazza_id
- title, sku, barcode
- price, compare_at_price, cost_price
- option1, option2, option3 (variant values)
- inventory_quantity, weight, weight_unit
- image_src, wholesale_price, extend
- tenant_id, create_time, update_time

Data Transformation Pipeline:

  1. SPU-Centric Aggregation: Group SKUs under parent SPU
  2. Multi-Language Field Mapping: MySQL → ES bilingual fields
  3. Category Path Parsing: Extract hierarchical categories
  4. Specifications Building: Create nested variant structures
  5. Price Range Calculation: min/max across all SKUs
  6. Vector Generation: BGE-M3 embeddings for titles

Advanced Search Configuration (from config/config.yaml)

Field Boost Configuration:

field_boosts:
  title_zh/en: 3.0              # Highest priority
  brief_zh/en: 1.5              # Medium priority
  description_zh/en: 1.0        # Lower priority
  vendor_zh/en: 1.5             # Brand emphasis
  category_path_zh/en: 1.5      # Category relevance
  tags: 1.0                     # Tag matching

Search Domains:

  • default: Comprehensive search across all text fields
  • title: Title-focused search (boost: 2.0)
  • vendor: Brand-specific search (boost: 1.5)
  • category: Category-focused search (boost: 1.5)
  • tags: Tag-based search (boost: 1.0)

Query Processing Features:

query_config:
  supported_languages: ["zh", "en", "ru"]
  enable_translation: true        # DeepL API integration
  enable_text_embedding: true     # BGE-M3 vector search
  enable_query_rewrite: true      # Dictionary-based expansion
  embedding_disable_thresholds:
    chinese_char_limit: 4        # Short query optimization
    english_word_limit: 3
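One plausible reading of embedding_disable_thresholds is sketched below: queries at or below the limit skip vector search. The comparison direction, the CJK character range, and the function name are assumptions; check the query parser for the actual rule.

```python
import re

CHINESE_CHAR_LIMIT = 4  # mirrors chinese_char_limit
ENGLISH_WORD_LIMIT = 3  # mirrors english_word_limit

# Basic CJK Unified Ideographs block; an assumption about what counts as a Chinese character.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def should_use_embedding(query: str) -> bool:
    """Return False for short queries, where keyword matching alone is assumed to suffice."""
    text = query.strip()
    if not text:
        return False
    chinese_chars = len(_CJK.findall(text))
    if chinese_chars > 0:
        return chinese_chars > CHINESE_CHAR_LIMIT
    return len(text.split()) > ENGLISH_WORD_LIMIT
```

For example, should_use_embedding("red dress") is False under this reading, while a six-word English query enables the embedding step.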

Ranking Formula:

bm25() + 0.2*text_embedding_relevance()
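The formula can be read as a weighted sum of a text score and a vector-similarity score. The sketch below assumes that text_embedding_relevance() is the cosine similarity between query and document vectors; that interpretation and the function names are assumptions, not confirmed by this file.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_score(bm25_score: float, query_vec: Sequence[float],
                 doc_vec: Sequence[float], embedding_weight: float = 0.2) -> float:
    """bm25() + embedding_weight * text_embedding_relevance()."""
    return bm25_score + embedding_weight * cosine_similarity(query_vec, doc_vec)


same_direction = hybrid_score(2.0, [1.0, 0.0], [1.0, 0.0])
orthogonal = hybrid_score(2.0, [1.0, 0.0], [0.0, 1.0])
```

With a weight of 0.2, the vector term can shift a document's score by at most 0.2 in either direction under the cosine assumption.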

Sophisticated Query Processing Pipeline

Multi-Language Search Architecture:

  1. Query Normalization: Clean and standardize input
  2. Language Detection: Automatic identification (zh/en/ru)
  3. Query Rewriting: Dictionary-based expansion and synonyms
  4. Translation Service: DeepL API for cross-language search
  5. Vector Generation: BGE-M3 embeddings for semantic search
  6. Boolean Parsing: Complex expression evaluation

Boolean Expression Support:

  • Operators: AND, OR, ANDNOT, RANK, parentheses
  • Precedence: () > ANDNOT > AND > OR > RANK
  • Example: 玩具 AND (乐高 OR 芭比) ANDNOT 电动 (in English: toys AND (Lego OR Barbie) ANDNOT electric)
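The precedence rule above can be illustrated with a small recursive-descent parser. This is a sketch of the technique, not the project's parser: the tuple-based tree, the tokenizer, and the error handling are assumptions, and all operators are treated as left-associative.

```python
import re
from typing import List, Optional, Tuple, Union

Node = Union[str, Tuple[str, "Node", "Node"]]

# Operators ordered from loosest to tightest binding: () > ANDNOT > AND > OR > RANK.
LEVELS: List[str] = ["RANK", "OR", "AND", "ANDNOT"]
_TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse(query: str) -> Node:
    """Parse a boolean query into a nested (operator, left, right) tuple tree."""
    tokens = _TOKEN.findall(query)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def take() -> str:
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return tok

    def parse_level(level: int) -> Node:
        if level == len(LEVELS):
            return parse_atom()
        node = parse_level(level + 1)
        while peek() == LEVELS[level]:
            op = take()
            node = (op, node, parse_level(level + 1))
        return node

    def parse_atom() -> Node:
        tok = take()
        if tok == "(":
            node = parse_level(0)
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if tok == ")" or tok in LEVELS:
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    tree = parse_level(0)
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree


tree = parse("玩具 AND (乐高 OR 芭比) ANDNOT 电动")
```

Because ANDNOT binds tighter than AND, the example groups as 玩具 AND ((乐高 OR 芭比) ANDNOT 电动).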

E-Commerce Specialized Features

Specifications System (Product Variants):

"specifications": [
  {
    "sku_id": "sku_123",
    "name": "color",
    "value": "white"
  },
  {
    "sku_id": "sku_123",
    "name": "size",
    "value": "256GB"
  }
]

Advanced Filtering Logic:

  • Different dimensions (different name): AND relationship
  • Same dimension (same name): OR relationship
  • Example: (color=white OR color=black) AND size=256GB
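The AND/OR rule above maps naturally onto Elasticsearch nested queries: one nested clause per dimension, with a terms query holding the alternative values. The sketch assumes specifications.name and specifications.value are keyword fields, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def build_specification_filters(specs: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Group filters by name: values of one name are OR-ed, different names are AND-ed."""
    grouped: Dict[str, List[str]] = {}
    for spec in specs:
        grouped.setdefault(spec["name"], []).append(spec["value"])

    clauses: List[Dict[str, Any]] = []
    for name, values in grouped.items():
        clauses.append({
            "nested": {
                "path": "specifications",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"specifications.name": name}},
                            # terms matches any listed value (OR within a dimension).
                            {"terms": {"specifications.value": values}},
                        ]
                    }
                },
            }
        })
    # The caller places every returned clause in bool.filter (AND across dimensions).
    return clauses


clauses = build_specification_filters([
    {"name": "color", "value": "white"},
    {"name": "color", "value": "black"},
    {"name": "size", "value": "256GB"},
])
```

Using one nested clause per dimension means the name and value must match inside the same nested specification object.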

Faceted Search Capabilities:

  • Category Faceting: Multi-level category aggregations
  • Specifications Faceting: Nested aggregations by variant name
  • Range Faceting: Price ranges, date ranges
  • Multi-Select Support: Disjunctive faceting for filters

SKU Filtering System:

  • Dimension-based Grouping: Filter by option1/2/3 or specification names
    • Application Layer: SKU filtering runs in the application layer, outside ES, for performance
  • Use Case: Display one SKU per variant combination (e.g., one per color)
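A minimal sketch of application-layer SKU filtering, assuming each SKU is a dict carrying option1/option2/option3 values and that the first SKU seen for each combination is the one to keep. The function name and the keep-first rule are assumptions.

```python
from typing import Any, Dict, List, Sequence, Set, Tuple


def pick_skus_by_dimension(skus: List[Dict[str, Any]],
                           dimensions: Sequence[str]) -> List[Dict[str, Any]]:
    """Keep the first SKU for each distinct combination of the given dimension values."""
    seen: Set[Tuple[Any, ...]] = set()
    kept: List[Dict[str, Any]] = []
    for sku in skus:
        key = tuple(sku.get(dim) for dim in dimensions)
        if key not in seen:
            seen.add(key)
            kept.append(sku)
    return kept


skus = [
    {"sku_id": "sku_1", "option1": "white", "option2": "128GB"},
    {"sku_id": "sku_2", "option1": "white", "option2": "256GB"},
    {"sku_id": "sku_3", "option1": "black", "option2": "256GB"},
]
one_per_color = pick_skus_by_dimension(skus, ["option1"])
```

Here one_per_color holds one white and one black SKU; passing ["option1", "option2"] would keep all three.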

API Architecture & Usage (from 搜索API对接指南.md)

Core API Endpoints:

POST /search/          # Main text search
POST /search/image     # Image search (CN-CLIP)
GET /search/{doc_id}   # Document retrieval
GET /admin/health      # Health check
GET /admin/config      # Configuration info
GET /admin/stats       # Index statistics

Request Structure:

{
  "query": "string (required)",
  "size": 10, "from": 0,
  "language": "zh|en",
  "filters": {}, "range_filters": {},
  "facets": [],
  "sort_by": "price|sales|create_time",
  "sort_order": "asc|desc",
  "sku_filter_dimension": ["color", "size"],
  "min_score": 0.0,
  "debug": false
}
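As a hedged illustration of the request structure, the snippet below builds (but does not send) a POST /search/ request with the standard library. The base URL http://localhost:6002 and the helper name are assumptions; only fields from the structure above are used.

```python
import json
import urllib.request


def build_search_request(base_url: str, tenant_id: str, query: str,
                         size: int = 10, language: str = "zh") -> urllib.request.Request:
    """Build a POST /search/ request carrying the tenant in the X-Tenant-ID header."""
    payload = {"query": query, "size": size, "from": 0, "language": language}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant_id},
        method="POST",
    )


req = build_search_request("http://localhost:6002", "1", "toy")
# Sending requires a running backend: urllib.request.urlopen(req)
```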

Advanced Filter Examples:

Specifications Filtering:

{
  "filters": {
    "specifications": {
      "name": "color",
      "value": "white"
    }
  }
}

Multi-Dimension Specifications:

{
  "filters": {
    "specifications": [
      {"name": "color", "value": "white"},
      {"name": "size", "value": "256GB"}
    ]
  }
}

Range Filtering:

{
  "range_filters": {
    "min_price": {"gte": 50, "lte": 200},
    "create_time": {"gte": "2024-01-01T00:00:00Z"}
  }
}

Faceted Search Configuration:

{
  "facets": [
    {"field": "category1_name", "size": 15, "type": "terms"},
    {"field": "specifications.color", "size": 20, "type": "terms"},
    {"field": "min_price", "type": "range", "ranges": [...]}
  ]
}

Multi-Select Faceting (NEW FEATURE)

Standard Mode (multi_select: false):

  • Behavior: Selected value becomes the only option shown
  • Use Case: Hierarchical category navigation
  • Example: Toys → Dolls → Barbie

Multi-Select Mode (multi_select: true) ⭐:

  • Behavior: All options remain visible after selection
  • Use Case: Colors, brands, sizes (switchable attributes)
  • Example: Select "red" but still see "blue", "green", etc.

Recommended Configuration:

| Facet Type | Multi-Select | Reason |
|-----------|-------------|---------|
| Color | true | Users need to switch colors |
| Brand | true | Users need to compare brands |
| Size | true | Users need to check other sizes |
| Category | false | Hierarchical navigation |
| Price Range | false | Mutually exclusive ranges |
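Multi-select (disjunctive) faceting is commonly implemented by computing each facet under every active filter except its own. The sketch below builds such filter aggregations. It assumes flat keyword fields and that selections are applied to hits separately (for example through post_filter); the function name and input shapes are hypothetical.

```python
from typing import Any, Dict, List


def build_facet_aggs(facets: List[Dict[str, Any]],
                     selected: Dict[str, List[str]]) -> Dict[str, Any]:
    """Build one filter aggregation per facet; multi-select facets ignore their own selection."""
    aggs: Dict[str, Any] = {}
    for facet in facets:
        field = facet["field"]
        multi = bool(facet.get("multi_select", False))
        filters = [
            {"terms": {name: values}}
            for name, values in selected.items()
            if not (multi and name == field)
        ]
        scope = {"bool": {"filter": filters}} if filters else {"match_all": {}}
        aggs[field] = {
            "filter": scope,
            "aggs": {"values": {"terms": {"field": field, "size": facet.get("size", 10)}}},
        }
    return aggs


aggs = build_facet_aggs(
    facets=[
        {"field": "tags", "size": 20, "multi_select": True},
        {"field": "category1_name", "size": 15, "multi_select": False},
    ],
    selected={"tags": ["red"], "category1_name": ["Toys"]},
)
```

In this example the tags facet is counted over all Toys products, so other tag values stay visible, while the category facet is narrowed by both selections.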

Production Features & Operations

Comprehensive Error Handling:

  • Graceful degradation for model failures
  • Fallback mechanisms for service unavailability
  • Detailed error logging and context tracking

Performance Optimizations:

  • Embedding caching to avoid redundant computations
  • Adaptive vector search (disabled for short queries)
  • Batch processing for data indexing
  • Connection pooling for database operations

Security & Multi-Tenancy:

  • Strict tenant isolation via tenant_id filtering
  • Rate limiting with SlowAPI integration
  • CORS and security headers middleware
  • Request context logging for auditing

Monitoring & Observability:

  • Structured logging with request tracing
  • Health check endpoints for all dependencies
  • Performance metrics and timing information
  • Index statistics and document counts

Data Model Insights

Key Design Decisions:

  1. SPU over SKU Indexing: Each ES document represents a product (SPU) with nested SKUs

    • Reduces index size and improves search performance
    • Maintains variant information through nested structures
  2. Bilingual Field Strategy: Separate *_zh and *_en fields

    • Enables language-specific analyzer configuration
    • Provides fallback mechanisms for missing translations
  3. Nested vs Flat Design: Strategic use of nested vs flattened fields

    • specifications and skus: Nested for complex queries
    • min_price, total_inventory: Flattened for filtering/sorting
  4. Vector Field Isolation: Embedding fields only used for search

    • Not returned in API responses (index: false where appropriate)
    • Reduces network payload and improves response times

AI/ML Integration Details

Text Embedding Pipeline:

  • Model: BGE-M3 (Xorbits/bge-m3)
  • Dimensions: 1024-dimensional vectors
  • Hardware: GPU/CUDA acceleration with CPU fallback
  • Caching: Redis-based caching for common queries
  • Usage: Semantic search combined with BM25 relevance

Image Search Pipeline:

  • Model: CN-CLIP (ViT-H-14)
  • Processing: URL download → preprocessing → vectorization
  • Storage: Nested structure with vector + original URL
  • Application: Visual similarity search for products

Translation Integration:

  • Service: DeepL API with configurable auth
  • Languages: Chinese ↔ English ↔ Russian support
  • Caching: Translation result caching
  • Fallback: Mock mode returns original text if API unavailable
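The caching and fallback behaviour can be sketched with a thin wrapper around any translate callable. No DeepL client is used here: the wrapper class and the stand-in translator are hypothetical, and the choice not to cache failures is an assumption.

```python
from typing import Callable, Dict, Tuple


class CachedTranslator:
    """Cache successful translations and fall back to the original text on errors."""

    def __init__(self, translate_fn: Callable[[str, str], str]) -> None:
        self._translate_fn = translate_fn
        self._cache: Dict[Tuple[str, str], str] = {}

    def translate(self, text: str, target_lang: str) -> str:
        key = (text, target_lang)
        if key in self._cache:
            return self._cache[key]
        try:
            result = self._translate_fn(text, target_lang)
        except Exception:
            # Service unavailable: search continues with the untranslated query.
            return text
        self._cache[key] = result
        return result


calls = {"count": 0}


def stub_translate(text: str, target_lang: str) -> str:
    """Stand-in for a real translation API call."""
    calls["count"] += 1
    if text == "boom":
        raise RuntimeError("service unavailable")
    return f"{text}:{target_lang}"


translator = CachedTranslator(stub_translate)
first = translator.translate("玩具", "en")
second = translator.translate("玩具", "en")
failed = translator.translate("boom", "en")
```

The second call is served from the cache, and the failing call returns its input unchanged.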

Development & Deployment

Command Reference:

# Core Services
./run.sh                    # Start all services
./scripts/start_backend.sh  # Backend only (port 6002)
./scripts/start_frontend.sh # Frontend UI (port 6003)

# Data Operations
./scripts/ingest.sh <tenant_id> [recreate]  # Index data
./scripts/mock_data.sh                    # Generate test data

# Testing
python -m pytest tests/    # Full test suite
python main.py search "query" --tenant-id 1  # Quick search test

Key Files for Configuration:

  • config/config.yaml: Search behavior configuration
  • mappings/search_products.json: ES index structure
  • .env: Environment variables and secrets
  • api/models.py: Pydantic request/response models

Common Development Tasks:

  1. Modifying Search Behavior: Edit config/config.yaml
  2. Changing Index Structure: Update mappings/search_products.json
  3. Adding New Filters: Extend api/models.py with new Pydantic models
  4. Updating Ranking: Modify ranking.expression in config
  5. Testing Queries: Use frontend UI at http://localhost:6003

Key Implementation Details

  1. Environment Variables: All sensitive configuration in .env (template: .env.example)
  2. Configuration Management: Centralized YAML config with validation
  3. Error Handling: Comprehensive exception handling with proper HTTP status codes
  4. Performance: Batch processing, embedding caching, connection pooling
  5. Logging: Structured logging with request tracing and context
  6. Security: Tenant isolation, rate limiting, CORS, security headers
  7. API Documentation: Auto-generated FastAPI docs at /docs
  8. Multi-tenant Architecture: Single index with tenant_id isolation
  9. Hybrid Search: BM25 + vector similarity with configurable weighting
  10. Production Ready: Health checks, monitoring, graceful degradation