# CLAUDE.md

This file provides comprehensive guidance for Claude Code (claude.ai/code) when working with this Search Engine codebase.

## Project Overview

This is a **production-ready Multi-Tenant E-Commerce Search SaaS** platform designed for Shoplazza (店匠) independent sites. It combines BM25 keyword search with dense-vector semantic search (BGE-M3 text embeddings, CN-CLIP image embeddings) and serves all tenants from one shared infrastructure.

**Core Architecture Philosophy:**
- **Unified Multi-Tenant Design**: Single Elasticsearch index with tenant isolation via `tenant_id`
- **Hybrid Search Engine**: BM25 text relevance + Dense vector similarity (BGE-M3)
- **SPU-Centric Indexing**: Product-level indexing with nested SKU structures
- **Production-Grade**: Comprehensive error handling, monitoring, and operational features

**Tech Stack:**
- **Search Backend**: Elasticsearch 8.x with custom BM25 similarity (b=0.0, k1=0.0)
- **Data Source**: MySQL (Shoplazza database) with custom data transformers
- **Backend Framework**: FastAPI with async support and comprehensive middleware
- **ML/AI Models**: BGE-M3 for text embeddings (1024-dim), CN-CLIP for image embeddings (1024-dim)
- **Language Processing**: Multi-language support (Chinese, English, Russian) with DeepL API
- **API Layer**: RESTful FastAPI service on port 6002 with auto-generated documentation
- **Frontend**: Debugging UI on port 6003 with real-time search capabilities

## Development Environment

**Required Environment Setup:**
```bash
source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine
```

**Database Configuration:**
```yaml
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: "<read from .env>"
```

**Service Endpoints:**
- **Backend API**: http://localhost:6002 (FastAPI)
- **Frontend UI**: http://localhost:6003 (Debug interface)
- **Elasticsearch**: http://localhost:9200
- **API Documentation**: http://localhost:6002/docs

## Common Development Commands

### Environment Setup
```bash
# Complete environment setup
./setup.sh

# Install Python dependencies
pip install -r requirements.txt
```

### Data Management
```bash
# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh

# Ingest data to Elasticsearch
./scripts/ingest.sh <tenant_id> [recreate]  # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50
```

### Running Services
```bash
# Start all services (production)
./run.sh

# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload

# Start frontend debugging UI
./scripts/start_frontend.sh
```

### Testing
```bash
# Run all tests
python -m pytest tests/

# Run specific test types
python -m pytest tests/unit/          # Unit tests
python -m pytest tests/integration/   # Integration tests
python -m pytest -m "api"             # API tests only

# Test search from command line
python main.py search "query" --tenant-id 1 --size 10
```

### Development Utilities
```bash
# Stop all services
./scripts/stop.sh

# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh

# Install server dependencies
./scripts/install_server_deps.sh
```

## Architecture Overview

### Core Components
```
/data/tw/SearchEngine/
├── api/              # FastAPI REST API service (port 6002)
├── config/           # Configuration management system
├── indexer/          # MySQL → Elasticsearch data pipeline
├── search/           # Search engine and ranking logic
├── query/            # Query parsing, translation, rewriting
├── embeddings/       # ML models (BGE-M3, CN-CLIP)
├── scripts/          # Automation and utility scripts
├── utils/            # Shared utilities (ES client, etc.)
├── frontend/         # Simple debugging UI
├── mappings/         # Elasticsearch index mappings
└── tests/            # Unit and integration tests
```

### Data Flow Architecture
**Pipeline**: MySQL → Indexer → Elasticsearch → API → Frontend

1. **Data Source Layer**:
   - Shoplazza MySQL database with `shoplazza_product_sku` and `shoplazza_product_spu` tables
   - Tenant-specific extension tables for custom attributes and multi-language fields

2. **Indexing Layer** (`indexer/`):
   - Reads from MySQL, applies transformations with embeddings
   - Uses `DataTransformer` and `IndexingPipeline` for batch processing
   - Supports both full and incremental indexing with embedding caching

3. **Query Processing Layer** (`query/`):
   - `QueryParser`: Handles query rewriting, translation, and text embedding conversion
   - Multi-language support with automatic detection and translation
   - Boolean logic parsing with operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`

4. **Search Engine Layer** (`search/`):
   - `Searcher`: Executes hybrid searches combining BM25 and dense vectors
   - Configurable ranking expressions with function_score support
   - Multi-tenant isolation via `tenant_id` field

5. **API Layer** (`api/`):
   - FastAPI service on port 6002 with multi-tenant support
   - Text search: `POST /search/`
   - Image search: `POST /image-search/`
   - Tenant identification via `X-Tenant-ID` header
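
To make the search and API layers above more concrete, here is a minimal sketch of how a tenant-scoped hybrid request body could be assembled for a recent Elasticsearch 8.x release. The function name, the boosted title fields, and the `k` / `num_candidates` values are assumptions for illustration; the real `Searcher` may build its queries differently.

```python
def build_hybrid_query(tenant_id, text, query_vector, size=10):
    """Assemble a search body: BM25 text match plus kNN, both restricted to one tenant."""
    tenant_filter = {"term": {"tenant_id": tenant_id}}
    return {
        "size": size,
        # Lexical part: scored by the index's BM25 similarity.
        "query": {
            "bool": {
                "must": [
                    {"multi_match": {"query": text, "fields": ["title_zh^3", "title_en^3"]}}
                ],
                "filter": [tenant_filter],
            }
        },
        # Vector part: approximate kNN on the title embedding, weighted at 0.2
        # (the weight is taken from the ranking expression; assumed to map to `boost`).
        "knn": {
            "field": "title_embedding",
            "query_vector": query_vector,
            "k": size,
            "num_candidates": max(100, size * 10),
            "filter": tenant_filter,
            "boost": 0.2,
        },
    }
```

The tenant filter is applied to both the lexical and the vector clause, so neither path can return another tenant's documents.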

### Multi-Tenant Configuration System

The system uses centralized configuration through `config/config.yaml`:

1. **Field Configuration** (`config/field_types.py`):
   - Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
   - Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
   - Required fields and preprocessing rules

2. **Index Configuration** (`mappings/search_products.json`):
   - Unified index structure shared by all tenants
   - Elasticsearch field mappings and analyzer configurations
   - BM25 similarity with modified parameters (`b=0.0, k1=0.0`)

3. **Query Configuration** (`search/query_config.py`):
   - Query domain definitions (default, category_name, title, brand_name, etc.)
   - Ranking expressions and function_score configurations
   - Translation and embedding settings

### Embedding Models

**Text Embedding** (`embeddings/bge_encoder.py`):
- Uses BGE-M3 model (`Xorbits/bge-m3`)
- Singleton pattern with thread-safe initialization
- Generates 1024-dimensional vectors with GPU/CUDA support
- Configurable caching to avoid recomputation
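
The thread-safe singleton mentioned above can be sketched with double-checked locking. This is an illustrative pattern only: `EncoderSingleton` and the `loader` callable are hypothetical names, and the actual code in `embeddings/bge_encoder.py` may be organized differently.

```python
import threading


class EncoderSingleton:
    """Lazily creates one shared encoder instance, safe under concurrent first use."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, loader):
        # Fast path: skip the lock once the instance exists.
        if cls._instance is None:
            with cls._lock:
                # Re-check inside the lock: another thread may have created it meanwhile.
                if cls._instance is None:
                    cls._instance = loader()
        return cls._instance
```

Here `loader` stands in for whatever loads the model weights, so the expensive load runs at most once per process.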

**Image Embedding** (`embeddings/clip_encoder.py`):
- Uses CN-CLIP model (ViT-H-14)
- Downloads and preprocesses images from URLs
- Supports both local and remote image processing
- Generates 1024-dimensional vectors

### Search and Ranking

**Hybrid Search Approach**:
- Combines traditional BM25 text relevance with dense vector similarity
- Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
- Configurable ranking expressions like: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`

**Boolean Search Support**:
- Full boolean logic with AND, OR, ANDNOT, RANK operators
- Parentheses for complex query structures
- Configurable operator precedence

**Faceted Search**:
- Terms and range faceting support
- Multi-dimensional filtering capabilities
- Configurable facet fields and aggregations

## Testing Infrastructure

**Test Framework**: pytest with async support

**Test Structure**:
- `tests/conftest.py`: Comprehensive test fixtures and configuration
- `tests/unit/`: Unit tests for individual components
- `tests/integration/`: Integration tests for system workflows
- Test markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.api`

**Test Data**:
- Tenant1: Mock data with 10,000 product records
- Tenant2: CSV-based test dataset
- Automated test data generation via `scripts/mock_data.sh`

**Key Test Fixtures** (from `conftest.py`):
- `sample_search_config`: Complete configuration for testing
- `mock_es_client`: Mocked Elasticsearch client
- `test_searcher`: Searcher instance with mock dependencies
- `temp_config_file`: Temporary YAML configuration for tests

## API Endpoints

**Main API** (FastAPI on port 6002):
- `POST /search/` - Text search with multi-language support
- `POST /image-search/` - Image search using CN-CLIP embeddings
- Health check and management endpoints
- Multi-tenant support via `X-Tenant-ID` header

**API Features**:
- Hybrid search combining text and vector similarity
- Configurable ranking and filtering
- Faceted search with aggregations
- Multi-language query processing and translation
- Real-time search with configurable result sizes

## Core System Architecture & Design

### Unified Multi-Tenant Index Structure

**Index Design Philosophy**: Single `search_products` index shared by all tenants with data isolation via `tenant_id` field.

**Key Benefits**:
- Resource efficiency and cost optimization
- Simplified maintenance and operations
- Better query performance through optimized sharding
- Easier cross-tenant analytics and monitoring

### Index Structure (from mappings/search_products.json)

**Core Document Structure** (SPU-level):
```json
{
  "tenant_id": "keyword",           // Multi-tenant isolation
  "spu_id": "keyword",              // Product identifier
  "title_zh/en": "text",            // Multi-language titles
  "brief_zh/en": "text",            // Short descriptions
  "description_zh/en": "text",      // Detailed descriptions
  "vendor_zh/en": "text",           // Supplier/brand with keyword subfield
  "category_path_zh/en": "text",    // Hierarchical category paths
  "category_name_zh/en": "text",    // Category names for search
  "category1/2/3_name": "keyword",  // Multi-level category filtering
  "tags": "keyword",                // Product tags
  "specifications": "nested",       // Product variants (color, size, etc.)
  "skus": "nested",                 // Detailed SKU information
  "min_price/max_price": "float",   // Price range calculations
  "title_embedding": "dense_vector", // 1024-dim semantic vectors
  "image_embedding": "nested",      // Image vectors for visual search
  "total_inventory": "long"         // Aggregate inventory
}
```

**Analyzers Configuration**:
- **Chinese fields**: `hanlp_index` (indexing) / `hanlp_standard` (searching)
- **English fields**: `english` analyzer
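- **BM25 Similarity**: Custom parameters (b=0.0, k1=0.0); `b=0` turns off document-length normalization and `k1=0` removes the term-frequency contribution, so a matching term scores on IDF alone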
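
To see what `b=0.0, k1=0.0` does to scoring, the sketch below evaluates a Lucene-style BM25 term score. It is a stand-alone illustration of the formula, not code from this repository, and the function and parameter names are made up for the example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, docs_with_term, k1=1.2, b=0.75):
    """Score one term in one document with a Lucene-style BM25 formula."""
    if tf == 0:
        return 0.0
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: equals 1.0 for every document when b == 0.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Term-frequency saturation: equals 1.0 for any tf > 0 when k1 == 0.
    return idf * (tf / (tf + k1 * length_norm))


# With k1 = 0 and b = 0, a short document with one hit and a long document
# with seven hits receive the same score for the term.
short_doc = bm25_term_score(1, 10, 50, 1000, 10, k1=0.0, b=0.0)
long_doc = bm25_term_score(7, 400, 50, 1000, 10, k1=0.0, b=0.0)
```

For short product titles this kind of setting keeps repeated keywords and title length from influencing the text score.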

### Data Source Architecture (from 索引字段说明v2-参考表结构.md)

**Primary MySQL Tables**:

**shoplazza_product_spu** (Product Level):
```sql
- id, shop_id, shoplazza_id, handle
- title, brief, description, vendor
- category, category_id, category_level, category_path
- image_src, image_width, image_height
- tags, fake_sales, published
- inventory_policy, inventory_quantity
- seo_title, seo_description, seo_keywords
- tenant_id, create_time, update_time
```

**shoplazza_product_sku** (Variant Level):
```sql
- id, spu_id, shop_id, shoplazza_id
- title, sku, barcode
- price, compare_at_price, cost_price
- option1, option2, option3 (variant values)
- inventory_quantity, weight, weight_unit
- image_src, wholesale_price, extend
- tenant_id, create_time, update_time
```

**Data Transformation Pipeline**:
1. **SPU-Centric Aggregation**: Group SKUs under parent SPU
2. **Multi-Language Field Mapping**: MySQL → ES bilingual fields
3. **Category Path Parsing**: Extract hierarchical categories
4. **Specifications Building**: Create nested variant structures
5. **Price Range Calculation**: min/max across all SKUs
6. **Vector Generation**: BGE-M3 embeddings for titles
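
Steps 1 and 5 of the pipeline above (SPU-centric aggregation and price range calculation) can be sketched as follows. The row shapes use column names from the two MySQL tables, but the function itself is a hypothetical illustration and not the actual `DataTransformer`.

```python
from collections import defaultdict


def aggregate_spu_documents(spu_rows, sku_rows):
    """Group SKU rows under their parent SPU and derive price range and inventory."""
    skus_by_spu = defaultdict(list)
    for sku in sku_rows:
        skus_by_spu[sku["spu_id"]].append(sku)

    documents = []
    for spu in spu_rows:
        skus = skus_by_spu.get(spu["id"], [])
        prices = [float(s["price"]) for s in skus if s.get("price") is not None]
        documents.append({
            "tenant_id": str(spu["tenant_id"]),
            "spu_id": str(spu["id"]),
            "skus": skus,  # nested SKU structure
            "min_price": min(prices) if prices else None,
            "max_price": max(prices) if prices else None,
            "total_inventory": sum(int(s.get("inventory_quantity") or 0) for s in skus),
        })
    return documents
```

An SPU without SKUs still produces a document, with empty `skus` and no price range.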

### Advanced Search Configuration (from config/config.yaml)

**Field Boost Configuration**:
```yaml
field_boosts:
  title_zh/en: 3.0              # Highest priority
  brief_zh/en: 1.5              # Medium priority
  description_zh/en: 1.0        # Lower priority
  vendor_zh/en: 1.5             # Brand emphasis
  category_path_zh/en: 1.5      # Category relevance
  tags: 1.0                     # Tag matching
```

**Search Domains**:
- **default**: Comprehensive search across all text fields
- **title**: Title-focused search (boost: 2.0)
- **vendor**: Brand-specific search (boost: 1.5)
- **category**: Category-focused search (boost: 1.5)
- **tags**: Tag-based search (boost: 1.0)

**Query Processing Features**:
```yaml
query_config:
  supported_languages: ["zh", "en", "ru"]
  enable_translation: true        # DeepL API integration
  enable_text_embedding: true     # BGE-M3 vector search
  enable_query_rewrite: true      # Dictionary-based expansion
  embedding_disable_thresholds:
    chinese_char_limit: 4        # Short query optimization
    english_word_limit: 3
```
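
The `embedding_disable_thresholds` settings suggest that vector search is skipped for short queries. One possible reading is sketched below; it assumes that a query at or below the limit disables embeddings, which the configuration itself does not state, and the function name is hypothetical.

```python
import re

# Basic CJK Unified Ideographs block; enough for a rough Chinese-character count.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def should_use_text_embedding(query, chinese_char_limit=4, english_word_limit=3):
    """Return False for queries short enough that keyword matching alone is assumed sufficient."""
    chinese_chars = len(CJK_CHAR.findall(query))
    if chinese_chars > 0:
        # Assumption: queries with at most `chinese_char_limit` Chinese characters skip embeddings.
        return chinese_chars > chinese_char_limit
    # Assumption: queries with at most `english_word_limit` whitespace-separated words skip embeddings.
    return len(query.split()) > english_word_limit
```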

**Ranking Formula**:
```
bm25() + 0.2*text_embedding_relevance()
```

### Sophisticated Query Processing Pipeline

**Multi-Language Search Architecture**:
1. **Query Normalization**: Clean and standardize input
2. **Language Detection**: Automatic identification (zh/en/ru)
3. **Query Rewriting**: Dictionary-based expansion and synonyms
4. **Translation Service**: DeepL API for cross-language search
5. **Vector Generation**: BGE-M3 embeddings for semantic search
6. **Boolean Parsing**: Complex expression evaluation

**Boolean Expression Support**:
- **Operators**: AND, OR, ANDNOT, RANK, parentheses
- **Precedence**: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
- **Example**: `玩具 AND (乐高 OR 芭比) ANDNOT 电动` (toys AND (Lego OR Barbie) ANDNOT electric)
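
The precedence rule above can be illustrated with a small recursive-descent parser. This is a sketch under simplifying assumptions (single-token terms, left-associative operators, nested tuples as output), not the project's `QueryParser`.

```python
import re

# Operators ordered from loosest to tightest binding, per the stated precedence.
PRECEDENCE = ["RANK", "OR", "AND", "ANDNOT"]


def tokenize(query):
    # Parentheses are their own tokens; everything else splits on whitespace.
    return re.findall(r"[()]|[^\s()]+", query)


def parse(query):
    """Parse a boolean query into nested (operator, left, right) tuples."""
    tokens = tokenize(query)
    pos = 0

    def parse_level(level):
        nonlocal pos
        if level == len(PRECEDENCE):
            return parse_atom()
        node = parse_level(level + 1)
        while pos < len(tokens) and tokens[pos] == PRECEDENCE[level]:
            pos += 1
            node = (PRECEDENCE[level], node, parse_level(level + 1))
        return node

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = parse_level(0)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if token == ")" or token in PRECEDENCE:
            raise ValueError(f"unexpected token {token!r}")
        return token

    tree = parse_level(0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree
```

Because `ANDNOT` binds tighter than `AND`, the example query groups as `玩具 AND ((乐高 OR 芭比) ANDNOT 电动)`.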

### E-Commerce Specialized Features

**Specifications System** (Product Variants):
```json
"specifications": [
  {
    "sku_id": "sku_123",
    "name": "color",
    "value": "white"
  },
  {
    "sku_id": "sku_123",
    "name": "size",
    "value": "256GB"
  }
]
```

**Advanced Filtering Logic**:
- **Different dimensions** (different `name`): AND relationship
- **Same dimension** (same `name`): OR relationship
- **Example**: `(color=white OR color=black) AND size=256GB`
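
The AND/OR rule above maps naturally onto Elasticsearch nested queries: one nested clause per dimension, with that dimension's values OR-ed inside it. The sketch below assumes `specifications.name` and `specifications.value` are keyword fields; the function name is hypothetical.

```python
from collections import defaultdict


def build_specifications_filter(specs):
    """Build a bool filter: OR within one dimension name, AND across different names."""
    values_by_name = defaultdict(list)
    for spec in specs:
        values_by_name[spec["name"]].append(spec["value"])

    clauses = []
    for name, values in values_by_name.items():
        clauses.append({
            "nested": {
                "path": "specifications",
                "query": {
                    "bool": {
                        "filter": [
                            {"term": {"specifications.name": name}},
                            # `terms` matches any listed value (OR within a dimension).
                            {"terms": {"specifications.value": values}},
                        ]
                    }
                },
            }
        })
    # Every nested clause must match (AND across dimensions).
    return {"bool": {"filter": clauses}}
```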

**Faceted Search Capabilities**:
- **Category Faceting**: Multi-level category aggregations
- **Specifications Faceting**: Nested aggregations by variant name
- **Range Faceting**: Price ranges, date ranges
- **Multi-Select Support**: Disjunctive faceting for filters

**SKU Filtering System**:
- **Dimension-based Grouping**: Filter by `option1/2/3` or specification names
- **Application Layer**: Performance-optimized filtering outside ES
- **Use Case**: Display one SKU per variant combination (e.g., one per color)
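
A minimal sketch of the application-layer SKU filtering described above: keep the first SKU seen for each distinct combination of the requested dimensions. The flat SKU dictionaries keyed by dimension name are an assumed shape for illustration only.

```python
def pick_one_sku_per_combination(skus, dimensions):
    """Keep the first SKU for each distinct value combination of `dimensions`."""
    seen = set()
    kept = []
    for sku in skus:
        # A missing dimension counts as None, so such SKUs group together.
        key = tuple(sku.get(dim) for dim in dimensions)
        if key not in seen:
            seen.add(key)
            kept.append(sku)
    return kept
```

With `dimensions=["color"]`, a product with three sizes in each of two colors would show two SKUs, one per color.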

### API Architecture & Usage (from 搜索API对接指南.md)

**Core API Endpoints**:
```
POST /search/                    # Main text search
POST /search/image              # Image search (CN-CLIP)
GET /search/{doc_id}            # Document retrieval
GET /admin/health              # Health check
GET /admin/config              # Configuration info
GET /admin/stats               # Index statistics
```

**Request Structure**:
```json
{
  "query": "string (required)",
  "size": 10, "from": 0,
  "language": "zh|en",
  "filters": {}, "range_filters": {},
  "facets": [],
  "sort_by": "price|sales|create_time",
  "sort_order": "asc|desc",
  "sku_filter_dimension": ["color", "size"],
  "min_score": 0.0,
  "debug": false
}
```
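
As a usage sketch for the request structure above, the helper below builds a `POST /search/` request with the `X-Tenant-ID` header using only the standard library. The helper name and the default base URL are assumptions for illustration; sending the request requires the backend to be running.

```python
import json
import urllib.request


def build_search_request(query, tenant_id, base_url="http://localhost:6002", **options):
    """Build (but do not send) a text-search request for one tenant."""
    payload = {"query": query, **options}
    return urllib.request.Request(
        url=f"{base_url}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": str(tenant_id)},
        method="POST",
    )


# `from` is a Python keyword, so pass it through dict unpacking.
request = build_search_request("toy", 1, size=10, **{"from": 0})
# To send it: urllib.request.urlopen(request)
```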

**Advanced Filter Examples**:

**Specifications Filtering**:
```json
{
  "filters": {
    "specifications": {
      "name": "color",
      "value": "white"
    }
  }
}
```

**Multi-Dimension Specifications**:
```json
{
  "filters": {
    "specifications": [
      {"name": "color", "value": "white"},
      {"name": "size", "value": "256GB"}
    ]
  }
}
```

**Range Filtering**:
```json
{
  "range_filters": {
    "min_price": {"gte": 50, "lte": 200},
    "create_time": {"gte": "2024-01-01T00:00:00Z"}
  }
}
```
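
A `range_filters` object like the one above translates almost directly into Elasticsearch `range` clauses. The sketch below shows one way to do that translation with basic validation; it is illustrative and not taken from the codebase.

```python
ALLOWED_BOUNDS = {"gt", "gte", "lt", "lte"}


def build_range_clauses(range_filters):
    """Turn {"field": {"gte": ..., "lte": ...}} into a list of ES range clauses."""
    clauses = []
    for field, bounds in range_filters.items():
        unknown = set(bounds) - ALLOWED_BOUNDS
        if unknown:
            raise ValueError(f"unsupported range operators for {field}: {sorted(unknown)}")
        clauses.append({"range": {field: dict(bounds)}})
    return clauses
```

The resulting clauses would typically go into the `filter` section of a `bool` query so they do not affect scoring.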

**Faceted Search Configuration**:
```json
{
  "facets": [
    {"field": "category1_name", "size": 15, "type": "terms"},
    {"field": "specifications.color", "size": 20, "type": "terms"},
    {"field": "min_price", "type": "range", "ranges": [...]}
  ]
}
```

### Multi-Select Faceting (NEW FEATURE)

**Standard Mode** (`multi_select: false`):
- Behavior: Selected value becomes the only option shown
- Use Case: Hierarchical category navigation
- Example: Toys → Dolls → Barbie

**Multi-Select Mode** (`multi_select: true`) ⭐:
- Behavior: All options remain visible after selection
- Use Case: Colors, brands, sizes (switchable attributes)
- Example: Select "red" but still see "blue", "green", etc.

**Recommended Configuration**:
| Facet Type | Multi-Select | Reason |
|-----------|-------------|---------|
| Color | `true` | Users need to switch colors |
| Brand | `true` | Users need to compare brands |
| Size | `true` | Users need to check other sizes |
| Category | `false` | Hierarchical navigation |
| Price Range | `false` | Mutually exclusive ranges |
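
The difference between the two modes comes down to whether a facet's own selection is applied when counting its options. The in-memory sketch below illustrates that idea with plain dictionaries; in Elasticsearch the same effect is usually achieved with `post_filter` and filtered aggregations, and the function here is hypothetical.

```python
from collections import Counter


def facet_counts(products, selected, facet_field, multi_select):
    """Count values of `facet_field` among products matching the current selections."""
    counts = Counter()
    for product in products:
        matches = True
        for field, values in selected.items():
            if multi_select and field == facet_field:
                # Ignore this facet's own selection so its other options stay visible.
                continue
            if product.get(field) not in values:
                matches = False
                break
        if matches:
            counts[product.get(facet_field)] += 1
    return dict(counts)
```

With "red" selected, standard mode reports only red for the color facet, while multi-select mode still reports counts for the other colors.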

### Production Features & Operations

**Comprehensive Error Handling**:
- Graceful degradation for model failures
- Fallback mechanisms for service unavailability
- Detailed error logging and context tracking

**Performance Optimizations**:
- Embedding caching to avoid redundant computations
- Adaptive vector search (disabled for short queries)
- Batch processing for data indexing
- Connection pooling for database operations

**Security & Multi-Tenancy**:
- Strict tenant isolation via `tenant_id` filtering
- Rate limiting with SlowAPI integration
- CORS and security headers middleware
- Request context logging for auditing

**Monitoring & Observability**:
- Structured logging with request tracing
- Health check endpoints for all dependencies
- Performance metrics and timing information
- Index statistics and document counts

### Data Model Insights

**Key Design Decisions**:

1. **SPU over SKU Indexing**: Each ES document represents a product (SPU) with nested SKUs
   - Reduces index size and improves search performance
   - Maintains variant information through nested structures

2. **Bilingual Field Strategy**: Separate `*_zh` and `*_en` fields
   - Enables language-specific analyzer configuration
   - Provides fallback mechanisms for missing translations

3. **Nested vs Flat Design**: Strategic use of nested vs flattened fields
   - `specifications` and `skus`: Nested for complex queries
   - `min_price`, `total_inventory`: Flattened for filtering/sorting

4. **Vector Field Isolation**: Embedding fields only used for search
   - Not returned in API responses (excluded from the response payload; `index: false` only controls whether a field is searchable, not whether it is returned)
   - Reduces network payload and improves response times

### AI/ML Integration Details

**Text Embedding Pipeline**:
- **Model**: BGE-M3 (`Xorbits/bge-m3`)
- **Dimensions**: 1024-dimensional vectors
- **Hardware**: GPU/CUDA acceleration with CPU fallback
- **Caching**: Redis-based caching for common queries
- **Usage**: Semantic search combined with BM25 relevance

**Image Search Pipeline**:
- **Model**: CN-CLIP (ViT-H-14)
- **Processing**: URL download → preprocessing → vectorization
- **Storage**: Nested structure with vector + original URL
- **Application**: Visual similarity search for products

**Translation Integration**:
- **Service**: DeepL API with configurable auth
- **Languages**: Chinese ↔ English ↔ Russian support
- **Caching**: Translation result caching
- **Fallback**: Mock mode returns original text if API unavailable
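
The caching and fallback behavior listed above can be sketched as a thin wrapper around any translation callable. `translate_fn` and the dictionary cache are placeholders for illustration; the real integration with DeepL and its cache backend may look different.

```python
def translate_with_fallback(text, target_lang, translate_fn, cache):
    """Translate with a cache; fall back to the original text if translation fails."""
    key = (text, target_lang)
    if key in cache:
        return cache[key]
    try:
        translated = translate_fn(text, target_lang)
    except Exception:
        # Degrade gracefully: return the input unchanged and do not cache the failure.
        return text
    cache[key] = translated
    return translated
```

Failures are deliberately not cached, so a later call can succeed once the service is reachable again.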

## Development & Deployment

**Command Quick Reference**:
```bash
# Core Services
./run.sh                    # Start all services
./scripts/start_backend.sh  # Backend only (port 6002)
./scripts/start_frontend.sh # Frontend UI (port 6003)

# Data Operations
./scripts/ingest.sh <tenant_id> [recreate]  # Index data
./scripts/mock_data.sh                    # Generate test data

# Testing
python -m pytest tests/    # Full test suite
python main.py search "query" --tenant-id 1  # Quick search test
```

**Key Files for Configuration**:
- `config/config.yaml`: Search behavior configuration
- `mappings/search_products.json`: ES index structure
- `.env`: Environment variables and secrets
- `api/models.py`: Pydantic request/response models

**Common Development Tasks**:
1. **Modifying Search Behavior**: Edit `config/config.yaml`
2. **Changing Index Structure**: Update `mappings/search_products.json`
3. **Adding New Filters**: Extend `api/models.py` with new Pydantic models
4. **Updating Ranking**: Modify `ranking.expression` in config
5. **Testing Queries**: Use frontend UI at http://localhost:6003

## Key Implementation Details

1. **Environment Variables**: All sensitive configuration in `.env` (template: `.env.example`)
2. **Configuration Management**: Centralized YAML config with validation
3. **Error Handling**: Comprehensive exception handling with proper HTTP status codes
4. **Performance**: Batch processing, embedding caching, connection pooling
5. **Logging**: Structured logging with request tracing and context
6. **Security**: Tenant isolation, rate limiting, CORS, security headers
7. **API Documentation**: Auto-generated FastAPI docs at `/docs`
8. **Multi-tenant Architecture**: Single index with `tenant_id` isolation
9. **Hybrid Search**: BM25 + vector similarity with configurable weighting
10. **Production Ready**: Health checks, monitoring, graceful degradation