# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a **Search Engine SaaS** platform for e-commerce product search, designed specifically for Shoplazza (店匠) independent sites. It is a multi-tenant, configurable search system built on Elasticsearch, with embedding-based text and image search.
**Tech Stack:**
- Elasticsearch 8.x as the search engine backend
- MySQL (Shoplazza database) as the primary data source
- Python 3.10 with PyTorch/CUDA support
- BGE-M3 model for text embeddings (1024-dim vectors)
- CN-CLIP (ViT-H-14) for image embeddings
- FastAPI for REST API layer
## Development Environment
**Required Environment Setup:**
```bash
source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine
```
**Database Configuration:**
```
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R
```
## Common Development Commands
### Environment Setup
```bash
# Complete environment setup
./setup.sh
# Install Python dependencies
pip install -r requirements.txt
```
### Data Management
```bash
# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh
# Ingest data to Elasticsearch
./scripts/ingest.sh <tenant_id> [recreate]  # e.g., ./scripts/ingest.sh 1 true
# Ingest a CSV file through the CLI
python main.py ingest data.csv --limit 1000 --batch-size 50
```
### Running Services
```bash
# Start all services (production)
./run.sh
# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload
# Start frontend debugging UI
./scripts/start_frontend.sh
```
### Testing
```bash
# Run all tests
python -m pytest tests/
# Run specific test types
python -m pytest tests/unit/          # Unit tests
python -m pytest tests/integration/   # Integration tests
python -m pytest -m "api"             # API tests only
# Test search from command line
python main.py search "query" --tenant-id 1 --size 10
```
### Development Utilities
```bash
# Stop all services
./scripts/stop.sh
# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh
# Install server dependencies
./scripts/install_server_deps.sh
```
## Architecture Overview
### Core Components
```
/data/tw/SearchEngine/
├── api/              # FastAPI REST API service (port 6002)
├── config/           # Configuration management system
├── indexer/          # MySQL → Elasticsearch data pipeline
├── search/           # Search engine and ranking logic
├── query/            # Query parsing, translation, rewriting
├── embeddings/       # ML models (BGE-M3, CN-CLIP)
├── scripts/          # Automation and utility scripts
├── utils/            # Shared utilities (ES client, etc.)
├── frontend/         # Simple debugging UI
├── mappings/         # Elasticsearch index mappings
└── tests/            # Unit and integration tests
```
### Data Flow Architecture
**Pipeline**: MySQL → Indexer → Elasticsearch → API → Frontend
1. **Data Source Layer**:
   - Shoplazza MySQL database with `shoplazza_product_sku` and `shoplazza_product_spu` tables
   - Tenant-specific extension tables for custom attributes and multi-language fields
2. **Indexing Layer** (`indexer/`):
   - Reads from MySQL, applies transformations, and generates embeddings
   - Uses `DataTransformer` and `IndexingPipeline` for batch processing
   - Supports both full and incremental indexing with embedding caching
3. **Query Processing Layer** (`query/`):
   - `QueryParser`: Handles query rewriting, translation, and conversion of query text to embeddings
   - Multi-language support with automatic detection and translation
   - Boolean logic parsing with operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
4. **Search Engine Layer** (`search/`):
   - `Searcher`: Executes hybrid searches combining BM25 and dense vectors
   - Configurable ranking expressions with function_score support
   - Multi-tenant isolation via `tenant_id` field
5. **API Layer** (`api/`):
   - FastAPI service on port 6002 with multi-tenant support
   - Text search: `POST /search/`
   - Image search: `POST /image-search/`
   - Tenant identification via `X-Tenant-ID` header
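To make the Search Engine Layer more concrete, here is a minimal sketch of an Elasticsearch 8.x request body that pairs a BM25 text clause with a kNN vector clause, both restricted by `tenant_id`. The field names `title` and `text_embedding`, the `k`/`num_candidates` values, and the helper name `build_hybrid_query` are illustrative assumptions, not the repository's actual `Searcher` implementation.

```python
from typing import Any


def build_hybrid_query(tenant_id: str, text: str, vector: list[float], size: int = 10) -> dict[str, Any]:
    """Build an Elasticsearch 8.x search body (field names are hypothetical)."""
    # The same tenant filter is applied to both the lexical and the vector clause.
    tenant_filter = {"term": {"tenant_id": tenant_id}}
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [tenant_filter],
            }
        },
        "knn": {
            "field": "text_embedding",
            "query_vector": vector,
            "k": size,
            "num_candidates": max(100, size * 10),
            "filter": tenant_filter,
        },
    }


body = build_hybrid_query("1", "red dress", [0.0] * 1024)
```

Filtering inside the `knn` clause as well as the `bool` query keeps another tenant's documents out of the vector candidates, not just out of the lexical matches.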
### Multi-Tenant Configuration System
The system uses centralized configuration through `config/config.yaml`:
1. **Field Configuration** (`config/field_types.py`):
   - Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
   - Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
   - Required fields and preprocessing rules
2. **Index Configuration** (`mappings/search_products.json`):
   - Unified index structure shared by all tenants
   - Elasticsearch field mappings and analyzer configurations
   - BM25 similarity with modified parameters (`b=0.0, k1=0.0`)
3. **Query Configuration** (`search/query_config.py`):
   - Query domain definitions (default, category_name, title, brand_name, etc.)
   - Ranking expressions and function_score configurations
   - Translation and embedding settings
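The effect of the modified BM25 parameters can be checked with a short calculation. The sketch below uses the textbook BM25 term formula (Lucene's implementation may differ in constant factors) and made-up input values. With `k1=0.0` the term-frequency and length-normalization terms cancel, so every matching term contributes only its IDF.

```python
def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    idf: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 contribution of one matching term (tf must be > 0)."""
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


idf = 2.0  # illustrative value
# Configured parameters: frequency and document length no longer matter.
short_doc = bm25_term_score(tf=1, doc_len=5, avg_doc_len=20, idf=idf, k1=0.0, b=0.0)
long_doc = bm25_term_score(tf=9, doc_len=200, avg_doc_len=20, idf=idf, k1=0.0, b=0.0)
# Default parameters, for comparison: the same two documents score differently.
short_default = bm25_term_score(tf=1, doc_len=5, avg_doc_len=20, idf=idf)
long_default = bm25_term_score(tf=9, doc_len=200, avg_doc_len=20, idf=idf)
```

Under these assumptions, a product title that repeats a keyword gains nothing over one that mentions it once, which is often the desired behavior for short e-commerce fields.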
### Embedding Models
**Text Embedding** (`embeddings/bge_encoder.py`):
- Uses BGE-M3 model (`Xorbits/bge-m3`)
- Singleton pattern with thread-safe initialization
- Generates 1024-dimensional vectors with GPU/CUDA support
- Configurable caching to avoid recomputation
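The singleton with thread-safe initialization could look roughly like the sketch below. `TextEncoder` is a hypothetical stand-in that loads no real model; the actual class in `embeddings/bge_encoder.py` may be structured differently.

```python
import threading


class TextEncoder:
    """Hypothetical stand-in for the BGE-M3 encoder; loads no real model."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking: take the lock only while no instance exists.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance.model_name = "Xorbits/bge-m3"
                    instance.dim = 1024
                    cls._instance = instance
        return cls._instance


def _grab(results: list, i: int) -> None:
    results[i] = TextEncoder()


# Construct the encoder from several threads at once.
results = [None] * 8
threads = [threading.Thread(target=_grab, args=(results, i)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every thread ends up with the same object, so an expensive model load would happen only once per process.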
**Image Embedding** (`embeddings/clip_encoder.py`):
- Uses CN-CLIP model (ViT-H-14)
- Downloads and preprocesses images from URLs
- Supports both local and remote image processing
- Generates 1024-dimensional vectors
### Search and Ranking
**Hybrid Search Approach**:
- Combines traditional BM25 text relevance with dense vector similarity
- Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
- Configurable ranking expressions like: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`
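As a worked example of the ranking expression above, the sketch below treats each term as an already-computed number and simply combines them. The function name `rank_score` and the input values are hypothetical; how `static_bm25()`, `text_embedding_relevance()`, and `timeliness(end_time)` are actually evaluated is not shown here.

```python
def rank_score(bm25: float, text_relevance: float, general_score: float, timeliness: float) -> float:
    """Combine component scores with the weights from the example expression."""
    return bm25 + 0.2 * text_relevance + general_score * 2 + timeliness


# Illustrative inputs: 3.0 + 0.2 * 0.5 + 1.0 * 2 + 0.25
score = rank_score(bm25=3.0, text_relevance=0.5, general_score=1.0, timeliness=0.25)
```

With these weights the embedding similarity acts as a small tie-breaker on top of BM25, while `general_score` is counted twice.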
**Boolean Search Support**:
- Full boolean logic with AND, OR, ANDNOT, RANK operators
- Parentheses for complex query structures
- Configurable operator precedence
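One way to honor the documented precedence (`()` > `ANDNOT` > `AND` > `OR` > `RANK`) is a precedence-climbing parser. The sketch below is an illustration, not the repository's parser: it assumes uppercase operators, single-word terms, well-formed parentheses, and left-associative operators.

```python
import re

# Higher number binds tighter; order taken from the documented precedence.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def tokenize(query: str) -> list[str]:
    """Split a query into parentheses, operators, and plain terms."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def parse(tokens: list[str], min_prec: int = 1):
    """Precedence-climbing parser; returns nested (op, left, right) tuples."""
    token = tokens.pop(0)
    if token == "(":
        left = parse(tokens, 1)
        tokens.pop(0)  # discard the closing ")"
    else:
        left = token
    # Fold in operators that bind at least as tightly as min_prec.
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


tree = parse(tokenize("shoes AND red OR boots ANDNOT kids"))
grouped = parse(tokenize("(a OR b) AND c"))
```

In the first query `ANDNOT` groups `boots` and `kids` before `OR` joins the two halves; in the second, the parentheses override the default and `OR` is evaluated before `AND`.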
**Faceted Search**:
- Terms and range faceting support
- Multi-dimensional filtering capabilities
- Configurable facet fields and aggregations
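Terms and range facets map onto Elasticsearch `terms` and `range` aggregations. The following sketch builds such an `aggs` section; the helper name `build_facets`, the `price` field, the bucket size of 10, and the use of `category_name` and `brand_name` as aggregatable fields are assumptions made for illustration.

```python
from typing import Any


def build_facets(terms_fields: list[str], range_field: str, bounds: list[float]) -> dict[str, Any]:
    """Build an Elasticsearch `aggs` section with terms and range facets."""
    aggs: dict[str, Any] = {
        f"{field}_terms": {"terms": {"field": field, "size": 10}}
        for field in terms_fields
    }
    # Turn sorted boundaries such as [50, 100] into: <50, 50-100, >=100.
    edges = [None, *bounds, None]
    ranges = []
    for low, high in zip(edges, edges[1:]):
        bucket = {}
        if low is not None:
            bucket["from"] = low  # inclusive lower bound
        if high is not None:
            bucket["to"] = high  # exclusive upper bound
        ranges.append(bucket)
    aggs[f"{range_field}_ranges"] = {"range": {"field": range_field, "ranges": ranges}}
    return aggs


aggs = build_facets(["category_name", "brand_name"], "price", [50, 100])
```

The resulting dictionary can be placed under the `aggs` key of a search body alongside the query.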
## Testing Infrastructure
**Test Framework**: pytest with async support
**Test Structure**:
- `tests/conftest.py`: Comprehensive test fixtures and configuration
- `tests/unit/`: Unit tests for individual components
- `tests/integration/`: Integration tests for system workflows
- Test markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.api`
**Test Data**:
- Tenant1: Mock data with 10,000 product records
- Tenant2: CSV-based test dataset
- Automated test data generation via `scripts/mock_data.sh`
**Key Test Fixtures** (from `conftest.py`):
- `sample_search_config`: Complete configuration for testing
- `mock_es_client`: Mocked Elasticsearch client
- `test_searcher`: Searcher instance with mock dependencies
- `temp_config_file`: Temporary YAML configuration for tests
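A test built on a mocked Elasticsearch client might look like the sketch below. It creates its own `MagicMock` instead of relying on the real `mock_es_client` fixture, whose exact behavior is not documented here; `extract_ids` and the index name `search_products` are hypothetical.

```python
from unittest.mock import MagicMock


def extract_ids(es_client, index: str, body: dict) -> list[str]:
    """Hypothetical helper: run a search and return the hit IDs."""
    response = es_client.search(index=index, body=body)
    return [hit["_id"] for hit in response["hits"]["hits"]]


def test_extract_ids_returns_hit_ids():
    # Stand-in for the `mock_es_client` fixture; no cluster is contacted.
    mock_es_client = MagicMock()
    mock_es_client.search.return_value = {
        "hits": {"total": {"value": 2}, "hits": [{"_id": "sku-1"}, {"_id": "sku-2"}]}
    }
    ids = extract_ids(mock_es_client, "search_products", {"query": {"match_all": {}}})
    assert ids == ["sku-1", "sku-2"]
    mock_es_client.search.assert_called_once()
```

In the repository's test suite such a function would typically take the fixture as a parameter and carry a marker such as `@pytest.mark.unit`.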
## API Endpoints
**Main API** (FastAPI on port 6002):
- `POST /search/` - Text search with multi-language support
- `POST /image-search/` - Image search using CN-CLIP embeddings
- Health check and management endpoints
- Multi-tenant support via `X-Tenant-ID` header
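A client call to the text search endpoint could be assembled as in the sketch below. Only the path, port, and `X-Tenant-ID` header come from this document; the JSON fields `query` and `size` are an assumed request schema, and `build_search_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_search_request(base_url: str, tenant_id: str, query: str, size: int = 10) -> urllib.request.Request:
    """Build (but do not send) a POST /search/ request."""
    payload = json.dumps({"query": query, "size": size}).encode("utf-8")  # assumed schema
    return urllib.request.Request(
        url=f"{base_url}/search/",
        data=payload,
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant_id},
        method="POST",
    )


request = build_search_request("http://localhost:6002", "1", "red dress")
# To send it against a running server: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the tenant header easy to inspect in tests without a running server.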
**API Features**:
- Hybrid search combining text and vector similarity
- Configurable ranking and filtering
- Faceted search with aggregations
- Multi-language query processing and translation
- Real-time search with configurable result sizes
## Key Implementation Details
1. **Environment Variables**: All sensitive configuration stored in `.env` (template: `.env.example`)
2. **Configuration Management**: Dynamic config loading through `config/config_loader.py`
3. **Error Handling**: Comprehensive error handling with proper HTTP status codes
4. **Performance**: Batch processing for indexing, embedding caching, and connection pooling
5. **Logging**: Structured logging with request tracing for debugging
6. **Security**: Tenant isolation via the `tenant_id` field within the shared index, along with access controls
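Reading sensitive settings from the environment rather than from source files could be sketched as follows. The variable names (`DB_HOST`, `DB_PORT`, and so on) and the `DbSettings` class are hypothetical; the sketch also assumes the contents of `.env` have already been exported into the process environment.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DbSettings:
    host: str
    port: int
    database: str
    username: str
    password: str


def load_db_settings(env=os.environ) -> DbSettings:
    """Read database settings from the environment (variable names are hypothetical)."""
    return DbSettings(
        host=env["DB_HOST"],
        port=int(env.get("DB_PORT", "3306")),
        database=env["DB_DATABASE"],
        username=env["DB_USERNAME"],
        password=env["DB_PASSWORD"],  # KeyError if unset; secrets get no default
    )


# Example with an explicit mapping instead of the real environment.
settings = load_db_settings({
    "DB_HOST": "localhost",
    "DB_DATABASE": "example",
    "DB_USERNAME": "example_user",
    "DB_PASSWORD": "example_password",
})
```

Failing fast on a missing secret surfaces a misconfigured deployment at startup instead of at the first database call.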
## Database Tables
**Main Tables**:
- `shoplazza_product_sku` - SKU level product data with pricing and inventory
- `shoplazza_product_spu` - SPU level product data with categories and attributes
- Tenant extension tables for custom fields and multi-language content
**Data Processing**:
- Full data sync is handled by a separate Java project (not in this repo)
- This repository includes test implementations for development and debugging
- Extension tables are joined with the main tables during indexing