CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This is a search engine SaaS platform for e-commerce product search, designed for Shoplazza (店匠) independent sites. It is a multi-tenant, configurable search system built on Elasticsearch, with embedding-based text and image search alongside BM25.

Tech Stack:


Elasticsearch 8.x as the search engine backend
MySQL (Shoplazza database) as the primary data source
Python 3.10 with PyTorch/CUDA support
BGE-M3 model for text embeddings (1024-dim vectors)
CN-CLIP (ViT-H-14) for image embeddings
FastAPI for REST API layer

Development Environment
Required Environment Setup:
source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine


Database Configuration:
host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R

Common Development Commands
Environment Setup
# Complete environment setup
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

Data Management
# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh

# Ingest data to Elasticsearch
./scripts/ingest.sh <tenant_id> [recreate]  # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50

Running Services
# Start all services (production)
./run.sh

# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload

# Start frontend debugging UI
./scripts/start_frontend.sh

Testing
# Run all tests
python -m pytest tests/

# Run specific test types
python -m pytest tests/unit/          # Unit tests
python -m pytest tests/integration/   # Integration tests
python -m pytest -m "api"             # API tests only

# Test search from command line
python main.py search "query" --tenant-id 1 --size 10

Development Utilities
# Stop all services
./scripts/stop.sh

# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh

# Install server dependencies
./scripts/install_server_deps.sh

Architecture Overview
Core Components
/data/tw/SearchEngine/
├── api/              # FastAPI REST API service (port 6002)
├── config/           # Configuration management system
├── indexer/          # MySQL → Elasticsearch data pipeline
├── search/           # Search engine and ranking logic
├── query/            # Query parsing, translation, rewriting
├── embeddings/       # ML models (BGE-M3, CN-CLIP)
├── scripts/          # Automation and utility scripts
├── utils/            # Shared utilities (ES client, etc.)
├── frontend/         # Simple debugging UI
├── mappings/         # Elasticsearch index mappings
└── tests/            # Unit and integration tests

Data Flow Architecture
Pipeline: MySQL → Indexer → Elasticsearch → API → Frontend


Data Source Layer:


Shoplazza MySQL database with shoplazza_product_sku and shoplazza_product_spu tables
Tenant-specific extension tables for custom attributes and multi-language fields

Indexing Layer (indexer/):


Reads from MySQL, applies transformations with embeddings
Uses DataTransformer and IndexingPipeline for batch processing
Supports both full and incremental indexing with embedding caching
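
A minimal sketch of batch indexing with embedding caching, as described above. It is not the repository's DataTransformer or IndexingPipeline: the function names, the "title" and "title_embedding" keys, and the embed and bulk_index callables are all hypothetical stand-ins.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_in_batches(records, embed, bulk_index, batch_size=50, cache=None):
    """Embed each record's title (hypothetical key) once, then index batch by batch."""
    cache = {} if cache is None else cache
    indexed = 0
    for batch in batched(records, batch_size):
        docs = []
        for record in batch:
            title = record["title"]
            # Reuse a cached vector so repeated titles are not re-embedded.
            if title not in cache:
                cache[title] = embed(title)
            docs.append({**record, "title_embedding": cache[title]})
        bulk_index(docs)  # hypothetical callable, e.g. a wrapper around a bulk request
        indexed += len(docs)
    return indexed
```

Passing the same cache dictionary across runs is one way an incremental run could skip recomputing embeddings for unchanged text.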

Query Processing Layer (query/):


QueryParser: Handles query rewriting, translation, and conversion of query text to embeddings
Multi-language support with automatic detection and translation
Boolean logic parsing with operator precedence: () > ANDNOT > AND > OR > RANK
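
The precedence order above can be illustrated with a small precedence-climbing parser. This is a sketch, not the QueryParser implementation: it assumes operators are uppercase keywords separated by whitespace, treats every operator as left-associative, and does not handle multi-word terms or implicit operators between adjacent terms.

```python
import re

# Binding strength per operator, from the documented order
# () > ANDNOT > AND > OR > RANK (a higher number binds tighter).
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def tokenize(query):
    """Split a query into parentheses, operators, and plain terms."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def parse(tokens, min_prec=1):
    """Consume tokens in place and return a nested (op, left, right) tuple."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Requiring a strictly higher precedence on the right makes
        # operators of equal precedence group to the left.
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = (op, left, right)
    return left


def parse_atom(tokens):
    """Parse a single term or a parenthesized sub-expression."""
    if not tokens:
        raise ValueError("unexpected end of query")
    token = tokens.pop(0)
    if token == "(":
        node = parse(tokens, 1)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing closing parenthesis")
        return node
    if token == ")" or token in PRECEDENCE:
        raise ValueError(f"unexpected token: {token}")
    return token


def parse_query(query):
    """Parse a whole query string and reject trailing tokens."""
    tokens = tokenize(query)
    ast = parse(tokens)
    if tokens:
        raise ValueError(f"unexpected token: {tokens[0]}")
    return ast
```

For example, parse_query("shoes OR boots AND red ANDNOT used") groups ANDNOT first, then AND, then OR, while parentheses override that order.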

Search Engine Layer (search/):


Searcher: Executes hybrid searches combining BM25 and dense vectors
Configurable ranking expressions with function_score support
Multi-tenant isolation via tenant_id field
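
A sketch of what a hybrid request body with tenant isolation could look like. It uses the top-level knn option of the Elasticsearch 8.x search API rather than the function_score setup this project configures, and the field names "title" and "title_embedding" and the 0.2 vector weight are assumptions for illustration only.

```python
def build_hybrid_query(tenant_id, query_text, query_vector, size=10):
    """Build a search body that mixes BM25 matching with kNN vector search."""
    # The same tenant filter is applied to both the lexical and the vector side,
    # so neither can return another tenant's documents.
    tenant_filter = {"term": {"tenant_id": tenant_id}}
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"title": query_text}}],  # hypothetical field
                "filter": [tenant_filter],
            }
        },
        "knn": {
            "field": "title_embedding",  # hypothetical dense_vector field
            "query_vector": query_vector,
            "k": size,
            "num_candidates": max(100, size * 10),
            "filter": tenant_filter,
            "boost": 0.2,  # illustrative weight for the vector score
        },
    }
```

The returned dictionary is plain data, so it can be inspected in unit tests without a running cluster.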

API Layer (api/):


FastAPI service on port 6002 with multi-tenant support
Text search: POST /search/
Image search: POST /image-search/
Tenant identification via X-Tenant-ID header
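
A sketch of calling the text search endpoint with the tenant header, using only the standard library. The JSON body fields "query" and "size" and the localhost base URL are assumptions; check the request schema in api/ for the actual field names.

```python
import json
import urllib.request


def build_search_request(base_url, tenant_id, query, size=10):
    """Build a POST /search/ request that carries the tenant in X-Tenant-ID."""
    payload = json.dumps({"query": query, "size": size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search/",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Tenant-ID": str(tenant_id),
        },
    )


request = build_search_request("http://localhost:6002", 1, "red shoes")

# Sending the request needs a running API server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```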


Multi-Tenant Configuration System
The system uses centralized configuration through config/config.yaml:


Field Configuration (config/field_types.py):


Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
Required fields and preprocessing rules

Index Configuration (mappings/search_products.json):


Unified index structure shared by all tenants
Elasticsearch field mappings and analyzer configurations
BM25 similarity with modified parameters (b=0.0, k1=0.0)
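
To see what b=0.0 and k1=0.0 imply, the sketch below evaluates the BM25 per-term formula in its Lucene-style form. Constant factors differ between Lucene versions, so treat it as an illustration of the parameter effect rather than an exact reproduction of Elasticsearch scores.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score one term in one document with a Lucene-style BM25 formula."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls how strongly document length normalizes the term frequency.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls term-frequency saturation; with k1 = 0 this factor is 1 for any tf > 0.
    tf_factor = tf / (tf + k1 * length_norm)
    return idf * tf_factor


# With k1 = 0 and b = 0, term frequency and document length drop out:
once = bm25_term_score(1, 20, 50, 10_000, 100, k1=0.0, b=0.0)
often = bm25_term_score(50, 400, 50, 10_000, 100, k1=0.0, b=0.0)
```

Under these settings each matching term contributes only its IDF, so a title that repeats a keyword scores no higher than one that mentions it once.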

Query Configuration (search/query_config.py):


Query domain definitions (default, category_name, title, brand_name, etc.)
Ranking expressions and function_score configurations
Translation and embedding settings


Embedding Models
Text Embedding (embeddings/bge_encoder.py):


Uses BGE-M3 model (Xorbits/bge-m3)
Singleton pattern with thread-safe initialization
Generates 1024-dimensional vectors with GPU/CUDA support
Configurable caching to avoid recomputation
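
A sketch of the singleton-with-cache pattern described above. The class name, the fake model, and the hash-based placeholder embedding are hypothetical; the real encoder in embeddings/bge_encoder.py loads BGE-M3 and returns 1024-dimensional vectors.

```python
import hashlib
import threading


class TextEncoder:
    """Hypothetical stand-in for a process-wide text encoder."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking: take the lock only while the instance is missing.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._cache = {}
                    instance._cache_lock = threading.Lock()
                    instance._model = instance._load_model()
                    cls._instance = instance
        return cls._instance

    def _load_model(self):
        # Placeholder: a real encoder would load the model here, once per process.
        return "fake-model"

    def _embed(self, text):
        # Placeholder embedding derived from a hash, not a real vector model.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[:8]]

    def encode(self, text):
        """Return the cached vector for text, computing it on first use."""
        with self._cache_lock:
            cached = self._cache.get(text)
        if cached is not None:
            return cached
        vector = self._embed(text)
        with self._cache_lock:
            # setdefault keeps the first stored vector if two threads raced.
            return self._cache.setdefault(text, vector)
```

The embedding itself is computed outside the cache lock so that one slow encode call does not block cache lookups from other threads.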


Image Embedding (embeddings/clip_encoder.py):


Uses CN-CLIP model (ViT-H-14)
Downloads and preprocesses images from URLs
Supports both local and remote image processing
Generates 1024-dimensional vectors

Search and Ranking
Hybrid Search Approach:


Combines traditional BM25 text relevance with dense vector similarity
Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
Configurable ranking expressions like: static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
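
The example expression above is a weighted sum, which the sketch below evaluates in plain Python. The definition of timeliness here (1.0 while end_time is in the future, otherwise 0.0) is an assumption made for illustration; the project's actual function may behave differently.

```python
from datetime import datetime, timezone


def timeliness(end_time, now):
    """Hypothetical definition: 1.0 while the item is still active, else 0.0."""
    return 1.0 if end_time > now else 0.0


def rank_score(static_bm25, text_embedding_relevance, general_score, end_time, now):
    """Evaluate static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)."""
    return (
        static_bm25
        + 0.2 * text_embedding_relevance
        + general_score * 2
        + timeliness(end_time, now)
    )


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
active = rank_score(3.0, 0.5, 1.0, datetime(2024, 6, 1, tzinfo=timezone.utc), now)
expired = rank_score(3.0, 0.5, 1.0, datetime(2023, 6, 1, tzinfo=timezone.utc), now)
```

With these inputs the two scores differ only by the timeliness term, which shows how a single component of the expression can be tuned in isolation.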


Boolean Search Support:


Full boolean logic with AND, OR, ANDNOT, RANK operators
Parentheses for complex query structures
Configurable operator precedence


Faceted Search:


Terms and range faceting support
Multi-dimensional filtering capabilities
Configurable facet fields and aggregations
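
A sketch of how terms and range facets map onto Elasticsearch aggregations. The helper name and the aggregation naming scheme are hypothetical, and the field names in the usage line are examples rather than the project's configured facet fields.

```python
def build_facet_aggs(terms_fields, range_fields, terms_size=10):
    """Build an "aggs" section with one terms or range aggregation per field."""
    aggs = {}
    for field in terms_fields:
        aggs[f"{field}_terms"] = {"terms": {"field": field, "size": terms_size}}
    for field, bounds in range_fields.items():
        ranges = []
        for low, high in bounds:
            # Open-ended buckets omit "from" or "to".
            bucket = {}
            if low is not None:
                bucket["from"] = low
            if high is not None:
                bucket["to"] = high
            ranges.append(bucket)
        aggs[f"{field}_ranges"] = {"range": {"field": field, "ranges": ranges}}
    return aggs


aggs = build_facet_aggs(
    terms_fields=["brand_name"],
    range_fields={"price": [(None, 50), (50, 200), (200, None)]},
)
```

The result can be placed under the "aggs" key of a search body next to the query and tenant filter.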

Testing Infrastructure
Test Framework: pytest with async support

Test Structure:


tests/conftest.py: Comprehensive test fixtures and configuration
tests/unit/: Unit tests for individual components
tests/integration/: Integration tests for system workflows
Test markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.api


Test Data:


Tenant1: Mock data with 10,000 product records
Tenant2: CSV-based test dataset
Automated test data generation via scripts/mock_data.sh


Key Test Fixtures (from conftest.py):


sample_search_config: Complete configuration for testing
mock_es_client: Mocked Elasticsearch client
test_searcher: Searcher instance with mock dependencies
temp_config_file: Temporary YAML configuration for tests

API Endpoints
Main API (FastAPI on port 6002):


POST /search/ - Text search with multi-language support
POST /image-search/ - Image search using CN-CLIP embeddings
Health check and management endpoints
Multi-tenant support via X-Tenant-ID header


API Features:


Hybrid search combining text and vector similarity
Configurable ranking and filtering
Faceted search with aggregations
Multi-language query processing and translation
Real-time search with configurable result sizes

Key Implementation Details

Environment Variables: All sensitive configuration stored in .env (template: .env.example)
Configuration Management: Dynamic config loading through config/config_loader.py
Error Handling: Comprehensive error handling with proper HTTP status codes
Performance: Batch processing for indexing, embedding caching, and connection pooling
Logging: Structured logging with request tracing for debugging
Security: Tenant isolation via the tenant_id field within the shared index, with proper access controls

Database Tables
Main Tables:


shoplazza_product_sku - SKU level product data with pricing and inventory
shoplazza_product_spu - SPU level product data with categories and attributes
Tenant extension tables for custom fields and multi-language content


Data Processing:


Full data sync handled by separate Java project (not in this repo)
This repository includes test implementations for development and debugging
Extension tables joined with main tables during indexing process