CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a search engine SaaS platform for e-commerce product search, designed for Shoplazza (店匠) independent sites. It is a multi-tenant, configurable search system built on Elasticsearch, with embedding-based text and image search (BGE-M3 and CN-CLIP).

Tech Stack:

  • Elasticsearch 8.x as the search engine backend
  • MySQL (Shoplazza database) as the primary data source
  • Python 3.10 with PyTorch/CUDA support
  • BGE-M3 model for text embeddings (1024-dim vectors)
  • CN-CLIP (ViT-H-14) for image embeddings
  • FastAPI for REST API layer

Development Environment

Required Environment Setup:

source /home/tw/miniconda3/etc/profile.d/conda.sh
conda activate searchengine

Database Configuration:

host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R

Common Development Commands

Environment Setup

# Complete environment setup
./setup.sh

# Install Python dependencies
pip install -r requirements.txt

Data Management

# Generate test data (Tenant1 Mock + Tenant2 CSV)
./scripts/mock_data.sh

# Ingest data into Elasticsearch
./scripts/ingest.sh <tenant_id> [recreate]  # e.g., ./scripts/ingest.sh 1 true
python main.py ingest data.csv --limit 1000 --batch-size 50

Running Services

# Start all services (production)
./run.sh

# Start development server with auto-reload
./scripts/start_backend.sh
python main.py serve --host 0.0.0.0 --port 6002 --reload

# Start frontend debugging UI
./scripts/start_frontend.sh

Testing

# Run all tests
python -m pytest tests/

# Run specific test types
python -m pytest tests/unit/          # Unit tests
python -m pytest tests/integration/   # Integration tests
python -m pytest -m "api"             # API tests only

# Test search from command line
python main.py search "query" --tenant-id 1 --size 10

Development Utilities

# Stop all services
./scripts/stop.sh

# Test environment (for CI/development)
./scripts/start_test_environment.sh
./scripts/stop_test_environment.sh

# Install server dependencies
./scripts/install_server_deps.sh

Architecture Overview

Core Components

/data/tw/SearchEngine/
├── api/              # FastAPI REST API service (port 6002)
├── config/           # Configuration management system
├── indexer/          # MySQL → Elasticsearch data pipeline
├── search/           # Search engine and ranking logic
├── query/            # Query parsing, translation, rewriting
├── embeddings/       # ML models (BGE-M3, CN-CLIP)
├── scripts/          # Automation and utility scripts
├── utils/            # Shared utilities (ES client, etc.)
├── frontend/         # Simple debugging UI
├── mappings/         # Elasticsearch index mappings
└── tests/            # Unit and integration tests

Data Flow Architecture

Pipeline: MySQL → Indexer → Elasticsearch → API → Frontend

  1. Data Source Layer:

    • Shoplazza MySQL database with shoplazza_product_sku and shoplazza_product_spu tables
    • Tenant-specific extension tables for custom attributes and multi-language fields
  2. Indexing Layer (indexer/):

    • Reads from MySQL, applies transformations with embeddings
    • Uses DataTransformer and IndexingPipeline for batch processing
    • Supports both full and incremental indexing with embedding caching
  3. Query Processing Layer (query/):

    • QueryParser: Handles query rewriting, translation, and conversion of query text to embeddings
    • Multi-language support with automatic detection and translation
    • Boolean logic parsing with operator precedence: () > ANDNOT > AND > OR > RANK
  4. Search Engine Layer (search/):

    • Searcher: Executes hybrid searches combining BM25 and dense vectors
    • Configurable ranking expressions with function_score support
    • Multi-tenant isolation via tenant_id field
  5. API Layer (api/):

    • FastAPI service on port 6002 with multi-tenant support
    • Text search: POST /search/
    • Image search: POST /image-search/
    • Tenant identification via X-Tenant-ID header
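
The batch processing mentioned for the indexing layer can be pictured with a minimal sketch. The `batched` helper and the sample rows are hypothetical; the actual `IndexingPipeline` interface is not shown in this file, so treat this as an illustration of the idea only.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical rows standing in for records read from MySQL.
rows = ({"spu_id": i} for i in range(120))
sizes = [len(batch) for batch in batched(rows, 50)]
```

Because the helper consumes an iterator lazily, only one batch is held in memory at a time, which is the usual reason for batching when embeddings are computed per record.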

Multi-Tenant Configuration System

The system uses centralized configuration through config/config.yaml:

  1. Field Configuration (config/field_types.py):

    • Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
    • Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
    • Required fields and preprocessing rules
  2. Index Configuration (mappings/search_products.json):

    • Unified index structure shared by all tenants
    • Elasticsearch field mappings and analyzer configurations
    • BM25 similarity with modified parameters (b=0.0, k1=0.0)
  3. Query Configuration (search/query_config.py):

    • Query domain definitions (default, category_name, title, brand_name, etc.)
    • Ranking expressions and function_score configurations
    • Translation and embedding settings
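
To see what the modified BM25 parameters (b=0.0, k1=0.0) imply, here is a small sketch of the per-term BM25 score. It uses the textbook formula without the constant (k1 + 1) factor, which does not change ordering; the function name and the sample statistics are made up for illustration.

```python
import math


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    doc_count: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Score a single query term against a single document."""
    if tf <= 0:
        return 0.0
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # b controls length normalization, k1 controls term-frequency saturation.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + norm)


# Illustrative numbers: a short title with one match vs. a long one with seven.
short_doc = bm25_term_score(tf=1, doc_len=5, avg_doc_len=20,
                            doc_count=1000, doc_freq=10, k1=0.0, b=0.0)
long_doc = bm25_term_score(tf=7, doc_len=80, avg_doc_len=20,
                           doc_count=1000, doc_freq=10, k1=0.0, b=0.0)
```

With both parameters at zero, the term-frequency and length terms cancel out, so every matching document receives the same IDF-only score for that term. In other words, a match counts once regardless of how often the term repeats or how long the field is.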

Embedding Models

Text Embedding (embeddings/bge_encoder.py):

  • Uses BGE-M3 model (Xorbits/bge-m3)
  • Singleton pattern with thread-safe initialization
  • Generates 1024-dimensional vectors with GPU/CUDA support
  • Configurable caching to avoid recomputation
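
The singleton-with-caching pattern described above can be sketched as follows. `EncoderSingleton` is a hypothetical stand-in, not the class in embeddings/bge_encoder.py, and it returns placeholder vectors instead of loading BGE-M3.

```python
import threading
from typing import List, Optional


class EncoderSingleton:
    """Hypothetical stand-in for the text encoder; no real model is loaded."""

    _instance: Optional["EncoderSingleton"] = None
    _lock = threading.Lock()

    def __new__(cls) -> "EncoderSingleton":
        # Double-checked locking: only the first caller pays for initialization.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._cache = {}
                    instance.encode_calls = 0
                    cls._instance = instance
        return cls._instance

    def encode(self, text: str) -> List[float]:
        with self._lock:
            if text not in self._cache:
                self.encode_calls += 1
                self._cache[text] = self._fake_embed(text)
            return self._cache[text]

    @staticmethod
    def _fake_embed(text: str) -> List[float]:
        # Placeholder vector; a real encoder would return a 1024-dim embedding.
        return [float(len(text))] * 1024


first = EncoderSingleton()
second = EncoderSingleton()
vector = first.encode("red dress")
second.encode("red dress")  # served from the cache, no second computation
```

Sharing one instance matters most when the model lives on the GPU, since a second copy would double memory use.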

Image Embedding (embeddings/clip_encoder.py):

  • Uses CN-CLIP model (ViT-H-14)
  • Downloads and preprocesses images from URLs
  • Supports both local and remote image processing
  • Generates 1024-dimensional vectors

Search and Ranking

Hybrid Search Approach:

  • Combines traditional BM25 text relevance with dense vector similarity
  • Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
  • Configurable ranking expressions like: static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
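
One common way to express a hybrid request in Elasticsearch 8.x is a `query` clause for BM25 alongside a top-level `knn` clause for the dense vector. The sketch below only builds the request body. The field names `title` and `text_embedding` and the 0.2 vector boost are assumptions, and this is not necessarily how `Searcher` assembles its queries or applies its ranking expression.

```python
from typing import Any, Dict, List


def build_hybrid_query(tenant_id: str, text: str, vector: List[float],
                       size: int = 10, vector_boost: float = 0.2) -> Dict[str, Any]:
    """Build a search body that combines a BM25 match with a kNN clause."""
    # Both clauses carry the tenant filter so results never cross tenants.
    tenant_filter = {"term": {"tenant_id": tenant_id}}
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {"title": {"query": text}}}],  # assumed field
                "filter": [tenant_filter],
            }
        },
        "knn": {
            "field": "text_embedding",  # assumed field name
            "query_vector": vector,
            "k": size,
            "num_candidates": max(size * 10, 100),
            "filter": tenant_filter,
            "boost": vector_boost,
        },
    }


body = build_hybrid_query("1", "red dress", [0.0] * 1024)
```

The tenant filter is repeated inside `knn` because the vector clause is evaluated separately from `query`; filtering only one of them would let documents from other tenants reach the candidate set.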

Boolean Search Support:

  • Full boolean logic with AND, OR, ANDNOT, RANK operators
  • Parentheses for complex query structures
  • Configurable operator precedence
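
The precedence order listed for these operators, () > ANDNOT > AND > OR > RANK, can be illustrated with a small recursive-descent parser. This is a sketch and not the repository's `QueryParser`: it treats each whitespace-separated token as a single term, and the nested-tuple output format is an assumption.

```python
import re
from typing import List, Tuple, Union

Node = Union[str, Tuple[str, "Node", "Node"]]

# Lowest-binding operator first; parentheses bind tightest of all.
PRECEDENCE = ["RANK", "OR", "AND", "ANDNOT"]


def tokenize(query: str) -> List[str]:
    return re.findall(r"\(|\)|[^\s()]+", query)


def parse(query: str) -> Node:
    tokens = tokenize(query)
    pos = 0

    def parse_level(level: int) -> Node:
        nonlocal pos
        if level == len(PRECEDENCE):
            return parse_atom()
        left = parse_level(level + 1)
        # Left-associative chain of the operator owned by this level.
        while pos < len(tokens) and tokens[pos] == PRECEDENCE[level]:
            op = tokens[pos]
            pos += 1
            right = parse_level(level + 1)
            left = (op, left, right)
        return left

    def parse_atom() -> Node:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = parse_level(0)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if token == ")" or token in PRECEDENCE:
            raise ValueError(f"unexpected token: {token}")
        return token

    tree = parse_level(0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return tree


flat = parse("shoes OR boots AND leather ANDNOT kids")
grouped = parse("(shoes OR boots) AND leather")
```

In `flat`, ANDNOT groups first, then AND, then OR, so the tree reads as shoes OR (boots AND (leather ANDNOT kids)). In `grouped`, the parentheses override that order.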

Faceted Search:

  • Terms and range faceting support
  • Multi-dimensional filtering capabilities
  • Configurable facet fields and aggregations
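
Terms and range facets map onto Elasticsearch `terms` and `range` aggregations. The helper below is a hypothetical sketch of how such an `aggs` section might be assembled; the `price` field and the aggregation naming scheme are assumptions.

```python
from typing import Any, Dict, List, Tuple


def build_facets(terms_fields: List[str],
                 range_fields: Dict[str, List[Tuple[float, float]]],
                 size: int = 10) -> Dict[str, Any]:
    """Build the aggs section for terms facets and range facets."""
    aggs: Dict[str, Any] = {}
    for field in terms_fields:
        aggs[f"{field}_terms"] = {"terms": {"field": field, "size": size}}
    for field, buckets in range_fields.items():
        aggs[f"{field}_ranges"] = {
            "range": {
                "field": field,
                # Each bucket covers from (inclusive) up to to (exclusive).
                "ranges": [{"from": low, "to": high} for low, high in buckets],
            }
        }
    return aggs


facets = build_facets(
    terms_fields=["brand_name", "category_name"],
    range_fields={"price": [(0, 50), (50, 200)]},  # assumed numeric field
)
```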

Testing Infrastructure

Test Framework: pytest with async support

Test Structure:

  • tests/conftest.py: Comprehensive test fixtures and configuration
  • tests/unit/: Unit tests for individual components
  • tests/integration/: Integration tests for system workflows
  • Test markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.api

Test Data:

  • Tenant1: Mock data with 10,000 product records
  • Tenant2: CSV-based test dataset
  • Automated test data generation via scripts/mock_data.sh

Key Test Fixtures (from conftest.py):

  • sample_search_config: Complete configuration for testing
  • mock_es_client: Mocked Elasticsearch client
  • test_searcher: Searcher instance with mock dependencies
  • temp_config_file: Temporary YAML configuration for tests
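
A mocked Elasticsearch client like `mock_es_client` can be built with `unittest.mock`. The sketch below is an assumption about the general shape, not the fixture in tests/conftest.py; in a real conftest the factory would typically be wrapped in a `@pytest.fixture`, and the index name `search_products` is inferred from the mapping file name.

```python
from typing import Any, Dict, List
from unittest.mock import MagicMock


def make_mock_es_client(hits: List[Dict[str, Any]]) -> MagicMock:
    """Return a mock whose search() yields a canned Elasticsearch response."""
    client = MagicMock(name="Elasticsearch")
    client.search.return_value = {
        "hits": {"total": {"value": len(hits), "relation": "eq"}, "hits": hits}
    }
    return client


client = make_mock_es_client(
    [{"_id": "1", "_score": 1.0, "_source": {"title": "red dress"}}]
)
response = client.search(index="search_products", body={"query": {"match_all": {}}})
```

Tests can then assert both on the returned hits and on how the client was called, without a running cluster.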

API Endpoints

Main API (FastAPI on port 6002):

  • POST /search/ - Text search with multi-language support
  • POST /image-search/ - Image search using CN-CLIP embeddings
  • Health check and management endpoints
  • Multi-tenant support via X-Tenant-ID header
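
A text search call might be assembled like this. The endpoint, port and `X-Tenant-ID` header come from the list above; the request body fields `query` and `size` are assumptions modeled on the CLI flags, so check the FastAPI request models for the real schema.

```python
import json
from urllib.request import Request


def build_search_request(base_url: str, tenant_id: str, query: str,
                         size: int = 10) -> Request:
    """Prepare (but do not send) a POST /search/ request for one tenant."""
    payload = {"query": query, "size": size}  # assumed body fields
    return Request(
        url=f"{base_url}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant_id},
        method="POST",
    )


request = build_search_request("http://localhost:6002", "1", "red dress")
```

Passing the prepared object to `urllib.request.urlopen(request)` would send it to a running server.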

API Features:

  • Hybrid search combining text and vector similarity
  • Configurable ranking and filtering
  • Faceted search with aggregations
  • Multi-language query processing and translation
  • Real-time search with configurable result sizes

Key Implementation Details

  1. Environment Variables: All sensitive configuration stored in .env (template: .env.example)
  2. Configuration Management: Dynamic config loading through config/config_loader.py
  3. Error Handling: Comprehensive error handling with proper HTTP status codes
  4. Performance: Batch processing for indexing, embedding caching, and connection pooling
  5. Logging: Structured logging with request tracing for debugging
  6. Security: Tenant isolation via the tenant_id field within the shared index, with proper access controls
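
As an illustration of the first item, database settings could be read from environment variables along these lines. The variable names (`DB_HOST`, `DB_PORT`, and so on) are hypothetical; the names actually used are defined in .env.example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DatabaseSettings:
    host: str
    port: int
    database: str
    username: str
    password: str


def load_database_settings(env: Mapping[str, str] = os.environ) -> DatabaseSettings:
    """Read database settings from a mapping of environment variables."""
    required = ("DB_HOST", "DB_DATABASE", "DB_USERNAME", "DB_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return DatabaseSettings(
        host=env["DB_HOST"],
        port=int(env.get("DB_PORT", "3306")),  # MySQL's default port
        database=env["DB_DATABASE"],
        username=env["DB_USERNAME"],
        password=env["DB_PASSWORD"],
    )


# Example with an explicit mapping instead of the real process environment.
settings = load_database_settings({
    "DB_HOST": "db.example.internal",
    "DB_DATABASE": "saas",
    "DB_USERNAME": "app",
    "DB_PASSWORD": "change-me",
})
```

Failing fast on missing variables surfaces a misconfigured environment at startup rather than at the first query.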

Database Tables

Main Tables:

  • shoplazza_product_sku - SKU level product data with pricing and inventory
  • shoplazza_product_spu - SPU level product data with categories and attributes
  • Tenant extension tables for custom fields and multi-language content

Data Processing:

  • Full data sync is handled by a separate Java project (not in this repo)
  • This repository includes test implementations for development and debugging
  • Extension tables are joined with the main tables during indexing