CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a Search Engine SaaS project for e-commerce product search, designed for Shoplazza (店匠) independent sites. The system provides Elasticsearch-based product search capabilities with multi-language support, text/image embeddings, and configurable ranking.

Tech Stack:

  • Elasticsearch as the search engine backend
  • MySQL (Shoplazza database) as the primary data source
  • Python for data processing and ingestion
  • BGE-M3 model for text embeddings (1024-dim vectors)
  • CN-CLIP (ViT-H-14) for image embeddings

Database Configuration

Shoplazza Production Database:

host: 120.79.247.228
port: 3316
database: saas
username: saas
password: P89cZHS5d7dFyc9R

Main Tables:

  • shoplazza_product_sku - SKU level product data
  • shoplazza_product_spu - SPU level product data

Architecture

Data Flow

  1. Data Source (MySQL) → Main tables (shoplazza_product_sku, shoplazza_product_spu) + customer extension tables
  2. Indexer → Reads from MySQL, applies transformations (embeddings, etc.), writes to Elasticsearch
  3. Query Parser → Query rewriting, translation, text embedding conversion
  4. Searcher → Executes searches against Elasticsearch with configurable ranking

Multi-Tenant Design

Each customer has their own extension table to store custom attributes, multi-language fields (titles, brand names, tags, categories), and business-specific metadata. The main SKU table is joined with customer extension tables during indexing.
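
The join described above might look like the following minimal sketch. It uses an in-memory SQLite database in place of MySQL so that it runs anywhere. The extension table name (customer1_sku_ext) and every column name are hypothetical, since the real schemas are not listed in this file.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL source; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shoplazza_product_sku (id INTEGER PRIMARY KEY, spu_id INTEGER, title TEXT);
CREATE TABLE customer1_sku_ext (sku_id INTEGER PRIMARY KEY, title_en TEXT, brand_name TEXT);
INSERT INTO shoplazza_product_sku VALUES (1, 10, '红色连衣裙'), (2, 11, '蓝色牛仔裤');
INSERT INTO customer1_sku_ext VALUES (1, 'Red dress', 'Acme');
""")

# Hypothetical allow-list: table names cannot be bound as SQL parameters,
# so the extension table is validated before being interpolated.
KNOWN_EXTENSION_TABLES = {"customer1_sku_ext"}


def fetch_rows_for_indexing(conn, ext_table):
    """Join the main SKU table with one customer's extension table."""
    if ext_table not in KNOWN_EXTENSION_TABLES:
        raise ValueError(f"unknown extension table: {ext_table}")
    sql = (
        "SELECT s.id, s.title, e.title_en, e.brand_name "
        "FROM shoplazza_product_sku AS s "
        f"LEFT JOIN {ext_table} AS e ON e.sku_id = s.id "
        "ORDER BY s.id"
    )
    return conn.execute(sql).fetchall()


rows = fetch_rows_for_indexing(conn, "customer1_sku_ext")
```

A LEFT JOIN is used here so that SKUs without an extension row are still indexed, with the extension fields left empty. Whether that is the desired behavior is a design decision for each customer.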

Configuration System

The system uses two types of configurations per customer:

  1. Application Structure Config (IndexerConfig) - Defines:

    • Input field mappings from MySQL to Elasticsearch
    • Field types (TEXT, EMBEDDING, LITERAL, INT, DOUBLE, etc.)
    • Which fields require preprocessing (embeddings, transformations)
  2. Index Structure Config - Defines:

    • Elasticsearch field mappings and analyzers
    • Supported analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
    • Query domain definitions (default, category_name, title, brand_name, etc.)
    • BM25 parameters and similarity configurations
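
As an illustration of how the two configurations could relate, the sketch below models a per-customer IndexerConfig with dataclasses and derives an Elasticsearch mapping from it. The class layout, attribute names, and example fields are assumptions made for illustration and are not the project's actual schema. The similarity values are the defaults stated under Key Implementation Notes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FieldType(Enum):
    # Each member's value is the Elasticsearch type it is assumed to map to.
    TEXT = "text"
    EMBEDDING = "dense_vector"
    LITERAL = "keyword"
    INT = "integer"
    DOUBLE = "double"


@dataclass
class FieldConfig:
    source_column: str               # column in MySQL (hypothetical names below)
    es_field: str                    # target field in the ES index
    field_type: FieldType
    analyzer: Optional[str] = None   # only meaningful for TEXT fields
    dims: Optional[int] = None       # only meaningful for EMBEDDING fields


@dataclass
class IndexerConfig:
    customer: str
    fields: List[FieldConfig] = field(default_factory=list)

    def embedding_fields(self):
        """Fields that need preprocessing by an embedding model."""
        return [f for f in self.fields if f.field_type is FieldType.EMBEDDING]

    def to_es_mapping(self):
        """Build the 'mappings' body for index creation."""
        properties = {}
        for f in self.fields:
            spec = {"type": f.field_type.value}
            if f.field_type is FieldType.TEXT and f.analyzer:
                spec["analyzer"] = f.analyzer
            if f.field_type is FieldType.EMBEDDING:
                spec["dims"] = f.dims
            properties[f.es_field] = spec
        return {"properties": properties}


# Index-level default similarity, using the b and k1 values given in this file.
SIMILARITY_SETTINGS = {
    "index": {"similarity": {"default": {"type": "BM25", "b": 0.0, "k1": 0.0}}}
}

config = IndexerConfig(
    customer="customer1",
    fields=[
        FieldConfig("title_en", "title_en", FieldType.TEXT, analyzer="english"),
        FieldConfig("brand_name", "brand_name", FieldType.LITERAL),
        FieldConfig("title", "title_embedding", FieldType.EMBEDDING, dims=1024),
    ],
)
mapping = config.to_es_mapping()
```

The resulting mapping and SIMILARITY_SETTINGS would be passed as the mappings and settings of an index-creation request.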

Reference files:

  • 商品数据源入ES配置规范.md - ES mapping and analyzer configuration standards
  • 阿里opensearch电商行业.md - Application and index structure examples from Alibaba OpenSearch

Query Processing

The queryParser performs:

  1. Query Rewriting - Dictionary-based rewriting (brand terms, category terms, synonyms, corrections)
  2. Translation - Language detection and translation to support multi-language search (e.g., zh↔en)
  3. Text Embedding - Converts query text to vectors when vector search is enabled
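
A minimal sketch of the first two steps follows, assuming a flat token-level rewrite dictionary and a crude script-based language check. The dictionary entries and function names are hypothetical. A real implementation would likely use a proper language-detection library and a translation service, neither of which is shown here.

```python
import re

# Hypothetical rewrite dictionary covering synonyms and spelling corrections.
REWRITE_DICT = {
    "sneekers": "sneakers",   # correction
    "tshirt": "t-shirt",      # normalization
    "tee": "t-shirt",         # synonym
}


def rewrite_query(query, rewrite_dict=REWRITE_DICT):
    """Lowercase the query and replace each whitespace token found in the dictionary."""
    tokens = query.lower().split()
    return " ".join(rewrite_dict.get(token, token) for token in tokens)


def detect_language(text):
    """Crude heuristic: any CJK unified ideograph means 'zh', otherwise 'en'."""
    return "zh" if re.search(r"[\u4e00-\u9fff]", text) else "en"


def parse_query(query):
    """Return the rewritten query together with its detected language."""
    rewritten = rewrite_query(query)
    return {"query": rewritten, "language": detect_language(rewritten)}
```

For example, parse_query("Red Sneekers") yields the rewritten query "red sneakers" tagged as "en", which a later step could translate to Chinese before searching.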

Search and Ranking

The searcher supports:

  • Boolean operators: AND, OR, RANK, ANDNOT with parentheses
  • Operator precedence: () > ANDNOT > AND > OR > RANK
  • Configurable ranking expressions for the default domain:
    • Example: static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
    • Combines BM25 text relevance, embedding similarity, product scores, and time decay
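
To make the precedence rule concrete, here is a small recursive-descent parser that turns a boolean query into a nested tuple tree. It is a sketch under the assumption that all four operators are binary and left-associative. The tuple representation and function names are illustrative, not the searcher's actual internals.

```python
import re

# Binary operators ordered from loosest to tightest binding:
# () > ANDNOT > AND > OR > RANK
OPERATORS = ["RANK", "OR", "AND", "ANDNOT"]


def tokenize(query):
    """Split into parentheses and whitespace-delimited words."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def parse(query):
    """Parse a query into nested (operator, left, right) tuples."""
    tokens = tokenize(query)
    pos = 0

    def parse_level(level):
        nonlocal pos
        if level == len(OPERATORS):
            return parse_primary()
        node = parse_level(level + 1)
        # Left-associative: fold repeated operators of the same level.
        while pos < len(tokens) and tokens[pos] == OPERATORS[level]:
            pos += 1
            right = parse_level(level + 1)
            node = (OPERATORS[level], node, right)
        return node

    def parse_primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = parse_level(0)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if token == ")" or token in OPERATORS:
            raise ValueError(f"unexpected token: {token}")
        return token

    tree = parse_level(0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return tree
```

With this grammar, parse("a OR b AND c ANDNOT d") groups as a OR (b AND (c ANDNOT d)), and parentheses override the default grouping.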

Embedding Modules

Text Embedding - Uses BGE-M3 model (Xorbits/bge-m3):

  • Singleton pattern with thread-safe initialization
  • Generates 1024-dimensional vectors
  • Configured for GPU/CUDA acceleration
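
The singleton with thread-safe initialization could be sketched as below, using double-checked locking. The class name and the _load_model placeholder are hypothetical. In the real module the loader would construct the BGE-M3 model, which is omitted here so the example runs without a GPU or model download.

```python
import threading


class TextEncoder:
    """Process-wide singleton so the (expensive) model is loaded only once."""

    _instance = None
    _lock = threading.Lock()
    load_count = 0  # for demonstration: how many times the model was loaded

    def __init__(self):
        self.model = self._load_model()

    @classmethod
    def instance(cls):
        # Fast path: skip the lock once the instance exists.
        if cls._instance is None:
            with cls._lock:
                # Second check: another thread may have won the race.
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    @classmethod
    def _load_model(cls):
        # Placeholder for loading the embedding model (assumption).
        cls.load_count += 1
        return object()
```

Callers would obtain the encoder with TextEncoder.instance() rather than constructing it directly, so that concurrent requests share one loaded model.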

Image Embedding - Uses CN-CLIP model (ViT-H-14):

  • Downloads and validates images from URLs
  • Preprocesses images (resize, RGB conversion)
  • Generates 1024-dimensional vectors
  • Supports both local and remote images

Test Data

Customer1 Test Dataset:

  • Location: data/customer1/goods_with_pic.5years_congku.csv.shuf.1w
  • Contains 10,000 shuffled product records with images
  • Processing script: data/customer1/task2_process_goods.py
    • Extracts product data from MySQL
    • Maps images from filebank database
    • Creates inverted index (URL → SKU list)
    • Configurable year range via --years parameter
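
The inverted index built by the processing script maps each image URL to the SKUs that use it. A minimal sketch of that structure, assuming the input is a sequence of (sku_id, image_url) pairs (the actual record layout in task2_process_goods.py is not shown in this file):

```python
from collections import defaultdict


def build_url_to_skus(records):
    """Group SKU ids by image URL, keeping first-seen order and skipping duplicates."""
    index = defaultdict(list)
    for sku_id, image_url in records:
        if not image_url:
            continue  # records without an image are ignored
        if sku_id not in index[image_url]:
            index[image_url].append(sku_id)
    return dict(index)


# Made-up sample records for illustration.
sample_records = [
    ("sku-1", "https://example.com/a.jpg"),
    ("sku-2", "https://example.com/a.jpg"),
    ("sku-3", "https://example.com/b.jpg"),
    ("sku-1", "https://example.com/a.jpg"),
    ("sku-4", ""),
]
url_to_skus = build_url_to_skus(sample_records)
```

Indexing by URL means an image shared by several SKUs only needs to be downloaded and embedded once.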

Key Implementation Notes

  1. Data Sync: Full data sync from MySQL to Elasticsearch is handled by a separate Java project (not in this repo). This repo may include a simple full-load implementation for testing purposes.

  2. Extension Tables: When designing customer configurations, determine which fields exist in the main SKU table vs. which need to be added to customer-specific extension tables.

  3. Embedding Caching: For periodic full indexing, embedding results should be cached to avoid recomputation.

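  4. ES Similarity Configuration: All text fields use a modified BM25 with b=0.0 and k1=0.0 as the default similarity. With these values, field-length normalization is disabled and term frequency no longer affects the score, so a match is scored by IDF alone.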

  5. Multi-Language Support: The system targets cross-border e-commerce. Chinese and English are supported at minimum, and the design is extensible to other languages (Arabic, Spanish, Russian, Japanese).

  6. Environment: Remember that this project runs in the searchengine conda environment. Activate it with: source /home/tw/miniconda3/etc/profile.d/conda.sh && conda activate searchengine
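
For note 3 (Embedding Caching), one possible approach is a cache keyed by a hash of the input text and persisted to disk between full-index runs. The sketch below assumes a JSON file as the store and a caller-supplied embed_fn. Both are illustrative choices, and fake_embed stands in for the real model.

```python
import hashlib
import json
import os
import tempfile


class EmbeddingCache:
    """Disk-backed cache mapping sha256(text) to its embedding vector."""

    def __init__(self, path):
        self.path = path
        self._store = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self._store = json.load(f)

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, embed_fn):
        """Return the cached vector, calling embed_fn only on a cache miss."""
        key = self._key(text)
        if key not in self._store:
            self._store[key] = list(embed_fn(text))
        return self._store[key]

    def save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self._store, f)


calls = []  # records every real computation, to show the cache is working


def fake_embed(text):
    # Stand-in for the embedding model (assumption): a tiny deterministic vector.
    calls.append(text)
    return [float(len(text)), 0.0]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
cache = EmbeddingCache(cache_path)
first = cache.get_or_compute("red dress", fake_embed)
second = cache.get_or_compute("red dress", fake_embed)   # served from memory
cache.save()
reloaded = EmbeddingCache(cache_path).get_or_compute("red dress", fake_embed)  # served from disk
```

Keying on the text content rather than the SKU id means an unchanged title is never re-embedded, while an edited title is recomputed automatically. For 1024-dimensional vectors at scale, a binary format would be more compact than JSON.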