  # CLAUDE.md
  
  This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
  
  ## Project Overview
  
  This is a **Search Engine SaaS** project for e-commerce product search, designed for Shoplazza (店匠) independent sites. The system provides Elasticsearch-based product search capabilities with multi-language support, text/image embeddings, and configurable ranking.
  
  **Tech Stack:**
  - Elasticsearch as the search engine backend
  - MySQL (Shoplazza database) as the primary data source
  - Python for data processing and ingestion
  - BGE-M3 model for text embeddings (1024-dim vectors)
  - CN-CLIP (ViT-H-14) for image embeddings
  
  ## Database Configuration
  
  **Shoplazza Production Database:**
  ```
  host: 120.79.247.228
  port: 3316
  database: saas
  username: saas
  password: P89cZHS5d7dFyc9R
  ```
  
  **Main Tables:**
  - `shoplazza_product_sku` - SKU level product data
  - `shoplazza_product_spu` - SPU level product data
  
  ## Architecture
  
  ### Data Flow
  1. **Data Source (MySQL)** → Main tables (`shoplazza_product_sku`, `shoplazza_product_spu`) + customer extension tables
  2. **Indexer** → Reads from MySQL, applies transformations (embeddings, etc.), writes to Elasticsearch
  3. **Query Parser** → Query rewriting, translation, text embedding conversion
  4. **Searcher** → Executes searches against Elasticsearch with configurable ranking
  
  ### Multi-Tenant Design
  Each customer has their own extension table to store custom attributes, multi-language fields (titles, brand names, tags, categories), and business-specific metadata. The main SKU table is joined with customer extension tables during indexing.
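
As a rough illustration of the join described above, the sketch below uses an in-memory SQLite database as a stand-in for MySQL. The extension table name `customer1_sku_ext` and all column names are hypothetical; only `shoplazza_product_sku` comes from this document.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL source; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE shoplazza_product_sku (sku_id INTEGER PRIMARY KEY, title TEXT);
-- Hypothetical per-customer extension table holding multi-language fields.
CREATE TABLE customer1_sku_ext (sku_id INTEGER PRIMARY KEY, title_en TEXT, brand_name TEXT);
INSERT INTO shoplazza_product_sku VALUES (1, '运动鞋'), (2, '背包');
INSERT INTO customer1_sku_ext VALUES (1, 'Sneakers', 'Acme');
""")

# LEFT JOIN keeps SKUs that have no extension row; their extra fields are NULL.
rows = conn.execute("""
SELECT s.sku_id, s.title, e.title_en, e.brand_name
FROM shoplazza_product_sku AS s
LEFT JOIN customer1_sku_ext AS e ON e.sku_id = s.sku_id
ORDER BY s.sku_id
""").fetchall()
```

A left join is used here on the assumption that a SKU without extension data should still be indexed.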
  
  ### Configuration System
  
  The system uses two types of configurations per customer:
  
  1. **Application Structure Config** (`IndexerConfig`) - Defines:
     - Input field mappings from MySQL to Elasticsearch
     - Field types (TEXT, EMBEDDING, LITERAL, INT, DOUBLE, etc.)
     - Which fields require preprocessing (embeddings, transformations)
  
  2. **Index Structure Config** - Defines:
     - Elasticsearch field mappings and analyzers
     - Supported analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
     - Query domain definitions (default, category_name, title, brand_name, etc.)
     - BM25 parameters and similarity configurations
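
One possible shape for the application structure config is sketched below. Only the name `IndexerConfig` and the field type names come from this document; `FieldConfig`, its attributes, and the example field values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FieldType(Enum):
    TEXT = "TEXT"
    EMBEDDING = "EMBEDDING"
    LITERAL = "LITERAL"
    INT = "INT"
    DOUBLE = "DOUBLE"


@dataclass
class FieldConfig:
    # Hypothetical per-field mapping from a MySQL column to an ES field.
    source_column: str
    es_field: str
    field_type: FieldType
    preprocess: Optional[str] = None  # e.g. "text_embedding"; None means copy as-is


@dataclass
class IndexerConfig:
    customer: str
    fields: List[FieldConfig] = field(default_factory=list)

    def fields_needing_preprocessing(self) -> List[FieldConfig]:
        """Return the fields that must be transformed before indexing."""
        return [f for f in self.fields if f.preprocess is not None]


example_config = IndexerConfig(
    customer="customer1",
    fields=[
        FieldConfig("title", "title", FieldType.TEXT),
        FieldConfig("title", "title_embedding", FieldType.EMBEDDING, "text_embedding"),
        FieldConfig("price", "price", FieldType.DOUBLE),
    ],
)
```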
  
  Reference files:
  - `商品数据源入ES配置规范.md` - ES mapping and analyzer configuration standards
  - `阿里opensearch电商行业.md` - Application and index structure examples from Alibaba OpenSearch
  
  ### Query Processing
  
  The `queryParser` performs:
  1. **Query Rewriting** - Dictionary-based rewriting (brand terms, category terms, synonyms, corrections)
  2. **Translation** - Language detection and translation to support multi-language search (e.g., zh↔en)
  3. **Text Embedding** - Converts query text to vectors when vector search is enabled
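
The dictionary-based rewriting step could look roughly like the sketch below. The function name `rewrite_query`, the token-level matching strategy, and the dictionary entries are assumptions; the actual `queryParser` may match phrases or use separate dictionaries per rewrite type.

```python
def rewrite_query(query: str, rewrite_dict: dict) -> str:
    """Replace each whitespace-separated token that has a dictionary entry.

    Lookup is case-insensitive; tokens without an entry pass through unchanged.
    """
    tokens = query.split()
    return " ".join(rewrite_dict.get(token.lower(), token) for token in tokens)


# Hypothetical entries: one spelling correction and one synonym expansion.
EXAMPLE_REWRITES = {
    "nikee": "nike",
    "sneakers": "(sneakers OR trainers)",
}
```

For example, `rewrite_query("Nikee sneakers", EXAMPLE_REWRITES)` would yield a corrected brand term followed by an `OR` group that the searcher can evaluate.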
  
  ### Search and Ranking
  
  The `searcher` supports:
  - Boolean operators: AND, OR, RANK, ANDNOT with parentheses
  - Operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
  - Configurable ranking expressions for the `default` domain:
    - Example: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`
    - Combines BM25 text relevance, embedding similarity, product scores, and time decay
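
To make the example expression concrete, the sketch below evaluates it in plain Python for a single document. The weights come from the expression above; the `timeliness` implementation (exponential decay with a 30-day half-life) is purely an assumption, since this document does not define how time decay is computed.

```python
def timeliness(end_time: float, now: float, half_life_days: float = 30.0) -> float:
    """Hypothetical decay: 1.0 while end_time >= now, halving every half_life_days after."""
    age_days = max(0.0, (now - end_time) / 86400.0)
    return 0.5 ** (age_days / half_life_days)


def rank_score(bm25: float, embedding_relevance: float, general_score: float,
               end_time: float, now: float) -> float:
    """static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)"""
    return (bm25
            + 0.2 * embedding_relevance
            + general_score * 2
            + timeliness(end_time, now))
```

Timestamps are taken as Unix seconds in this sketch.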
  
  ### Embedding Modules
  
  **Text Embedding** - Uses BGE-M3 model (`Xorbits/bge-m3`):
  - Singleton pattern with thread-safe initialization
  - Generates 1024-dimensional vectors
  - Configured for GPU/CUDA acceleration
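
A minimal sketch of the thread-safe singleton pattern mentioned for the text embedding module, using double-checked locking. The class name `TextEncoder` and the `_load_model` placeholder are hypothetical; a real implementation would load the BGE-M3 weights there instead of returning a dummy object.

```python
import threading


class TextEncoder:
    _instance = None
    _lock = threading.Lock()

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model = self._load_model(model_name)

    @staticmethod
    def _load_model(model_name: str):
        # Placeholder for the expensive model load (assumption, not the real loader).
        return object()

    @classmethod
    def get_instance(cls, model_name: str = "Xorbits/bge-m3") -> "TextEncoder":
        # First check avoids taking the lock once the instance exists.
        if cls._instance is None:
            with cls._lock:
                # Second check ensures only one thread performs the load.
                if cls._instance is None:
                    cls._instance = cls(model_name)
        return cls._instance
```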
  
  **Image Embedding** - Uses CN-CLIP model (ViT-H-14):
  - Downloads and validates images from URLs
  - Preprocesses images (resize, RGB conversion)
  - Generates 1024-dimensional vectors
  - Supports both local and remote images
  
  ## Test Data
  
  **Customer1 Test Dataset:**
  - Location: `data/customer1/goods_with_pic.5years_congku.csv.shuf.1w`
  - Contains 10,000 shuffled product records with images
  - Processing script: `data/customer1/task2_process_goods.py`
    - Extracts product data from MySQL
    - Maps images from filebank database
    - Creates inverted index (URL → SKU list)
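
The URL → SKU list inverted index can be sketched as follows. The function name `build_url_index` and the `(sku_id, image_url)` row shape are assumptions and are not taken from `task2_process_goods.py`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_url_index(rows: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each image URL to the SKUs that reference it, preserving first-seen order."""
    index = defaultdict(list)
    for sku_id, url in rows:
        # Skip rows without an image and avoid listing the same SKU twice per URL.
        if url and sku_id not in index[url]:
            index[url].append(sku_id)
    return dict(index)
```

Keying on the URL means a shared image only needs to be downloaded and embedded once.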
  
  ## Key Implementation Notes
  
  1. **Data Sync:** Full data sync from MySQL to Elasticsearch is handled by a separate Java project (not in this repo). This repo may include a simple full-load implementation for testing purposes.
  
  2. **Extension Tables:** When designing customer configurations, determine which fields exist in the main SKU table vs. which need to be added to customer-specific extension tables.
  
  3. **Embedding Caching:** For periodic full indexing, embedding results should be cached to avoid recomputation.
  
  4. **ES Similarity Configuration:** All text fields use modified BM25 with `b=0.0, k1=0.0` as the default similarity.
  
  5. **Multi-Language Support:** The system is designed for cross-border e-commerce. Chinese and English are the minimum supported languages, and the design can be extended to others (Arabic, Spanish, Russian, Japanese).
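
For the embedding caching note above, a minimal in-memory sketch keyed by a hash of the input text is shown below. The class name `EmbeddingCache` and the `embed_fn` callable are hypothetical; a cache meant to survive between periodic full-index runs would need persistent storage (a file or key-value store) instead of a dict.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn          # the expensive embedding call
        self._store: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        # Hashing keeps keys a fixed size regardless of text length.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        """Return the cached vector, computing it only on the first request."""
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```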
  - Remember that this project's environment is activated with `source /home/tw/miniconda3/etc/profile.d/conda.sh && conda activate searchengine`.