  # CLAUDE.md
  
  This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
  
  ## Project Overview
  
  This is a **Search Engine SaaS** project for e-commerce product search, designed for Shoplazza (店匠) independent sites. The system provides Elasticsearch-based product search capabilities with multi-language support, text/image embeddings, and configurable ranking.
  
  **Tech Stack:**
  - Elasticsearch as the search engine backend
  - MySQL (Shoplazza database) as the primary data source
  - Python for data processing and ingestion
  - BGE-M3 model for text embeddings (1024-dim vectors)
  - CN-CLIP (ViT-H-14) for image embeddings
  
  ## Database Configuration
  
  **Shoplazza Production Database:**
  ```
  host: 120.79.247.228
  port: 3316
  database: saas
  username: saas
  password: P89cZHS5d7dFyc9R
  ```
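
Connection settings like these are usually read at runtime rather than hard-coded in scripts. The following is a minimal sketch of that idea, not code from this repository: the `MySQLSettings` class, the `load_mysql_settings` function, and every `SHOPLAZZA_DB_*` variable name are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MySQLSettings:
    host: str
    port: int
    database: str
    username: str
    password: str


def load_mysql_settings(env=os.environ):
    """Build settings from environment variables (all names are hypothetical)."""
    return MySQLSettings(
        host=env["SHOPLAZZA_DB_HOST"],
        # Port and database fall back to the values documented above.
        port=int(env.get("SHOPLAZZA_DB_PORT", "3316")),
        database=env.get("SHOPLAZZA_DB_NAME", "saas"),
        username=env["SHOPLAZZA_DB_USER"],
        password=env["SHOPLAZZA_DB_PASSWORD"],
    )
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real environment.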
  
  **Main Tables:**
  - `shoplazza_product_sku` - SKU level product data
  - `shoplazza_product_spu` - SPU level product data
  
  ## Architecture
  
  ### Data Flow
  1. **Data Source (MySQL)** → Main tables (`shoplazza_product_sku`, `shoplazza_product_spu`) + tenant extension tables
  2. **Indexer** → Reads from MySQL, applies transformations (embeddings, etc.), writes to Elasticsearch
  3. **Query Parser** → Query rewriting, translation, text embedding conversion
  4. **Searcher** → Executes searches against Elasticsearch with configurable ranking
  
  ### Multi-Tenant Design
  Each tenant has its own extension table, which stores custom attributes, multi-language fields (titles, brand names, tags, categories), and business-specific metadata. The main SKU table is joined with the tenant's extension table during indexing.
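
The join described above can be sketched as follows. This is an illustration under stated assumptions, not the project's indexer: it uses an in-memory SQLite database as a stand-in for MySQL, and the extension table name `tenant1_sku_ext`, all column names, and the choice of a `LEFT JOIN` (so SKUs without an extension row are still indexed) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL source
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE shoplazza_product_sku (id INTEGER PRIMARY KEY, title TEXT, price REAL);
CREATE TABLE tenant1_sku_ext (sku_id INTEGER PRIMARY KEY, title_en TEXT, brand_name TEXT);
INSERT INTO shoplazza_product_sku VALUES (1, '红色连衣裙', 19.9), (2, '蓝牙耳机', 35.0);
INSERT INTO tenant1_sku_ext VALUES (1, 'Red dress', 'Acme');
""")


def load_documents(conn, ext_table):
    """Join the main SKU table with one tenant's extension table."""
    # ext_table must come from trusted tenant configuration, never user input,
    # because table names cannot be bound as SQL parameters.
    sql = f"""
        SELECT sku.id, sku.title, sku.price, ext.title_en, ext.brand_name
        FROM shoplazza_product_sku AS sku
        LEFT JOIN {ext_table} AS ext ON ext.sku_id = sku.id
        ORDER BY sku.id
    """
    return [dict(row) for row in conn.execute(sql)]


docs = load_documents(conn, "tenant1_sku_ext")
```

Each resulting dict holds the main-table columns plus the tenant-specific ones, with `None` where the tenant has no extension row for that SKU.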
  
  ### Configuration System
  
  The system uses two kinds of configuration per tenant:
  
  1. **Application Structure Config** (`IndexerConfig`) - Defines:
     - Input field mappings from MySQL to Elasticsearch
     - Field types (TEXT, EMBEDDING, LITERAL, INT, DOUBLE, etc.)
     - Which fields require preprocessing (embeddings, transformations)
  
  2. **Index Structure Config** - Defines:
     - Elasticsearch field mappings and analyzers
     - Supported analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
     - Query domain definitions (default, category_name, title, brand_name, etc.)
     - BM25 parameters and similarity configurations
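
To make the Application Structure Config more concrete, here is one possible shape for it. Only the name `IndexerConfig` and the field type names come from this document; the `FieldConfig` class, its attributes, and the helper method are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FieldType(Enum):
    TEXT = "TEXT"
    EMBEDDING = "EMBEDDING"
    LITERAL = "LITERAL"
    INT = "INT"
    DOUBLE = "DOUBLE"


@dataclass
class FieldConfig:
    source_column: str               # column name in MySQL (hypothetical attribute)
    es_field: str                    # target field name in Elasticsearch
    field_type: FieldType
    analyzer: Optional[str] = None   # only meaningful for TEXT fields


@dataclass
class IndexerConfig:
    tenant: str
    fields: List[FieldConfig] = field(default_factory=list)

    def fields_needing_embedding(self):
        """Return the fields that require an embedding preprocessing step."""
        return [f for f in self.fields if f.field_type is FieldType.EMBEDDING]


config = IndexerConfig(
    tenant="tenant1",
    fields=[
        FieldConfig("title", "title_zh", FieldType.TEXT, analyzer="chinese"),
        FieldConfig("title", "title_embedding", FieldType.EMBEDDING),
        FieldConfig("price", "price", FieldType.DOUBLE),
    ],
)
```

Here the same source column feeds both a text field and an embedding field, which is one way the "requires preprocessing" distinction could be expressed.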
  
  ### Query Processing
  
  The `queryParser` performs:
  1. **Query Rewriting** - Dictionary-based rewriting (brand terms, category terms, synonyms, corrections)
  2. **Translation** - Language detection and translation to support multi-language search (e.g., zh↔en)
  3. **Text Embedding** - Converts query text to vectors when vector search is enabled
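
The three stages above could be wired together roughly like this. It is a sketch only: the rewrite dictionary entries, the `ParsedQuery` class, and the `translate` and `embed` callables are hypothetical placeholders, and the real `queryParser` may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical rewrite dictionary: surface form -> canonical form.
REWRITES = {"iphone": "apple iphone", "tshirt": "t-shirt", "sneekers": "sneakers"}


@dataclass
class ParsedQuery:
    original: str
    rewritten: str
    translations: Dict[str, str] = field(default_factory=dict)
    vector: Optional[List[float]] = None


def rewrite(query, rewrites=REWRITES):
    """Token-level dictionary rewriting (synonyms, corrections, brand terms)."""
    tokens = query.lower().split()
    return " ".join(rewrites.get(tok, tok) for tok in tokens)


def parse_query(query, translate=None, embed=None):
    """Run rewriting, then optional translation and embedding steps."""
    rewritten = rewrite(query)
    parsed = ParsedQuery(original=query, rewritten=rewritten)
    if translate is not None:      # e.g. returns {"zh": "..."} for an English query
        parsed.translations = translate(rewritten)
    if embed is not None:          # only when vector search is enabled
        parsed.vector = embed(rewritten)
    return parsed
```

For example, `parse_query("Red sneekers").rewritten` yields `"red sneakers"`, and passing an `embed` callable fills in `vector`.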
  
  ### Search and Ranking
  
  The `searcher` supports:
  - Boolean operators: `AND`, `OR`, `RANK`, `ANDNOT`, with parentheses for grouping
  - Operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
  - Configurable ranking expressions for the `default` domain:
    - Example: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`
    - Combines BM25 text relevance, embedding similarity, product scores, and time decay
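
The operator precedence listed above can be illustrated with a small recursive-descent parser. This is a sketch rather than the `searcher` implementation: it assumes operators are written in upper case, that all operators are binary and left-associative, and that any other token is a search term.

```python
import re

# Binary operators from loosest to tightest binding.
LEVELS = ["RANK", "OR", "AND", "ANDNOT"]
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse(query):
    """Parse a boolean query into nested (operator, left, right) tuples."""
    tokens = TOKEN_RE.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_level(level):
        # Past the tightest operator, only terms and parentheses remain.
        if level == len(LEVELS):
            return parse_primary()
        node = parse_level(level + 1)
        while peek() == LEVELS[level]:
            op = advance()
            node = (op, node, parse_level(level + 1))
        return node

    def parse_primary():
        tok = peek()
        if tok is None or tok == ")" or tok in LEVELS:
            raise ValueError(f"expected a term at token {pos}")
        advance()
        if tok == "(":
            node = parse_level(0)
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        return tok

    tree = parse_level(0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree
```

With these rules, `a OR b AND c ANDNOT d RANK e` groups as `(a OR (b AND (c ANDNOT d))) RANK e`, and parentheses override the default grouping.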
  
  ### Embedding Modules
  
  **Text Embedding** - Uses BGE-M3 model (`Xorbits/bge-m3`):
  - Singleton pattern with thread-safe initialization
  - Generates 1024-dimensional vectors
  - Configured for GPU/CUDA acceleration
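
A thread-safe singleton of the kind described above is commonly built with double-checked locking. The sketch below shows that pattern only; the `TextEncoder` class and the `load_model` callable are hypothetical, and a fake loader stands in for loading BGE-M3 so the example runs without a GPU or model download.

```python
import threading


class TextEncoder:
    _instance = None
    _lock = threading.Lock()

    def __init__(self, load_model):
        # Expensive step: in the real module this would load the model once.
        self.model = load_model()

    @classmethod
    def get(cls, load_model):
        """Return the shared encoder, creating it on first use."""
        if cls._instance is None:            # fast path, no lock needed
            with cls._lock:
                if cls._instance is None:    # re-check while holding the lock
                    cls._instance = cls(load_model)
        return cls._instance


load_calls = []


def fake_loader():
    load_calls.append(1)   # record each load so the demo can count them
    return object()


threads = [threading.Thread(target=TextEncoder.get, args=(fake_loader,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the eight threads finish, `load_calls` contains a single entry, because the second check inside the lock prevents a second load.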
  
  **Image Embedding** - Uses CN-CLIP model (ViT-H-14):
  - Downloads and validates images from URLs
  - Preprocesses images (resize, RGB conversion)
  - Generates 1024-dimensional vectors
  - Supports both local and remote images
  
  ## Test Data
  
  **Tenant1 Test Dataset:**
  - Location: `data/tenant1/goods_with_pic.5years_congku.csv.shuf.1w`
  - Contains 10,000 shuffled product records with images
  - Processing script: `data/tenant1/task2_process_goods.py`
    - Extracts product data from MySQL
    - Maps images from filebank database
    - Creates inverted index (URL → SKU list)
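
The inverted index mentioned in the last step maps each image URL to the SKUs that use it. A minimal sketch follows, assuming the input is a sequence of `(sku_id, image_url)` pairs; the function name and input shape are hypothetical and may not match `task2_process_goods.py`.

```python
from collections import defaultdict


def build_url_index(rows):
    """Map each image URL to the list of SKU ids that reference it."""
    index = defaultdict(list)
    for sku_id, image_url in rows:
        # Skip rows without an image and avoid listing a SKU twice per URL.
        if image_url and sku_id not in index[image_url]:
            index[image_url].append(sku_id)
    return dict(index)


rows = [
    (1, "https://img.example/a.jpg"),
    (2, "https://img.example/a.jpg"),
    (3, "https://img.example/b.jpg"),
    (1, "https://img.example/a.jpg"),  # duplicate pair, ignored
    (4, None),                          # no image, ignored
]
url_index = build_url_index(rows)
```

With such an index, an image only needs to be embedded once even when several SKUs share it.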
  
  ## Key Implementation Notes
  
  1. **Data Sync:** Full data sync from MySQL to Elasticsearch is handled by a separate Java project (not in this repo). This repo may include a simple full-load implementation for testing purposes.
  
  2. **Extension Tables:** When designing tenant configurations, determine which fields exist in the main SKU table vs. which need to be added to tenant-specific extension tables.
  
  3. **Embedding Caching:** For periodic full indexing, embedding results should be cached to avoid recomputation.
  
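  4. **ES Similarity Configuration:** All text fields use a modified BM25 with `b=0.0, k1=0.0` as the default similarity. Setting `b=0.0` turns off document-length normalization, and `k1=0.0` removes the effect of term frequency, so each matching term contributes only its IDF.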
  
  5. **Multi-Language Support:** The system targets cross-border e-commerce. It supports Chinese and English at a minimum and is extensible to other languages (Arabic, Spanish, Russian, Japanese).
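
The embedding caching described in note 3 could be approached as below. This is a sketch under stated assumptions: the `EmbeddingCache` class and the `embed` callable are hypothetical, the store is an in-memory dict (a real full-index run would likely need a persistent store), and keys are derived from the model name plus the text so that changing the model invalidates old entries.

```python
import hashlib


class EmbeddingCache:
    def __init__(self, embed, model_name):
        self._embed = embed            # callable: text -> vector
        self._model_name = model_name
        self._store = {}
        self.misses = 0                # number of times embed() was actually called

    def _key(self, text):
        # Include the model name so vectors from different models never collide.
        payload = f"{self._model_name}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text):
        """Return the cached vector, computing it only on first sight."""
        key = self._key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]


cache = EmbeddingCache(embed=lambda text: [float(len(text))], model_name="bge-m3")
first = cache.get("red dress")
second = cache.get("red dress")   # served from the cache, no recomputation
```
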
  - Remember that this project's environment is activated with `source /home/tw/miniconda3/etc/profile.d/conda.sh && conda activate searchengine`.
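
To see what the `b=0.0, k1=0.0` similarity from note 4 implies, the per-term BM25 score can be written out directly. The function below is a standalone sketch of the standard Lucene-style formula, not code from this repository, and its parameter names are chosen for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Score of one query term in one document under BM25."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length-normalized saturation term; it vanishes entirely when k1 == 0.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf / (tf + norm)


# Same term, very different term frequency and document length.
short_doc = bm25_term_score(tf=1, doc_len=5, avg_doc_len=50,
                            doc_count=10_000, doc_freq=100, k1=0.0, b=0.0)
long_doc = bm25_term_score(tf=10, doc_len=500, avg_doc_len=50,
                           doc_count=10_000, doc_freq=100, k1=0.0, b=0.0)
```

With `k1=0.0` the fraction `tf / (tf + norm)` is 1 for any matching document, so `short_doc` and `long_doc` are equal and both reduce to the term's IDF. Text matching then behaves like a presence test weighted by term rarity, leaving finer ordering to the other ranking signals.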