  # CLAUDE.md
  
  This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
  
  ## Project Overview
  
  This is a **Search Engine SaaS** platform for e-commerce product search, specifically designed for Shoplazza (店匠) independent sites. It's a multi-tenant configurable search system built on Elasticsearch with AI-powered search capabilities.
  
  **Tech Stack:**
  - Elasticsearch 8.x as the search engine backend
  - MySQL (Shoplazza database) as the primary data source
  - Python 3.10 with PyTorch/CUDA support
  - BGE-M3 model for text embeddings (1024-dim vectors)
  - CN-CLIP (ViT-H-14) for image embeddings
  - FastAPI for REST API layer
  
  ## Development Environment
  
  **Required Environment Setup:**
  ```bash
  source /home/tw/miniconda3/etc/profile.d/conda.sh
  conda activate searchengine
  ```
  
  **Database Configuration:**
  ```
  host: 120.79.247.228
  port: 3316
  database: saas
  username: saas
  password: P89cZHS5d7dFyc9R
  ```
  
  ## Common Development Commands
  
  ### Environment Setup
  ```bash
  # Complete environment setup
  ./setup.sh
  
  # Install Python dependencies
  pip install -r requirements.txt
  ```
  
  ### Data Management
  ```bash
  # Generate test data (Tenant1 Mock + Tenant2 CSV)
  ./scripts/mock_data.sh
  
  # Ingest data to Elasticsearch
  ./scripts/ingest.sh <tenant_id> [recreate]  # e.g., ./scripts/ingest.sh 1 true
  python main.py ingest data.csv --limit 1000 --batch-size 50
  ```
  
  ### Running Services
  ```bash
  # Start all services (production)
  ./run.sh
  
  # Start development server with auto-reload
  ./scripts/start_backend.sh
  python main.py serve --host 0.0.0.0 --port 6002 --reload
  
  # Start frontend debugging UI
  ./scripts/start_frontend.sh
  ```
  
  ### Testing
  ```bash
  # Run all tests
  python -m pytest tests/
  
  # Run specific test types
  python -m pytest tests/unit/          # Unit tests
  python -m pytest tests/integration/   # Integration tests
  python -m pytest -m "api"             # API tests only
  
  # Test search from command line
  python main.py search "query" --tenant-id 1 --size 10
  ```
  
  ### Development Utilities
  ```bash
  # Stop all services
  ./scripts/stop.sh
  
  # Test environment (for CI/development)
  ./scripts/start_test_environment.sh
  ./scripts/stop_test_environment.sh
  
  # Install server dependencies
  ./scripts/install_server_deps.sh
  ```
  
  ## Architecture Overview
  
  ### Core Components
  ```
  /data/tw/SearchEngine/
  ├── api/              # FastAPI REST API service (port 6002)
  ├── config/           # Configuration management system
  ├── indexer/          # MySQL → Elasticsearch data pipeline
  ├── search/           # Search engine and ranking logic
  ├── query/            # Query parsing, translation, rewriting
  ├── embeddings/       # ML models (BGE-M3, CN-CLIP)
  ├── scripts/          # Automation and utility scripts
  ├── utils/            # Shared utilities (ES client, etc.)
  ├── frontend/         # Simple debugging UI
  ├── mappings/         # Elasticsearch index mappings
  └── tests/            # Unit and integration tests
  ```
  
  ### Data Flow Architecture
  **Pipeline**: MySQL → Indexer → Elasticsearch → API → Frontend
  
  1. **Data Source Layer**:
     - Shoplazza MySQL database with `shoplazza_product_sku` and `shoplazza_product_spu` tables
     - Tenant-specific extension tables for custom attributes and multi-language fields
  
  2. **Indexing Layer** (`indexer/`):
     - Reads from MySQL, applies transformations with embeddings
     - Uses `DataTransformer` and `IndexingPipeline` for batch processing
     - Supports both full and incremental indexing with embedding caching
  
  3. **Query Processing Layer** (`query/`):
     - `QueryParser`: Handles query rewriting, translation, and text embedding conversion
     - Multi-language support with automatic detection and translation
     - Boolean logic parsing with operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
  
  4. **Search Engine Layer** (`search/`):
     - `Searcher`: Executes hybrid searches combining BM25 and dense vectors
     - Configurable ranking expressions with function_score support
     - Multi-tenant isolation via `tenant_id` field
  
  5. **API Layer** (`api/`):
     - FastAPI service on port 6002 with multi-tenant support
     - Text search: `POST /search/`
     - Image search: `POST /image-search/`
     - Tenant identification via `X-Tenant-ID` header
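
As a rough illustration of the API layer described above, the following sketch builds (but does not send) a text search request that carries the tenant in the `X-Tenant-ID` header. The endpoint, port, and header come from this document; the JSON field names `query` and `size`, the helper `build_search_request`, and the `localhost` base URL are assumptions made for the example, not the service's confirmed schema.

```python
import json
import urllib.request


def build_search_request(base_url: str, tenant_id: str, query: str, size: int = 10) -> urllib.request.Request:
    """Build (but do not send) a POST /search/ request for one tenant."""
    # NOTE: the body field names "query" and "size" are assumed for illustration.
    body = json.dumps({"query": query, "size": size}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/search/",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Tenant-ID": tenant_id,  # tenant identification header from this document
        },
    )


request = build_search_request("http://localhost:6002", tenant_id="1", query="red dress")

# To send it against a running service:
#   with urllib.request.urlopen(request) as response:
#       results = json.load(response)
```

Building the request separately from sending it keeps the example runnable without a live backend and makes the tenant header easy to check in tests.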
  
  ### Multi-Tenant Configuration System
  
  The system uses centralized configuration through `config/config.yaml`:
  
  1. **Field Configuration** (`config/field_types.py`):
     - Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
     - Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
     - Required fields and preprocessing rules
  
  2. **Index Configuration** (`mappings/search_products.json`):
     - Unified index structure shared by all tenants
     - Elasticsearch field mappings and analyzer configurations
     - BM25 similarity with modified parameters (`b=0.0, k1=0.0`)
  
  3. **Query Configuration** (`search/query_config.py`):
     - Query domain definitions (default, category_name, title, brand_name, etc.)
     - Ranking expressions and function_score configurations
     - Translation and embedding settings
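
To see what the modified BM25 parameters in the index configuration (`b=0.0, k1=0.0`) imply, here is a small sketch of a per-term BM25 score. It assumes the Lucene-style form `idf * tf / (tf + k1 * (1 - b + b * dl / avgdl))`; the function name and the sample numbers are illustrative only. With `k1 = 0` the term-frequency factor collapses to 1 for any matching term, so the score depends only on the IDF and no longer on how often the term repeats or how long the field is.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Per-term BM25 score, assuming the Lucene-style formula (illustrative sketch)."""
    if tf == 0:
        return 0.0
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Parenthesised so that the tf factor is exactly 1.0 when k1 == 0.
    return idf * (tf / (tf + length_norm))


# Same term, two very different documents (made-up numbers).
short_doc = dict(tf=1, doc_len=5, avg_doc_len=20, doc_count=10_000, doc_freq=100)
long_doc = dict(tf=7, doc_len=500, avg_doc_len=20, doc_count=10_000, doc_freq=100)

default_scores = (bm25_term_score(**short_doc), bm25_term_score(**long_doc))
flat_scores = (
    bm25_term_score(**short_doc, k1=0.0, b=0.0),
    bm25_term_score(**long_doc, k1=0.0, b=0.0),
)
```

Under the default parameters the two documents score differently; with `k1=0.0, b=0.0` both receive the same IDF-only score, which suits short product titles where repeating a keyword should not boost relevance.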
  
  ### Embedding Models
  
  **Text Embedding** (`embeddings/bge_encoder.py`):
  - Uses BGE-M3 model (`Xorbits/bge-m3`)
  - Singleton pattern with thread-safe initialization
  - Generates 1024-dimensional vectors with GPU/CUDA support
  - Configurable caching to avoid recomputation
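
The singleton with thread-safe initialization mentioned above could look roughly like the sketch below. It uses double-checked locking so the expensive model load happens once even when several request threads ask for the encoder at the same time. `TextEncoder`, `_load_model`, and `load_count` are hypothetical names; the actual class in `embeddings/bge_encoder.py` may be structured differently, and the placeholder loader does not touch BGE-M3.

```python
import threading


class TextEncoder:
    """Hypothetical stand-in for the text encoder singleton."""

    _instance = None
    _lock = threading.Lock()
    load_count = 0  # how many times the (expensive) load actually ran

    def __new__(cls):
        if cls._instance is None:              # fast path without taking the lock
            with cls._lock:
                if cls._instance is None:      # re-check once the lock is held
                    instance = super().__new__(cls)
                    instance._load_model()
                    cls._instance = instance   # publish only after loading finished
        return cls._instance

    def _load_model(self):
        # Placeholder: a real encoder would load the model weights here.
        type(self).load_count += 1
        self.dim = 1024


instances = []


def worker():
    instances.append(TextEncoder())


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

All eight threads end up holding the same object, and the loader runs a single time.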
  
  **Image Embedding** (`embeddings/clip_encoder.py`):
  - Uses CN-CLIP model (ViT-H-14)
  - Downloads and preprocesses images from URLs
  - Supports both local and remote image processing
  - Generates 1024-dimensional vectors
  
  ### Search and Ranking
  
  **Hybrid Search Approach**:
  - Combines traditional BM25 text relevance with dense vector similarity
  - Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
  - Configurable ranking expressions like: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`
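
The example ranking expression above is a weighted sum of independent signals. The sketch below evaluates it in plain Python to make the arithmetic concrete. The `Candidate` class and the `timeliness` rule (1.0 while `end_time` is in the future, 0.0 afterwards) are assumptions for illustration; the project's real `timeliness()` function is not described in this document and may well be a smooth decay.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Candidate:
    bm25: float                      # static_bm25()
    text_embedding_relevance: float  # e.g. cosine similarity of query and title vectors
    general_score: float             # precomputed quality score
    end_time: datetime               # when the listing stops being current


def timeliness(end_time: datetime, now: datetime) -> float:
    """Hypothetical rule: full credit while the listing is still active."""
    return 1.0 if end_time > now else 0.0


def rank_score(candidate: Candidate, now: datetime) -> float:
    """static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)"""
    return (
        candidate.bm25
        + 0.2 * candidate.text_embedding_relevance
        + candidate.general_score * 2
        + timeliness(candidate.end_time, now)
    )


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
active = Candidate(3.0, 0.5, 1.0, datetime(2025, 6, 1, tzinfo=timezone.utc))
expired = Candidate(3.0, 0.5, 1.0, datetime(2024, 6, 1, tzinfo=timezone.utc))

active_score = rank_score(active, now)    # 3.0 + 0.1 + 2.0 + 1.0
expired_score = rank_score(expired, now)  # 3.0 + 0.1 + 2.0 + 0.0
```

Because the weights sit in the expression itself, tuning the balance between text relevance and vector similarity is a configuration change rather than a code change.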
  
  **Boolean Search Support**:
  - Full boolean logic with AND, OR, ANDNOT, RANK operators
  - Parentheses for complex query structures
  - Configurable operator precedence
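
A minimal precedence-climbing parser shows how the documented precedence (`()` > `ANDNOT` > `AND` > `OR` > `RANK`) shapes the query tree. This is a sketch, not the project's `QueryParser`: it assumes uppercase, whitespace-separated operators, treats every operator as binary and left-associative, and does not handle implicit operators between adjacent terms.

```python
import re

# Higher number binds tighter; parentheses are handled in parse_primary.
PRECEDENCE = {"RANK": 1, "OR": 2, "AND": 3, "ANDNOT": 4}


def tokenize(query):
    """Split into parentheses and whitespace-delimited words."""
    return re.findall(r"\(|\)|[^\s()]+", query)


def parse(query):
    """Return a nested (operator, left, right) tuple tree for the query."""
    tokens = tokenize(query)
    pos = 0

    def parse_primary():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = parse_expression(1)
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if token == ")" or token in PRECEDENCE:
            raise ValueError(f"unexpected token {token!r}")
        return token

    def parse_expression(min_precedence):
        nonlocal pos
        left = parse_primary()
        while (
            pos < len(tokens)
            and tokens[pos] in PRECEDENCE
            and PRECEDENCE[tokens[pos]] >= min_precedence
        ):
            operator = tokens[pos]
            pos += 1
            # +1 makes operators of equal precedence left-associative.
            right = parse_expression(PRECEDENCE[operator] + 1)
            left = (operator, left, right)
        return left

    tree = parse_expression(1)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


tree = parse("shoes OR boots AND leather ANDNOT suede")
grouped = parse("(shoes OR boots) AND leather")
```

In the first query `ANDNOT` binds before `AND`, which binds before `OR`, so `tree` is `("OR", "shoes", ("AND", "boots", ("ANDNOT", "leather", "suede")))`; the parentheses in the second query override that and yield `("AND", ("OR", "shoes", "boots"), "leather")`.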
  
  **Faceted Search**:
  - Terms and range faceting support
  - Multi-dimensional filtering capabilities
  - Configurable facet fields and aggregations
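
Terms and range facets map onto Elasticsearch `terms` and `range` aggregations. The sketch below assembles a request body that combines a tenant filter with both facet types. The `tenant_id` field comes from this document; the `title` and `price` field names, the use of `category_name` and `brand_name` as facet fields, and the helper `build_facet_query` are assumptions for illustration.

```python
def build_facet_query(tenant_id, text, facet_fields, price_ranges):
    """Build an Elasticsearch search body with terms and range facets (sketch)."""
    # One terms aggregation per facet field (assumed to be keyword fields).
    aggs = {
        f"{field}_facet": {"terms": {"field": field, "size": 10}}
        for field in facet_fields
    }
    # One range aggregation over an assumed numeric "price" field.
    aggs["price_ranges"] = {"range": {"field": "price", "ranges": price_ranges}}
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                # Tenant isolation: restrict every query to one tenant's documents.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
        "aggs": aggs,
        "size": 10,
    }


body = build_facet_query(
    tenant_id="1",
    text="dress",
    facet_fields=["category_name", "brand_name"],
    price_ranges=[{"to": 50}, {"from": 50, "to": 200}, {"from": 200}],
)
```

Keeping the tenant clause in `filter` rather than `must` means it restricts results without contributing to the relevance score.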
  
  ## Testing Infrastructure
  
  **Test Framework**: pytest with async support
  
  **Test Structure**:
  - `tests/conftest.py`: Comprehensive test fixtures and configuration
  - `tests/unit/`: Unit tests for individual components
  - `tests/integration/`: Integration tests for system workflows
  - Test markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.api`
  
  **Test Data**:
  - Tenant1: Mock data with 10,000 product records
  - Tenant2: CSV-based test dataset
  - Automated test data generation via `scripts/mock_data.sh`
  
  **Key Test Fixtures** (from `conftest.py`):
  - `sample_search_config`: Complete configuration for testing
  - `mock_es_client`: Mocked Elasticsearch client
  - `test_searcher`: Searcher instance with mock dependencies
  - `temp_config_file`: Temporary YAML configuration for tests
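
A mocked Elasticsearch client along the lines of `mock_es_client` can be sketched with `unittest.mock` alone. The helper names, the `search_products` index name, and the `search_titles` function under test are hypothetical; the real fixtures in `tests/conftest.py` may be shaped differently.

```python
from unittest.mock import MagicMock


def make_mock_es_client(hits):
    """Return a MagicMock whose search() yields an Elasticsearch-shaped response."""
    client = MagicMock()
    client.search.return_value = {
        "hits": {"total": {"value": len(hits), "relation": "eq"}, "hits": hits}
    }
    return client


def search_titles(es_client, tenant_id, text):
    """Hypothetical code under test: a tenant-filtered title search."""
    response = es_client.search(
        index="search_products",  # assumed index name
        query={
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
    )
    return [hit["_source"]["title"] for hit in response["hits"]["hits"]]


mock_client = make_mock_es_client([{"_id": "1", "_source": {"title": "Red dress"}}])
titles = search_titles(mock_client, tenant_id="1", text="dress")
```

In a pytest suite, `make_mock_es_client` would typically be wrapped in a `@pytest.fixture`, and a test could additionally assert on `mock_client.search.call_args` to confirm that the tenant filter was sent.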
  
  ## API Endpoints
  
  **Main API** (FastAPI on port 6002):
  - `POST /search/` - Text search with multi-language support
  - `POST /image-search/` - Image search using CN-CLIP embeddings
  - Health check and management endpoints
  - Multi-tenant support via `X-Tenant-ID` header
  
  **API Features**:
  - Hybrid search combining text and vector similarity
  - Configurable ranking and filtering
  - Faceted search with aggregations
  - Multi-language query processing and translation
  - Real-time search with configurable result sizes
  
  ## Key Implementation Details
  
  1. **Environment Variables**: All sensitive configuration stored in `.env` (template: `.env.example`)
  2. **Configuration Management**: Dynamic config loading through `config/config_loader.py`
  3. **Error Handling**: Comprehensive error handling with proper HTTP status codes
  4. **Performance**: Batch processing for indexing, embedding caching, and connection pooling
  5. **Logging**: Structured logging with request tracing for debugging
  6. **Security**: Tenant isolation enforced by filtering on the `tenant_id` field within the shared index, with proper access controls
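
The batch processing and embedding caching mentioned in the performance item can be combined as in the sketch below: records are consumed in fixed-size batches, and only texts not seen before are passed to the encoder. `batched`, `CachingEmbedder`, and the fake embedding function are illustrative names, not the project's `IndexingPipeline` API.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch


class CachingEmbedder:
    """Wrap a batch embedding function with an in-memory cache (sketch)."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache = {}

    def embed_batch(self, texts):
        # Deduplicate while preserving order, then embed only the cache misses.
        missing = [text for text in dict.fromkeys(texts) if text not in self._cache]
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[text] = vector
        return [self._cache[text] for text in texts]


embedded_texts = []  # records what actually reached the (fake) model


def fake_embed(texts):
    embedded_texts.extend(texts)
    return [[float(len(text))] for text in texts]  # stand-in for a real vector


embedder = CachingEmbedder(fake_embed)
titles = ["red dress", "blue shirt", "red dress", "green hat", "blue shirt"]
vectors = [vector for batch in batched(titles, 2) for vector in embedder.embed_batch(batch)]
```

Five titles are indexed, but only the three distinct ones reach the encoder, which is where the savings come from when product catalogs contain repeated titles.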
  
  ## Database Tables
  
  **Main Tables**:
  - `shoplazza_product_sku` - SKU level product data with pricing and inventory
  - `shoplazza_product_spu` - SPU level product data with categories and attributes
  - Tenant extension tables for custom fields and multi-language content
  
  **Data Processing**:
  - Full data sync handled by separate Java project (not in this repo)
  - This repository includes test implementations for development and debugging
  - Extension tables joined with main tables during indexing process