# Quick Start Guide
## Prerequisites
1. **Python 3.8+**
2. **Elasticsearch 8.x** (running on localhost:9200 or remote)
3. **Optional**: CUDA-enabled GPU for faster embeddings
## Installation
### 1. Install Dependencies
```bash
cd /data/tw/SearchEngine
pip install -r requirements.txt
```
### 2. Set Environment Variables (Optional)
```bash
# Elasticsearch
export ES_HOST="http://localhost:9200"
# DeepL API (for translation)
export DEEPL_API_KEY="your-api-key-here"
# Customer ID
export CUSTOMER_ID="customer1"
```
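The variables above can also be read from Python with the same fallbacks. This is a minimal sketch: the `load_settings` helper and the default values are assumptions for illustration, and the way the actual service reads its environment may differ.

```python
import os

# Defaults mirror the export lines above; how the real service consumes
# these variables is an assumption, not taken from the project code.
DEFAULTS = {
    "ES_HOST": "http://localhost:9200",
    "DEEPL_API_KEY": "",
    "CUSTOMER_ID": "customer1",
}


def load_settings(env=None):
    """Return the three settings, falling back to defaults when unset."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in load_settings().items():
        # Do not echo the API key itself.
        shown = "<set>" if name == "DEEPL_API_KEY" and value else value
        print(f"{name}={shown}")
```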
## Running the System
### Option 1: Quick Test (Without Full Data)
```bash
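# 1. Start Elasticsearch (if not running)
# Elasticsearch 8.x enables TLS and authentication by default; security is
# disabled here so the plain-HTTP URLs in this guide work (local testing only).
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0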
# 2. Ingest sample data (100 documents for quick test)
cd data/customer1
python ingest_customer1.py \
--limit 100 \
--recreate-index \
--es-host http://localhost:9200 \
--skip-embeddings
# 3. Start API service
cd ../..
python -m api.app --host 0.0.0.0 --port 6002
# 4. Test search
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "消防", "size": 5}'
```
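The same search call can be issued from Python using only the standard library. This is a sketch under the assumption that the service listens on `localhost:6002` as started above; the `build_search_request` and `search` helper names are hypothetical, and the response is returned as parsed JSON without assuming its shape.

```python
import json
import urllib.request


def build_search_request(query, size=5, base_url="http://localhost:6002"):
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"query": query, "size": size}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, size=5, base_url="http://localhost:6002", timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_search_request(query, size, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `search("消防")` requires the API service from step 3 to be running; `build_search_request` can be inspected offline.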
### Option 2: Full System with Embeddings
```bash
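# 1. Start Elasticsearch
# Elasticsearch 8.x enables TLS and authentication by default; security is
# disabled here so the plain-HTTP URLs in this guide work (local testing only).
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" elasticsearch:8.11.0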
# 2. Ingest full dataset with embeddings (requires GPU, takes ~10-30 min)
cd data/customer1
python ingest_customer1.py \
--csv goods_with_pic.5years_congku.csv.shuf.1w \
--recreate-index \
--batch-size 100 \
--es-host http://localhost:9200
# 3. Start API service
cd ../..
python -m api.app \
--host 0.0.0.0 \
--port 6002 \
--customer customer1 \
--es-host http://localhost:9200
# 4. Test various searches
# Simple search
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "芭比娃娃", "size": 10}'
# Boolean search
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "toy AND (barbie OR doll)", "size": 10}'
# With filters
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
```
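To illustrate what a boolean expression such as `toy AND (barbie OR doll)` could turn into on the Elasticsearch side, here is a small recursive-descent sketch. It is not the project's parser: the `parse_boolean` function and the `title` field name are assumptions, `AND` binds tighter than `OR`, and adjacent words without an operator are rejected.

```python
import re

# Tokens are parentheses or runs of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_boolean(text, field="title"):
    """Parse 'a AND (b OR c)' into an Elasticsearch bool query dict."""
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return token

    def parse_or():
        clauses = [parse_and()]
        while peek() == "OR":
            take()
            clauses.append(parse_and())
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"should": clauses, "minimum_should_match": 1}}

    def parse_and():
        clauses = [parse_atom()]
        while peek() == "AND":
            take()
            clauses.append(parse_atom())
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"must": clauses}}

    def parse_atom():
        token = take()
        if token == "(":
            node = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if token in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {token!r}")
        # Hypothetical field name; the real index mapping may differ.
        return {"match": {field: token}}

    query = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return query
```

Each `AND` group becomes a `must` list and each `OR` group a `should` list with `minimum_should_match` set to 1, so nesting in the expression maps directly onto nested `bool` queries.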
## API Documentation
Once the service is running, visit:
- **Swagger UI**: http://localhost:6002/docs
- **ReDoc**: http://localhost:6002/redoc
## Common Issues
### Issue: Elasticsearch connection failed
**Solution**: Ensure Elasticsearch is running and accessible
```bash
curl http://localhost:9200
```
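The same check can be scripted, for example before starting an ingest run. This sketch relies on the Elasticsearch root endpoint returning a JSON document with a `version.number` field; the `es_is_reachable` helper and its injectable `opener` parameter are assumptions added for illustration and testing.

```python
import json
import urllib.error
import urllib.request


def es_is_reachable(url="http://localhost:9200", timeout=3,
                    opener=urllib.request.urlopen):
    """Return (ok, detail): the ES version on success, the error text otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            info = json.load(response)
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # URLError covers refused connections and timeouts; ValueError covers
        # a response body that is not valid JSON.
        return False, str(exc)
    return True, info.get("version", {}).get("number", "unknown")
```

A `False` result with a connection-refused message usually means the container is not running or the port is not published; an SSL or empty-reply error suggests Elasticsearch is serving HTTPS with security enabled.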
### Issue: Model download fails
**Solution**: Check your internet connection; models are downloaded from Hugging Face/ModelScope.
```bash
# Pre-download models (optional)
python -c "from embeddings import BgeEncoder; BgeEncoder()"
python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"
```
### Issue: Out of memory during embedding generation
**Solution**: Reduce batch size or skip embeddings initially
```bash
python ingest_customer1.py --skip-embeddings --limit 1000
```
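Reducing the batch size helps because only one batch of inputs and vectors is held in flight at a time. The sketch below shows the idea in isolation; `batched`, `embed_in_batches`, and the `encode` callable are hypothetical stand-ins, not the project's ingest code.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


def embed_in_batches(texts, encode, batch_size=32):
    """Encode texts batch by batch; `encode` maps a list of texts to vectors."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(encode(batch))
    return vectors
```

Peak memory during encoding then scales with `batch_size` rather than with the size of the whole dataset, which is why a smaller value is the first thing to try.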
### Issue: Translation not working
**Solution**: Set the DeepL API key. Without it, translation falls back to mock mode, which returns the original text unchanged.
```bash
export DEEPL_API_KEY="your-key"
```
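The fallback behavior described above can be pictured as follows. This is only an illustration of the pattern: `make_translator` and the `backend` callable (a stand-in for whatever wraps the DeepL API) are assumptions, not the project's translation module.

```python
import os


def make_translator(api_key=None, backend=None):
    """Return translate(text, target_lang); mock mode echoes the input."""
    if api_key is None:
        api_key = os.environ.get("DEEPL_API_KEY", "")

    def mock_translate(text, target_lang):
        # Mock mode: no key or no backend, so return the text unchanged.
        return text

    if not api_key or backend is None:
        return mock_translate

    def real_translate(text, target_lang):
        # `backend` is a hypothetical callable that performs the API call.
        return backend(text, target_lang, api_key)

    return real_translate
```

If search results for a Chinese query look identical with and without translation enabled, an unset key and the resulting mock mode are a likely cause.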
## Testing
### Test Health
```bash
curl http://localhost:6002/admin/health
```
### Test Configuration
```bash
curl http://localhost:6002/admin/config
```
### Test Index Stats
```bash
curl http://localhost:6002/admin/stats
```
### Test Search
```bash
# Chinese query (auto-translates to English/Russian)
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "消防套", "size": 5}'
# English query
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "fire control set", "size": 5}'
# Russian query
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "Наборы для пожаротушения", "size": 5}'
```
## What's Next?
1. **Customize Configuration**: Edit `config/schema/customer1_config.yaml`
2. **Add More Data**: Ingest your own product data
3. **Tune Ranking**: Adjust ranking expression in config
4. **Add Rewrite Rules**: Update via API `/admin/rewrite-rules`
5. **Monitor Performance**: Check `/admin/stats` endpoint
6. **Scale**: Deploy to production with proper ES cluster
## Architecture Quick Reference
```
Query Flow:
User Query → QueryParser (normalize, rewrite, translate, embed)
→ Searcher (boolean parse, build ES query)
→ Elasticsearch (BM25 + KNN)
→ RankingEngine (custom scoring)
→ Results
Indexing Flow:
CSV Data → DataTransformer (field mapping, embeddings)
→ BulkIndexer (batch processing)
→ Elasticsearch
```
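The "BM25 + KNN" step in the query flow corresponds to an Elasticsearch search body that carries both a text `query` and a `knn` clause, which recent 8.x releases such as the 8.11.0 image used in this guide accept in a single request. The sketch below builds such a body; the `build_hybrid_query` helper and the `title` and `title_embedding` field names are assumptions, and the project's actual query builder and mapping may differ.

```python
def build_hybrid_query(text, query_vector, size=10, k=50,
                       text_field="title", vector_field="title_embedding"):
    """Combine a BM25 match clause with an approximate kNN clause."""
    return {
        "size": size,
        # Lexical part: a match query scored with BM25 by default.
        "query": {"match": {text_field: text}},
        # Vector part: approximate nearest neighbors on a dense_vector field.
        "knn": {
            "field": vector_field,
            "query_vector": list(query_vector),
            "k": k,
            # num_candidates must be at least k; more candidates trade
            # speed for recall.
            "num_candidates": max(k * 2, 100),
        },
    }
```

The resulting dictionary can be sent as the JSON body of a `_search` request; the custom scoring applied afterwards by the RankingEngine is outside the scope of this sketch.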
## Support
For issues or questions, refer to:
- **README.md**: Comprehensive documentation
- **IMPLEMENTATION_SUMMARY.md**: Technical details
- **CLAUDE.md**: Development guidelines
- **API Docs**: http://localhost:6002/docs