# Quick Start Guide
## Prerequisites
1. **Python 3.8+**
2. **Elasticsearch 8.x** (running on localhost:9200 or remote)
3. **Optional**: CUDA-enabled GPU for faster embeddings
## Installation
### 1. Install Dependencies
```bash
cd /data/tw/SearchEngine
pip install -r requirements.txt
```
### 2. Set Environment Variables (Optional)
```bash
# Elasticsearch
export ES_HOST="http://localhost:9200"
# DeepL API (for translation)
export DEEPL_API_KEY="your-api-key-here"
# Tenant ID
export TENANT_ID="tenant1"
```
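Because all three variables are optional, a Python process that consumes them needs fallbacks. The sketch below shows one way that could look. `load_settings` is a hypothetical helper, and the defaults are assumptions taken from the examples in this guide; the service's actual configuration loading may differ.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the optional variables from this guide, with fallbacks.

    Hypothetical helper: the project's real config loading may differ.
    """
    env = os.environ if env is None else env
    return {
        # Falls back to the local instance used throughout this guide.
        "es_host": env.get("ES_HOST", "http://localhost:9200"),
        # None signals that no DeepL key was provided.
        "deepl_api_key": env.get("DEEPL_API_KEY") or None,
        # Assumed default, taken from the export example above.
        "tenant_id": env.get("TENANT_ID", "tenant1"),
    }
```

Calling `load_settings()` with no argument reads the current process environment; passing a plain dict makes the function easy to exercise in tests.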
## Running the System
### Option 1: Quick Test (Without Full Data)
```bash
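# 1. Start Elasticsearch (if not running)
#    Elasticsearch 8.x enables security by default; it is disabled here so that
#    plain http://localhost:9200 works. Use this for local testing only.
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0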
# 2. Ingest sample data (100 documents for quick test)
cd data/tenant1
python ingest_tenant1.py \
--limit 100 \
--recreate-index \
--es-host http://localhost:9200 \
--skip-embeddings
# 3. Start API service
cd ../..
python -m api.app --host 0.0.0.0 --port 6002
# 4. Test search
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "消防", "size": 5}'
```
### Option 2: Full System with Embeddings
```bash
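# 1. Start Elasticsearch
#    Elasticsearch 8.x enables security by default; it is disabled here so that
#    plain http://localhost:9200 works. Use this for local testing only.
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" elasticsearch:8.11.0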
# 2. Ingest full dataset with embeddings (requires GPU, takes ~10-30 min)
cd data/tenant1
python ingest_tenant1.py \
--csv goods_with_pic.5years_congku.csv.shuf.1w \
--recreate-index \
--batch-size 100 \
--es-host http://localhost:9200
# 3. Start API service
cd ../..
python -m api.app \
--host 0.0.0.0 \
--port 6002 \
--tenant tenant1 \
--es-host http://localhost:9200
# 4. Test various searches
# Simple search
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "芭比娃娃", "size": 10}'
# Boolean search
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "toy AND (barbie OR doll)", "size": 10}'
# With filters
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
```
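The `curl` calls above all POST a JSON body with `query`, `size`, and optionally `filters` to `/search/`. For scripted testing, the same requests can be built in Python with only the standard library. This is a minimal sketch: `build_search_request` and `search` are hypothetical helper names, and the response is assumed to be JSON (its exact fields are not documented here).

```python
import json
import urllib.request
from typing import Optional


def build_search_request(query: str, size: int = 10,
                         filters: Optional[dict] = None,
                         base_url: str = "http://localhost:6002") -> urllib.request.Request:
    """Build the same POST request as the curl examples, without sending it."""
    body = {"query": query, "size": size}
    if filters:
        body["filters"] = filters
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/search/",
        # ensure_ascii=False keeps Chinese and Russian text readable in the payload.
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, size: int = 10, filters: Optional[dict] = None,
           base_url: str = "http://localhost:6002", timeout: float = 10.0) -> dict:
    """Send the request and decode the response, assumed here to be JSON."""
    req = build_search_request(query, size, filters, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the API service running, `search("娃娃", filters={"categoryName_keyword": "芭比"})` mirrors the filtered `curl` example.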
## API Documentation
Once the service is running, visit:
- **Swagger UI**: http://localhost:6002/docs
- **ReDoc**: http://localhost:6002/redoc
## Common Issues
### Issue: Elasticsearch connection failed
**Solution**: Ensure Elasticsearch is running and accessible
```bash
curl http://localhost:9200
```
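The same reachability check can be run from Python, for example before starting ingestion. The sketch below uses only the standard library; `es_reachable` is a hypothetical helper name. It treats any non-200 outcome as unreachable, so an instance that demands authentication also reports `False`.

```python
import urllib.error
import urllib.request


def es_reachable(es_host: str = "http://localhost:9200", timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET on the Elasticsearch root URL returns 200."""
    try:
        with urllib.request.urlopen(es_host, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error status,
        # dropped connection, or a malformed URL.
        return False
```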
### Issue: Model download fails
**Solution**: Check your internet connection; the models are downloaded from Hugging Face/ModelScope.
```bash
# Pre-download models (optional)
python -c "from embeddings import BgeEncoder; BgeEncoder()"
python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"
```
### Issue: Out of memory during embedding generation
**Solution**: Reduce batch size or skip embeddings initially
```bash
# Run from data/tenant1
python ingest_tenant1.py --skip-embeddings --limit 1000
```
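Reducing the batch size helps because only one batch of documents (and its embeddings) needs to be held in memory at a time. The sketch below illustrates the general chunking technique behind an option like `--batch-size`; `iter_batches` is a hypothetical helper and not necessarily how the ingest script implements it.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list so the yielded batch can be freed
    if batch:
        yield batch  # final, possibly smaller, batch
```

Peak memory then scales with `batch_size` rather than with the size of the whole dataset.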
### Issue: Translation not working
**Solution**: Set the DeepL API key. Without it, translation falls back to mock mode and returns the original text unchanged.
```bash
export DEEPL_API_KEY="your-key"
```
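The mock-mode behavior described above amounts to a passthrough when no key is configured. The following sketch illustrates that fallback pattern only; `translate_or_passthrough` and the `Translator` callable are hypothetical names, and the project's real translation code and DeepL client calls are not shown here.

```python
from typing import Callable, Optional

# Hypothetical shape of a real translation call: (text, target_lang, api_key) -> str
Translator = Callable[[str, str, str], str]


def translate_or_passthrough(text: str, target_lang: str,
                             api_key: Optional[str],
                             translator: Optional[Translator] = None) -> str:
    """Translate when a key and translator are available, else return text as-is."""
    if not api_key or translator is None:
        # Mock mode: no key configured, so the original text is returned.
        return text
    return translator(text, target_lang, api_key)
```

If search results for a Chinese query only match Chinese text, an unset `DEEPL_API_KEY` is a likely cause.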
## Testing
### Test Health
```bash
curl http://localhost:6002/admin/health
```
### Test Configuration
```bash
curl http://localhost:6002/admin/config
```
### Test Index Stats
```bash
curl http://localhost:6002/admin/stats
```
### Test Search
```bash
# Chinese query (auto-translates to English/Russian)
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "消防套", "size": 5}'
# English query
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "fire control set", "size": 5}'
# Russian query
curl -X POST http://localhost:6002/search/ \
-H "Content-Type: application/json" \
-d '{"query": "Наборы для пожаротушения", "size": 5}'
```
## What's Next?
1. **Customize Configuration**: Edit `config/schema/tenant1_config.yaml`
2. **Add More Data**: Ingest your own product data
3. **Tune Ranking**: Adjust ranking expression in config
4. **Add Rewrite Rules**: Update via API `/admin/rewrite-rules`
5. **Monitor Performance**: Check `/admin/stats` endpoint
6. **Scale**: Deploy to production with proper ES cluster
## Architecture Quick Reference
```
Query Flow:
User Query → QueryParser (normalize, rewrite, translate, embed)
→ Searcher (boolean parse, build ES query)
→ Elasticsearch (BM25 + KNN)
→ RankingEngine (custom scoring)
→ Results
Indexing Flow:
CSV Data → DataTransformer (field mapping, embeddings)
→ BulkIndexer (batch processing)
→ Elasticsearch
```
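The "BM25 + KNN" step in the query flow could correspond to an Elasticsearch 8.x search body that combines a text `query` with a top-level `knn` clause. The sketch below is an assumption about the general shape, not the Searcher's actual output: `build_hybrid_query` is a hypothetical helper, and the field names `title` and `title_embedding` are placeholders rather than the tenant's real mapping.

```python
from typing import Sequence


def build_hybrid_query(text: str, query_vector: Sequence[float],
                       text_fields: Sequence[str] = ("title",),
                       vector_field: str = "title_embedding",
                       size: int = 10, k: int = 50,
                       num_candidates: int = 200) -> dict:
    """Build a search body mixing BM25 text matching with approximate kNN."""
    if num_candidates < k:
        raise ValueError("num_candidates must be >= k")
    return {
        "size": size,
        # Lexical part: multi_match is scored with BM25 by default.
        "query": {"multi_match": {"query": text, "fields": list(text_fields)}},
        # Vector part: approximate nearest neighbours on a dense_vector field.
        "knn": {
            "field": vector_field,
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
        },
    }
```

Elasticsearch combines the scores of both clauses for documents that match either one; any custom scoring by the RankingEngine would then be applied on top of those hits.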
## Support
For issues or questions, refer to:
- **README.md**: Comprehensive documentation
- **IMPLEMENTATION_SUMMARY.md**: Technical details
- **CLAUDE.md**: Development guidelines
- **API Docs**: http://localhost:6002/docs