  # Quick Start Guide
  
  ## Prerequisites
  
  1. **Python 3.8+**
  2. **Elasticsearch 8.x** (running on localhost:9200 or remote)
  3. **Optional**: CUDA-enabled GPU for faster embeddings
  
  ## Installation
  
  ### 1. Install Dependencies
  
  ```bash
  # Change to your checkout of the repository (adjust this path to your environment)
  cd /data/tw/SearchEngine
  pip install -r requirements.txt
  ```
  
  ### 2. Set Environment Variables (Optional)
  
  ```bash
  # Elasticsearch
  export ES_HOST="http://localhost:9200"
  
  # DeepL API (for translation)
  export DEEPL_API_KEY="your-api-key-here"
  
  # Customer ID
  export CUSTOMER_ID="customer1"
  ```
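To make the role of these variables concrete, here is a minimal sketch of how a Python process could read them. The `Settings` class, the `load_settings` helper, and the fallback values are assumptions for illustration; the project's own configuration code may read them differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three variables named in this guide."""
    es_host: str
    deepl_api_key: Optional[str]
    customer_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables from `env`, using assumed defaults when unset."""
    return Settings(
        es_host=env.get("ES_HOST", "http://localhost:9200"),
        # A missing or empty key is treated as "no key configured".
        deepl_api_key=env.get("DEEPL_API_KEY") or None,
        customer_id=env.get("CUSTOMER_ID", "customer1"),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.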
  
  ## Running the System
  
  ### Option 1: Quick Test (Without Full Data)
  
  ```bash
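  # 1. Start Elasticsearch (if not running)
  # Elasticsearch 8.x enables TLS and authentication by default; security is
  # disabled here for local testing so that plain http://localhost:9200 works.
  docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0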
  
  # 2. Ingest sample data (100 documents for quick test)
  cd data/customer1
  python ingest_customer1.py \
    --limit 100 \
    --recreate-index \
    --es-host http://localhost:9200 \
    --skip-embeddings
  
  # 3. Start API service
  cd ../..
  python -m api.app --host 0.0.0.0 --port 6002
  
  # 4. Test search
  curl -X POST http://localhost:6002/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "消防", "size": 5}'
  ```
  
  ### Option 2: Full System with Embeddings
  
  ```bash
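  # 1. Start Elasticsearch
  # Elasticsearch 8.x enables TLS and authentication by default; security is
  # disabled here for local testing so that plain http://localhost:9200 works.
  docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" elasticsearch:8.11.0
  
  # 2. Ingest full dataset with embeddings (GPU recommended, takes ~10-30 min)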
  cd data/customer1
  python ingest_customer1.py \
    --csv goods_with_pic.5years_congku.csv.shuf.1w \
    --recreate-index \
    --batch-size 100 \
    --es-host http://localhost:9200
  
  # 3. Start API service
  cd ../..
  python -m api.app \
    --host 0.0.0.0 \
    --port 6002 \
    --customer customer1 \
    --es-host http://localhost:9200
  
  # 4. Test various searches
  # Simple search
  curl -X POST http://localhost:6002/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "芭比娃娃", "size": 10}'
  
  # Boolean search
  curl -X POST http://localhost:6002/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "toy AND (barbie OR doll)", "size": 10}'
  
  # With filters
  curl -X POST http://localhost:6002/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
  ```
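The same requests can be issued from Python. The sketch below uses only the standard library and mirrors the `curl` calls above (the `/search/` path, port 6002, and the `query`, `size`, and `filters` keys). The helper names `build_search_request` and `search` are hypothetical, and the shape of the JSON response is not specified in this guide, so it is returned as-is.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

API_BASE = "http://localhost:6002"  # port used throughout this guide


def build_search_request(
    query: str,
    size: int = 10,
    filters: Optional[Dict[str, Any]] = None,
    base_url: str = API_BASE,
) -> urllib.request.Request:
    """Build the POST request that the curl examples send to /search/."""
    payload: Dict[str, Any] = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return urllib.request.Request(
        url=f"{base_url}/search/",
        # ensure_ascii=False keeps Chinese/Russian text readable in the body.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, **kwargs: Any) -> Any:
    """Send the request and return the decoded JSON response unchanged."""
    request = build_search_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the API service running, `search("娃娃", filters={"categoryName_keyword": "芭比"})` would send the same body as the filtered `curl` example.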
  
  ## API Documentation
  
  Once the service is running, visit:
  - **Swagger UI**: http://localhost:6002/docs
  - **ReDoc**: http://localhost:6002/redoc
  
  ## Common Issues
  
  ### Issue: Elasticsearch connection failed
  **Solution**: Ensure Elasticsearch is running and accessible
  ```bash
  curl http://localhost:9200
  ```
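If you prefer to run this check from Python (for example before starting ingestion), a minimal sketch follows. It relies on the `version.number` field that Elasticsearch returns from its root endpoint; the function name `es_is_reachable` and the injectable `opener` parameter are illustrative choices, not part of this project.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Tuple


def es_is_reachable(
    es_host: str = "http://localhost:9200",
    opener: Callable = urllib.request.urlopen,
    timeout: float = 5.0,
) -> Tuple[bool, str]:
    """Return (ok, detail) after requesting the Elasticsearch root endpoint."""
    try:
        with opener(es_host, timeout=timeout) as resp:
            info = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Covers refused connections, timeouts, HTTP errors, and non-JSON replies.
        return False, f"cannot reach {es_host}: {exc}"
    version = info.get("version", {}).get("number", "unknown")
    return True, f"Elasticsearch {version} at {es_host}"


if __name__ == "__main__":
    ok, detail = es_is_reachable()
    print(("OK: " if ok else "FAILED: ") + detail)
```

A failure here usually means the container is not running, the port is not published, or security is enabled and the server expects HTTPS with credentials.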
  
  ### Issue: Model download fails
  **Solution**: Check your internet connection. Models are downloaded from Hugging Face/ModelScope.
  ```bash
  # Pre-download models (optional)
  python -c "from embeddings import BgeEncoder; BgeEncoder()"
  python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"
  ```
  
  ### Issue: Out of memory during embedding generation
  **Solution**: Reduce the batch size (`--batch-size`) or skip embeddings initially
  ```bash
  python ingest_customer1.py --skip-embeddings --limit 1000
  ```
  
  ### Issue: Translation not working
  **Solution**: Set the DeepL API key. Without it, translation falls back to mock mode, which returns the original text.
  ```bash
  export DEEPL_API_KEY="your-key"
  ```
  
  ## Testing
  
  ### Test Health
  ```bash
  curl http://localhost:6002/admin/health
  ```
  
  ### Test Configuration
  ```bash
  curl http://localhost:6002/admin/config
  ```
  
  ### Test Index Stats
  ```bash
  curl http://localhost:6002/admin/stats
  ```
  
  ### Test Search
  ```bash
  # Chinese query (auto-translates to English/Russian)
  curl -X POST http://localhost:6002/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "消防套", "size": 5}'
  
  # English query
  curl -X POST http://localhost:6002/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "fire control set", "size": 5}'
  
  # Russian query
  curl -X POST http://localhost:6002/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "Наборы для пожаротушения", "size": 5}'
  ```
  
  ## What's Next?
  
  1. **Customize Configuration**: Edit `config/schema/customer1_config.yaml`
  2. **Add More Data**: Ingest your own product data
  3. **Tune Ranking**: Adjust ranking expression in config
  4. **Add Rewrite Rules**: Update via API `/admin/rewrite-rules`
  5. **Monitor Performance**: Check `/admin/stats` endpoint
  6. **Scale**: Deploy to production with a properly provisioned Elasticsearch cluster
  
  ## Architecture Quick Reference
  
  ```
  Query Flow:
  User Query → QueryParser (normalize, rewrite, translate, embed)
           → Searcher (boolean parse, build ES query)
           → Elasticsearch (BM25 + KNN)
           → RankingEngine (custom scoring)
           → Results
  
  Indexing Flow:
  CSV Data → DataTransformer (field mapping, embeddings)
          → BulkIndexer (batch processing)
          → Elasticsearch
  ```
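To illustrate the "BM25 + KNN" step in the query flow, the sketch below builds an Elasticsearch 8.x search body that combines a `match` query (scored with BM25) with the top-level `knn` option. It is not the project's `Searcher` implementation: the function name and the field names `name` and `name_embedding` are hypothetical placeholders, and the custom `RankingEngine` scoring is not modeled.

```python
from typing import Any, Dict, List, Optional


def build_hybrid_query(
    text: str,
    query_vector: Optional[List[float]],
    size: int = 10,
    text_field: str = "name",                # hypothetical text field
    vector_field: str = "name_embedding",    # hypothetical dense_vector field
    filters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build a search body with a BM25 match clause and an optional kNN clause."""
    filter_clauses = [{"term": {field: value}} for field, value in (filters or {}).items()]
    body: Dict[str, Any] = {
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {text_field: text}}],
                "filter": filter_clauses,
            }
        },
    }
    if query_vector is not None:
        body["knn"] = {
            "field": vector_field,
            "query_vector": query_vector,
            "k": size,
            # num_candidates must be at least k and at most 10000.
            "num_candidates": min(max(size * 10, 100), 10000),
        }
        if filter_clauses:
            # Apply the same filters to the vector side of the search.
            body["knn"]["filter"] = {"bool": {"filter": filter_clauses}}
    return body
```

When `query_vector` is `None` (for example, when embeddings were skipped at ingestion), the body degrades to a plain text query, which matches the quick-test path earlier in this guide.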
  
  ## Support
  
  For issues or questions, refer to:
  - **README.md**: Comprehensive documentation
  - **IMPLEMENTATION_SUMMARY.md**: Technical details
  - **CLAUDE.md**: Development guidelines
  - **API Docs**: http://localhost:6002/docs