  # Quick Start Guide
  
  ## Prerequisites
  
  1. **Python 3.8+**
  2. **Elasticsearch 8.x** (running on localhost:9200 or remote)
  3. **Optional**: CUDA-enabled GPU for faster embeddings
  
  ## Installation
  
  ### 1. Install Dependencies
  
  ```bash
  cd /data/tw/SearchEngine
  pip install -r requirements.txt
  ```
  
  ### 2. Set Environment Variables (Optional)
  
  ```bash
  # Elasticsearch
  export ES_HOST="http://localhost:9200"
  
  # DeepL API (for translation)
  export DEEPL_API_KEY="your-api-key-here"
  
  # Customer ID
  export CUSTOMER_ID="customer1"
  ```
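
These variables can be resolved in Python with the same defaults. A minimal sketch (the variable names mirror the examples above; confirm the exact names the project reads in its config module):

```python
import os

def get_settings(env=os.environ):
    """Resolve connection settings, falling back to the defaults shown above."""
    return {
        "es_host": env.get("ES_HOST", "http://localhost:9200"),
        "deepl_api_key": env.get("DEEPL_API_KEY"),  # None -> mock translation
        "customer_id": env.get("CUSTOMER_ID", "customer1"),
    }
```

Passing a plain dict instead of `os.environ` makes the fallback behavior easy to test.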
  
  ## Running the System
  
  ### Option 1: Quick Test (Without Full Data)
  
  ```bash
  # 1. Start Elasticsearch (if not running; security disabled for local testing only)
  docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0
  
  # 2. Ingest sample data (100 documents for quick test)
  cd data/customer1
  python ingest_customer1.py \
    --limit 100 \
    --recreate-index \
    --es-host http://localhost:9200 \
    --skip-embeddings
  
  # 3. Start API service
  cd ../..
  python -m api.app --host 0.0.0.0 --port 8000
  
  # 4. Test search
  curl -X POST http://localhost:8000/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "消防", "size": 5}'
  ```
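
The same request can be issued from Python with only the standard library. A minimal sketch (the `query`, `size`, and `filters` body fields are taken from the curl examples in this guide; nothing else is assumed about the response beyond it being JSON):

```python
import json
import urllib.request

def build_search_payload(query, size=5, filters=None):
    """Assemble the JSON body used by POST /search/ in the examples above."""
    payload = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return payload

def search(query, size=5, filters=None, base_url="http://localhost:8000"):
    """POST the query to the running API service and return the parsed JSON."""
    body = json.dumps(build_search_payload(query, size, filters)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/search/",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With the API service running, `search("消防", size=5)` mirrors the curl call above.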
  
  ### Option 2: Full System with Embeddings
  
  ```bash
  # 1. Start Elasticsearch (security disabled for local testing only)
  docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" elasticsearch:8.11.0
  
  # 2. Ingest full dataset with embeddings (GPU recommended; takes ~10-30 min)
  cd data/customer1
  python ingest_customer1.py \
    --csv goods_with_pic.5years_congku.csv.shuf.1w \
    --recreate-index \
    --batch-size 100 \
    --es-host http://localhost:9200
  
  # 3. Start API service
  cd ../..
  python -m api.app \
    --host 0.0.0.0 \
    --port 8000 \
    --customer customer1 \
    --es-host http://localhost:9200
  
  # 4. Test various searches
  # Simple search
  curl -X POST http://localhost:8000/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "芭比娃娃", "size": 10}'
  
  # Boolean search
  curl -X POST http://localhost:8000/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "toy AND (barbie OR doll)", "size": 10}'
  
  # With filters
  curl -X POST http://localhost:8000/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
  ```
  
  ## API Documentation
  
  Once the service is running, visit:
  - **Swagger UI**: http://localhost:8000/docs
  - **ReDoc**: http://localhost:8000/redoc
  
  ## Common Issues
  
  ### Issue: Elasticsearch connection failed
  **Solution**: Ensure Elasticsearch is running and accessible
  ```bash
  curl http://localhost:9200
  ```
  
  ### Issue: Model download fails
  **Solution**: Check your internet connection; models are downloaded from Hugging Face/ModelScope
  ```bash
  # Pre-download models (optional)
  python -c "from embeddings import BgeEncoder; BgeEncoder()"
  python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"
  ```
  
  ### Issue: Out of memory during embedding generation
  **Solution**: Reduce `--batch-size` or skip embeddings initially
  ```bash
  python ingest_customer1.py --skip-embeddings --limit 1000
  ```
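
Lowering `--batch-size` just means the encoder sees more, smaller batches, which bounds peak memory. A toy sketch of the idea (not the project's actual ingest code):

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches; the last one may be shorter."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Each embedding call then only ever sees `batch_size` documents at a time.
```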
  
  ### Issue: Translation not working
  **Solution**: Set a DeepL API key; without one, translation falls back to mock mode (returns the original text)
  ```bash
  export DEEPL_API_KEY="your-key"
  ```
  
  ## Testing
  
  ### Test Health
  ```bash
  curl http://localhost:8000/admin/health
  ```
  
  ### Test Configuration
  ```bash
  curl http://localhost:8000/admin/config
  ```
  
  ### Test Index Stats
  ```bash
  curl http://localhost:8000/admin/stats
  ```
  
  ### Test Search
  ```bash
  # Chinese query (auto-translates to English/Russian)
  curl -X POST http://localhost:8000/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "消防套", "size": 5}'
  
  # English query
  curl -X POST http://localhost:8000/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "fire control set", "size": 5}'
  
  # Russian query
  curl -X POST http://localhost:8000/search/ \
    -H "Content-Type: application/json" \
    -d '{"query": "Наборы для пожаротушения", "size": 5}'
  ```
  
  ## What's Next?
  
  1. **Customize Configuration**: Edit `config/schema/customer1_config.yaml`
  2. **Add More Data**: Ingest your own product data
  3. **Tune Ranking**: Adjust ranking expression in config
  4. **Add Rewrite Rules**: Update via API `/admin/rewrite-rules`
  5. **Monitor Performance**: Check `/admin/stats` endpoint
  6. **Scale**: Deploy to production with proper ES cluster
  
  ## Architecture Quick Reference
  
  ```
  Query Flow:
  User Query → QueryParser (normalize, rewrite, translate, embed)
           → Searcher (boolean parse, build ES query)
           → Elasticsearch (BM25 + KNN)
           → RankingEngine (custom scoring)
           → Results
  
  Indexing Flow:
  CSV Data → DataTransformer (field mapping, embeddings)
          → BulkIndexer (batch processing)
          → Elasticsearch
  ```
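
The query flow above can be sketched as a chain of plain functions. This is a toy illustration of the stage order only; the stage names mirror the diagram, but the bodies are stand-ins, not the project's real classes:

```python
def parse(query):
    # QueryParser: normalize / rewrite / translate / embed (stubbed).
    return {"text": query.strip().lower(), "vector": None}

def build_es_query(parsed):
    # Searcher: boolean parse + ES query construction (stubbed as BM25-only).
    return {"query": {"match": {"title": parsed["text"]}}, "size": 10}

def execute(es_query):
    # Elasticsearch: BM25 + KNN retrieval (stubbed with canned hits).
    return [{"id": 1, "score": 2.0}, {"id": 2, "score": 1.5}]

def rank(hits):
    # RankingEngine: custom scoring; here just a re-sort by score.
    return sorted(hits, key=lambda h: h["score"], reverse=True)

def search(query):
    # The full pipeline: parse -> build -> execute -> rank.
    return rank(execute(build_es_query(parse(query))))
```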
  
  ## Support
  
  For issues or questions, refer to:
  - **README.md**: Comprehensive documentation
  - **IMPLEMENTATION_SUMMARY.md**: Technical details
  - **CLAUDE.md**: Development guidelines
  - **API Docs**: http://localhost:8000/docs