# Quick Start Guide
## Prerequisites
1. **Python 3.8+**
2. **Elasticsearch 8.x** (running on localhost:9200 or remote)
3. **Optional**: CUDA-enabled GPU for faster embeddings
## Installation
### 1. Install Dependencies
```bash
cd /data/tw/SearchEngine
pip install -r requirements.txt
```
### 2. Set Environment Variables (Optional)
```bash
# Elasticsearch
export ES_HOST="http://localhost:9200"
# DeepL API (for translation)
export DEEPL_API_KEY="your-api-key-here"
# Customer ID
export CUSTOMER_ID="customer1"
```
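For orientation, here is a minimal sketch of how these variables could be read from Python. The `Settings` class and `load_settings` function are hypothetical illustrations, not the project's actual configuration code, and the fallback values are assumptions that simply mirror the examples above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three variables shown above."""
    es_host: str
    customer_id: str
    deepl_api_key: Optional[str]

    @property
    def translation_mocked(self) -> bool:
        # Without a DeepL key, translation falls back to mock mode
        # (see "Translation not working" under Common Issues).
        return self.deepl_api_key is None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment; the defaults are assumptions."""
    return Settings(
        es_host=env.get("ES_HOST", "http://localhost:9200"),
        customer_id=env.get("CUSTOMER_ID", "customer1"),
        # Treat an empty string the same as an unset key.
        deepl_api_key=env.get("DEEPL_API_KEY") or None,
    )
```

Calling `load_settings()` with no arguments reads the live process environment; passing a plain dict makes the function easy to exercise in tests.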
## Running the System
### Option 1: Quick Test (Without Full Data)
```bash
# 1. Start Elasticsearch (if not running)
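# Security is disabled so the plain-HTTP URLs in this guide work; ES 8.x enables TLS and auth by default.
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0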
# 2. Ingest sample data (100 documents for quick test)
cd data/customer1
python ingest_customer1.py \
  --limit 100 \
  --recreate-index \
  --es-host http://localhost:9200 \
  --skip-embeddings
# 3. Start API service
cd ../..
python -m api.app --host 0.0.0.0 --port 6002
# 4. Test search
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "消防", "size": 5}'
```
### Option 2: Full System with Embeddings
```bash
# 1. Start Elasticsearch
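# Security is disabled so the plain-HTTP URLs in this guide work; ES 8.x enables TLS and auth by default.
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" elasticsearch:8.11.0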
# 2. Ingest full dataset with embeddings (GPU recommended; ~10-30 min with a GPU)
cd data/customer1
python ingest_customer1.py \
  --csv goods_with_pic.5years_congku.csv.shuf.1w \
  --recreate-index \
  --batch-size 100 \
  --es-host http://localhost:9200
# 3. Start API service
cd ../..
python -m api.app \
  --host 0.0.0.0 \
  --port 6002 \
  --customer customer1 \
  --es-host http://localhost:9200
# 4. Test various searches
# Simple search
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "芭比娃娃", "size": 10}'
# Boolean search
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "toy AND (barbie OR doll)", "size": 10}'
# With filters
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
```
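The same three requests can be issued from Python using only the standard library. This is a sketch under stated assumptions: `build_search_request` and `search` are hypothetical helper names, and the response is returned as parsed JSON without interpreting it, because the response schema is not described in this guide.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

SEARCH_URL = "http://localhost:6002/search/"


def build_search_request(query: str, size: int = 10,
                         filters: Optional[Dict[str, Any]] = None,
                         url: str = SEARCH_URL) -> urllib.request.Request:
    """Build the POST request that the curl examples send."""
    payload: Dict[str, Any] = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, size: int = 10,
           filters: Optional[Dict[str, Any]] = None,
           timeout: float = 10.0) -> Any:
    """Send the request and return the decoded JSON body as-is."""
    request = build_search_request(query, size, filters)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API service running, `search("芭比娃娃")`, `search("toy AND (barbie OR doll)")`, and `search("娃娃", filters={"categoryName_keyword": "芭比"})` correspond to the three curl calls above.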
## API Documentation
Once the service is running, visit:
- **Swagger UI**: http://localhost:6002/docs
- **ReDoc**: http://localhost:6002/redoc
## Common Issues
### Issue: Elasticsearch connection failed
**Solution**: Confirm that Elasticsearch is running and reachable at the configured host:
```bash
curl http://localhost:9200
```
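A freshly started Elasticsearch container can take a little while before it answers, so a single check may fail even when nothing is wrong. The sketch below polls the URL a few times before giving up. The helper names `is_reachable` and `wait_for`, along with the attempt count and delay, are illustrative assumptions rather than part of this project.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, HTTP errors such as 401,
        # and malformed replies (for example plain HTTP against a TLS port).
        return False


def wait_for(url: str, attempts: int = 10, delay: float = 3.0,
             probe: Callable[[str], bool] = is_reachable,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return True
        if attempt < attempts:
            sleep(delay)
    return False
```

For example, `wait_for("http://localhost:9200")` can be called before starting ingestion; a `False` result after all attempts suggests checking the container logs.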
### Issue: Model download fails
**Solution**: Check your internet connection. Models are downloaded from Hugging Face/ModelScope on first use.
```bash
# Pre-download models (optional)
python -c "from embeddings import BgeEncoder; BgeEncoder()"
python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"
```
### Issue: Out of memory during embedding generation
**Solution**: Reduce the batch size (`--batch-size`) or skip embeddings on the first pass:
```bash
python ingest_customer1.py --skip-embeddings --limit 1000
```
### Issue: Translation not working
**Solution**: Set the `DEEPL_API_KEY` environment variable. Without it, translation runs in mock mode and returns the original text.
```bash
export DEEPL_API_KEY="your-key"
```
## Testing
### Test Health
```bash
curl http://localhost:6002/admin/health
```
### Test Configuration
```bash
curl http://localhost:6002/admin/config
```
### Test Index Stats
```bash
curl http://localhost:6002/admin/stats
```
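The three admin checks above can be combined into one small smoke test. This is a sketch only: `http_status` and `check_admin` are hypothetical helpers, and treating HTTP 200 as "healthy" is an assumption, since the response bodies of these endpoints are not described here.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

BASE_URL = "http://localhost:6002"
ADMIN_PATHS = ("/admin/health", "/admin/config", "/admin/stats")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # no answer at all (refused, timeout, DNS failure)


def check_admin(base_url: str = BASE_URL,
                fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each admin path to True when it answers with HTTP 200."""
    root = base_url.rstrip("/")
    return {path: fetch(root + path) == 200 for path in ADMIN_PATHS}
```

Running `check_admin()` against a live service returns a dict such as `{"/admin/health": ..., "/admin/config": ..., "/admin/stats": ...}` that can be inspected or asserted on in a deployment script.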
### Test Search
```bash
# Chinese query (auto-translates to English/Russian)
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "消防套", "size": 5}'
# English query
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "fire control set", "size": 5}'
# Russian query
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "Наборы для пожаротушения", "size": 5}'
```
## What's Next?
1. **Customize Configuration**: Edit `config/schema/customer1_config.yaml`
2. **Add More Data**: Ingest your own product data
3. **Tune Ranking**: Adjust the ranking expression in the config
4. **Add Rewrite Rules**: Update via API `/admin/rewrite-rules`
5. **Monitor Performance**: Check `/admin/stats` endpoint
6. **Scale**: Deploy to production with a properly sized ES cluster
## Architecture Quick Reference
```
Query Flow:
User Query → QueryParser (normalize, rewrite, translate, embed)
         → Searcher (boolean parse, build ES query)
         → Elasticsearch (BM25 + KNN)
         → RankingEngine (custom scoring)
         → Results
Indexing Flow:
CSV Data → DataTransformer (field mapping, embeddings)
        → BulkIndexer (batch processing)
        → Elasticsearch
```
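To make the "BM25 + KNN" step concrete, the sketch below builds an Elasticsearch 8.x search body that combines a `match` query with the top-level `knn` option. It is not this project's actual query builder: the function name and the field names `name` and `name_embedding` are hypothetical, and the real Searcher may construct a richer query.

```python
from typing import Any, Dict, List


def build_hybrid_body(text: str, vector: List[float], size: int = 10,
                      text_field: str = "name",              # hypothetical field
                      vector_field: str = "name_embedding",  # hypothetical field
                      num_candidates: int = 100) -> Dict[str, Any]:
    """Build a search body with a BM25 match query plus approximate kNN."""
    if num_candidates < size:
        # Elasticsearch rejects kNN searches where num_candidates < k.
        raise ValueError("num_candidates must be >= size")
    return {
        "size": size,
        # Lexical part: scored with BM25 by default.
        "query": {"match": {text_field: {"query": text}}},
        # Vector part: approximate nearest neighbours on a dense_vector field.
        "knn": {
            "field": vector_field,
            "query_vector": vector,
            "k": size,
            "num_candidates": num_candidates,
        },
    }
```

When both `query` and `knn` are present, Elasticsearch combines the matches from the two parts and sums their scores for documents found by both, which is the candidate set a later custom ranking stage would then reorder.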
## Support
For issues or questions, refer to:
- **README.md**: Comprehensive documentation
- **IMPLEMENTATION_SUMMARY.md**: Technical details
- **CLAUDE.md**: Development guidelines
- **API Docs**: http://localhost:6002/docs