Quick Start Guide
Prerequisites
- Python 3.8+
- Elasticsearch 8.x (running on localhost:9200 or remote)
- Optional: CUDA-enabled GPU for faster embeddings
Installation
1. Install Dependencies
cd /data/tw/SearchEngine
pip install -r requirements.txt
2. Set Environment Variables (Optional)
# Elasticsearch
export ES_HOST="http://localhost:9200"
# DeepL API (for translation)
export DEEPL_API_KEY="your-api-key-here"
# Customer ID
export CUSTOMER_ID="customer1"
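As a rough illustration of how these variables might be consumed on the Python side, the sketch below reads them with fallbacks. The `Settings` dataclass and `load_settings` helper are hypothetical names, and the defaults simply mirror the example values above; this is not the project's actual configuration loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three optional variables."""
    es_host: str
    deepl_api_key: Optional[str]
    customer_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables, falling back to the example values shown above."""
    return Settings(
        es_host=env.get("ES_HOST", "http://localhost:9200"),
        # A missing or empty key is treated as "no key configured".
        deepl_api_key=env.get("DEEPL_API_KEY") or None,
        customer_id=env.get("CUSTOMER_ID", "customer1"),
    )
```

Passing the environment in as a mapping keeps the helper easy to exercise without touching the real process environment, for example `load_settings({"ES_HOST": "http://es:9200"})`.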
Running the System
Option 1: Quick Test (Without Full Data)
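# 1. Start Elasticsearch (if not running); security is disabled here for local testing only,
#    because the 8.x image otherwise enables TLS and authentication and plain http:// calls fail
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0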
# 2. Ingest sample data (100 documents for quick test)
cd data/customer1
python ingest_customer1.py \
--limit 100 \
--recreate-index \
--es-host http://localhost:9200 \
--skip-embeddings
# 3. Start API service
cd ../..
python -m api.app --host 0.0.0.0 --port 8000
# 4. Test search
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "消防", "size": 5}'
Option 2: Full System with Embeddings
# 1. Start Elasticsearch with a 4 GB heap (security disabled for local testing only)
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" elasticsearch:8.11.0
# 2. Ingest full dataset with embeddings (GPU recommended, takes ~10-30 min)
cd data/customer1
python ingest_customer1.py \
--csv goods_with_pic.5years_congku.csv.shuf.1w \
--recreate-index \
--batch-size 100 \
--es-host http://localhost:9200
# 3. Start API service
cd ../..
python -m api.app \
--host 0.0.0.0 \
--port 8000 \
--customer customer1 \
--es-host http://localhost:9200
# 4. Test various searches
# Simple search
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "芭比娃娃", "size": 10}'
# Boolean search
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "toy AND (barbie OR doll)", "size": 10}'
# With filters
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
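To make the boolean search example above more concrete, here is a minimal sketch of how an expression such as `toy AND (barbie OR doll)` could be turned into an Elasticsearch `bool` query. It is not the project's parser: the grammar (only `AND`, `OR`, and parentheses, with `AND` binding tighter than `OR`), the `to_es_query` name, and the `name` field are all assumptions made for illustration.

```python
import re
from typing import Any, Dict, List, Optional

TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")
OPERATORS = ("AND", "OR")


def to_es_query(query: str, field: str = "name") -> Dict[str, Any]:
    """Parse AND/OR/parentheses into nested Elasticsearch bool clauses."""
    tokens: List[str] = TOKEN_RE.findall(query)
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def advance() -> str:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or() -> Dict[str, Any]:
        clauses = [parse_and()]
        while peek() == "OR":
            advance()
            clauses.append(parse_and())
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"should": clauses, "minimum_should_match": 1}}

    def parse_and() -> Dict[str, Any]:
        clauses = [parse_term()]
        while peek() == "AND":
            advance()
            clauses.append(parse_term())
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"must": clauses}}

    def parse_term() -> Dict[str, Any]:
        tok = peek()
        if tok is None or tok == ")" or tok in OPERATORS:
            raise ValueError(f"unexpected token: {tok!r}")
        advance()
        if tok == "(":
            node = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return node
        # Adjacent words without an operator are kept together as one match text.
        words = [tok]
        while peek() is not None and peek() not in ("(", ")") + OPERATORS:
            words.append(advance())
        return {"match": {field: " ".join(words)}}

    tree = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return tree
```

With these assumptions, `toy AND (barbie OR doll)` becomes a `must` clause holding a `match` on `toy` plus a nested `should` clause over `barbie` and `doll`.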
API Documentation
Once the service is running, visit:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Common Issues
Issue: Elasticsearch connection failed
Solution: Ensure Elasticsearch is running and accessible
curl http://localhost:9200
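A freshly started container can take a little while before it answers, so a single `curl` may fail even when nothing is wrong. The sketch below polls the same URL from Python using only the standard library. The helper names `es_is_up` and `wait_until`, along with the attempt count and delay, are illustrative assumptions.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def es_is_up(url: str = "http://localhost:9200", timeout: float = 2.0) -> bool:
    """Return True if a GET on the Elasticsearch root URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or an HTTP error status such as 401.
        return False


def wait_until(probe: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_until(es_is_up)` returns `True` as soon as the node responds and `False` if it never does within the allotted attempts.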
Issue: Model download fails
Solution: Check your internet connection; the models are downloaded from Hugging Face/ModelScope on first use.
# Pre-download models (optional)
python -c "from embeddings import BgeEncoder; BgeEncoder()"
python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"
Issue: Out of memory during embedding generation
Solution: Reduce the batch size (--batch-size) or skip embeddings initially (--skip-embeddings)
python ingest_customer1.py --skip-embeddings --limit 1000
Issue: Translation not working
Solution: Set the DeepL API key. Without it, translation runs in mock mode and returns the original text.
export DEEPL_API_KEY="your-key"
Testing
Test Health
curl http://localhost:8000/admin/health
Test Configuration
curl http://localhost:8000/admin/config
Test Index Stats
curl http://localhost:8000/admin/stats
Test Search
# Chinese query (auto-translates to English/Russian)
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "消防套", "size": 5}'
# English query
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "fire control set", "size": 5}'
# Russian query
curl -X POST http://localhost:8000/search/ \
-H "Content-Type: application/json" \
-d '{"query": "Наборы для пожаротушения", "size": 5}'
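The same requests can be issued from Python, which is convenient for scripted smoke tests. The sketch below mirrors the `curl` calls above using only the standard library. The helper names are hypothetical, and the response is returned as parsed JSON without assuming any particular shape, since the response schema is not described here.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_search_request(
    query: str,
    size: int = 10,
    filters: Optional[Dict[str, Any]] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST /search/ request with the same JSON body as the curl examples."""
    payload: Dict[str, Any] = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return urllib.request.Request(
        url=f"{base_url}/search/",
        # ensure_ascii=False keeps Chinese and Russian queries readable on the wire.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, size: int = 10, filters: Optional[Dict[str, Any]] = None,
           base_url: str = "http://localhost:8000") -> Any:
    """Send the request and return the decoded JSON response (assumed to be JSON)."""
    req = build_search_request(query, size=size, filters=filters, base_url=base_url)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the API service running, `search("fire control set", size=5)` sends the same request as the English example above.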
What's Next?
- Customize Configuration: Edit config/schema/customer1_config.yaml
- Add More Data: Ingest your own product data
- Tune Ranking: Adjust ranking expression in config
- Add Rewrite Rules: Update via API /admin/rewrite-rules
- Monitor Performance: Check /admin/stats endpoint
- Scale: Deploy to production with proper ES cluster
Architecture Quick Reference
Query Flow:
User Query → QueryParser (normalize, rewrite, translate, embed)
→ Searcher (boolean parse, build ES query)
→ Elasticsearch (BM25 + KNN)
→ RankingEngine (custom scoring)
→ Results
Indexing Flow:
CSV Data → DataTransformer (field mapping, embeddings)
→ BulkIndexer (batch processing)
→ Elasticsearch
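The "batch processing" step of the indexing flow above can be pictured with the short sketch below. It groups documents into fixed-size batches and serializes each batch into the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint expects. This is not the project's `BulkIndexer`: the function names, the `id` field, and the index name in the usage note are assumptions.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batched(docs: Iterable[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most batch_size documents, preserving order."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def to_bulk_body(batch: List[Dict[str, Any]], index: str, id_field: str = "id") -> str:
    """Serialize one batch as _bulk NDJSON: an action line followed by the document."""
    lines = []
    for doc in batch:
        action = {"index": {"_index": index, "_id": str(doc[id_field])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"
```

Each string from `to_bulk_body(batch, index="my_index")` could then be POSTed to `/_bulk` with `Content-Type: application/x-ndjson`. Smaller batches lower peak memory at the cost of more round trips.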
Support
For issues or questions, refer to:
- README.md: Comprehensive documentation
- IMPLEMENTATION_SUMMARY.md: Technical details
- CLAUDE.md: Development guidelines
- API Docs: http://localhost:8000/docs