# Quick Start Guide

## Prerequisites

1. **Python 3.8+**
2. **Elasticsearch 8.x** (running on localhost:9200 or a remote host)
3. **Optional**: CUDA-enabled GPU for faster embeddings

## Installation

### 1. Install Dependencies

```bash
cd /data/tw/SearchEngine
pip install -r requirements.txt
```

### 2. Set Environment Variables (Optional)

```bash
# Elasticsearch
export ES_HOST="http://localhost:9200"

# DeepL API (for translation)
export DEEPL_API_KEY="your-api-key-here"

# Customer ID
export CUSTOMER_ID="customer1"
```

## Running the System

### Option 1: Quick Test (Without Full Data)

```bash
# 1. Start Elasticsearch (if not running)
#    Elasticsearch 8.x enables security (HTTPS + auth) by default; it is disabled
#    here so the plain http://localhost:9200 URLs in this guide work. Local testing only.
docker run -d -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.11.0

# 2. Ingest sample data (100 documents for a quick test)
cd data/customer1
python ingest_customer1.py \
  --limit 100 \
  --recreate-index \
  --es-host http://localhost:9200 \
  --skip-embeddings

# 3. Start API service
cd ../..
python -m api.app --host 0.0.0.0 --port 6002

# 4. Test search
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "消防", "size": 5}'
```

### Option 2: Full System with Embeddings

```bash
# 1. Start Elasticsearch (security disabled for local testing, as in Option 1)
docker run -d -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" \
  elasticsearch:8.11.0

# 2. Ingest full dataset with embeddings (requires GPU, takes ~10-30 min)
cd data/customer1
python ingest_customer1.py \
  --csv goods_with_pic.5years_congku.csv.shuf.1w \
  --recreate-index \
  --batch-size 100 \
  --es-host http://localhost:9200

# 3. Start API service
cd ../..
python -m api.app \
  --host 0.0.0.0 \
  --port 6002 \
  --customer customer1 \
  --es-host http://localhost:9200

# 4. Test various searches

# Simple search
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "芭比娃娃", "size": 10}'

# Boolean search
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "toy AND (barbie OR doll)", "size": 10}'

# With filters
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
```

## API Documentation

Once the service is running, visit:

- **Swagger UI**: http://localhost:6002/docs
- **ReDoc**: http://localhost:6002/redoc

## Common Issues

### Issue: Elasticsearch connection failed

**Solution**: Ensure Elasticsearch is running and reachable:

```bash
curl http://localhost:9200
```

### Issue: Model download fails

**Solution**: Check your internet connection; models are downloaded from Hugging Face/ModelScope.

```bash
# Pre-download models (optional)
python -c "from embeddings import BgeEncoder; BgeEncoder()"
python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"
```

### Issue: Out of memory during embedding generation

**Solution**: Reduce the batch size (`--batch-size`) or skip embeddings initially:

```bash
# Run from data/customer1
python ingest_customer1.py --skip-embeddings --limit 1000
```

### Issue: Translation not working

**Solution**: Set the DeepL API key. Without it, translation falls back to mock mode, which returns the original text.

```bash
export DEEPL_API_KEY="your-key"
```

## Testing

### Test Health

```bash
curl http://localhost:6002/admin/health
```

### Test Configuration

```bash
curl http://localhost:6002/admin/config
```

### Test Index Stats

```bash
curl http://localhost:6002/admin/stats
```

### Test Search

```bash
# Chinese query (auto-translates to English/Russian)
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "消防套", "size": 5}'

# English query
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "fire control set", "size": 5}'

# Russian query
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "Наборы для пожаротушения", "size": 5}'
```

## What's Next?

1. **Customize Configuration**: Edit `config/schema/customer1_config.yaml`
2. **Add More Data**: Ingest your own product data
3. **Tune Ranking**: Adjust the ranking expression in the config
4. **Add Rewrite Rules**: Update via the API at `/admin/rewrite-rules`
5. **Monitor Performance**: Check the `/admin/stats` endpoint
6. **Scale**: Deploy to production with a properly sized ES cluster

## Architecture Quick Reference

```
Query Flow:
User Query → QueryParser (normalize, rewrite, translate, embed)
           → Searcher (boolean parse, build ES query)
           → Elasticsearch (BM25 + KNN)
           → RankingEngine (custom scoring)
           → Results

Indexing Flow:
CSV Data → DataTransformer (field mapping, embeddings)
         → BulkIndexer (batch processing)
         → Elasticsearch
```

## Support

For issues or questions, refer to:

- **README.md**: Comprehensive documentation
- **IMPLEMENTATION_SUMMARY.md**: Technical details
- **CLAUDE.md**: Development guidelines
- **API Docs**: http://localhost:6002/docs