Quick Start Guide
Prerequisites

Python 3.8+
Elasticsearch 8.x (running on localhost:9200 or on a remote host)
Optional: CUDA-enabled GPU for faster embeddings

Installation
1. Install Dependencies
cd /data/tw/SearchEngine
pip install -r requirements.txt

2. Set Environment Variables (Optional)
# Elasticsearch
export ES_HOST="http://localhost:9200"

# DeepL API (for translation)
export DEEPL_API_KEY="your-api-key-here"

# Customer ID
export CUSTOMER_ID="customer1"
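
For reference, here is a minimal sketch of how a Python process could read these three variables. The `Settings` class, the `load_settings` function, and the fallback values are assumptions made for illustration; the project's own configuration loader may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    es_host: str
    deepl_api_key: Optional[str]
    customer_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the three variables above, falling back to assumed defaults."""
    return Settings(
        # Assumed default: the local Elasticsearch address used in this guide.
        es_host=env.get("ES_HOST", "http://localhost:9200"),
        # A missing or empty key is treated as "no key configured".
        deepl_api_key=env.get("DEEPL_API_KEY") or None,
        # Assumed default: the sample customer used in this guide.
        customer_id=env.get("CUSTOMER_ID", "customer1"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.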

Running the System
Option 1: Quick Test (Without Full Data)
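# 1. Start Elasticsearch (if not running)
# Elasticsearch 8.x enables TLS and authentication by default; they are
# disabled here so that plain http://localhost:9200 works (local testing only).
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0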

# 2. Ingest sample data (100 documents for quick test)
cd data/customer1
python ingest_customer1.py \
  --limit 100 \
  --recreate-index \
  --es-host http://localhost:9200 \
  --skip-embeddings

# 3. Start API service
cd ../..
python -m api.app --host 0.0.0.0 --port 8000

# 4. Test search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "消防", "size": 5}'

Option 2: Full System with Embeddings
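# 1. Start Elasticsearch with a 4 GB heap
# Elasticsearch 8.x enables TLS and authentication by default; they are
# disabled here so that plain http://localhost:9200 works (local testing only).
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" elasticsearch:8.11.0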

# 2. Ingest full dataset with embeddings (GPU recommended; takes ~10-30 min)
cd data/customer1
python ingest_customer1.py \
  --csv goods_with_pic.5years_congku.csv.shuf.1w \
  --recreate-index \
  --batch-size 100 \
  --es-host http://localhost:9200

# 3. Start API service
cd ../..
python -m api.app \
  --host 0.0.0.0 \
  --port 8000 \
  --customer customer1 \
  --es-host http://localhost:9200

# 4. Test various searches
# Simple search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "芭比娃娃", "size": 10}'

# Boolean search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "toy AND (barbie OR doll)", "size": 10}'

# With filters
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
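
The same requests can be issued from Python using only the standard library. This is a sketch, not the project's client: `build_search_request` and `search` are hypothetical helper names, and the response is returned as decoded JSON without assuming anything about its schema.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_search_request(
    query: str,
    size: int = 10,
    filters: Optional[Dict[str, Any]] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the same POST /search/ request as the curl commands above."""
    payload: Dict[str, Any] = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return urllib.request.Request(
        url=f"{base_url}/search/",
        # ensure_ascii=False keeps Chinese and Russian text readable in the body.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, **kwargs: Any) -> Any:
    """Send the request and decode the JSON response (schema not assumed)."""
    request = build_search_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `search("娃娃", size=10, filters={"categoryName_keyword": "芭比"})` sends the same body as the filtered curl command, provided the API service is running.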

API Documentation
Once the service is running, visit:


Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Common Issues
Issue: Elasticsearch connection failed
Solution: Ensure Elasticsearch is running and accessible
curl http://localhost:9200
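
The same check can be scripted, for example to wait for Elasticsearch before starting ingestion. The function below is a sketch with a hypothetical name (`es_is_reachable`); the `opener` parameter exists only so the check can be tested without a live server.

```python
import urllib.error
import urllib.request
from typing import Callable


def es_is_reachable(
    url: str = "http://localhost:9200",
    timeout: float = 3.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True if an HTTP GET on the Elasticsearch root URL answers 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, DNS failures, timeouts, and HTTP errors
        # such as 401 when authentication is enabled.
        return False
```

A `False` result against an Elasticsearch 8.x container usually means either that the container is still starting or that security is enabled and the server expects HTTPS with credentials.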

Issue: Model download fails
Solution: Check your internet connection; the models are downloaded from Hugging Face or ModelScope.
# Pre-download models (optional)
python -c "from embeddings import BgeEncoder; BgeEncoder()"
python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"

Issue: Out of memory during embedding generation
Solution: Reduce the batch size with --batch-size, or skip embeddings on the first run
python ingest_customer1.py --skip-embeddings --limit 1000

Issue: Translation not working
Solution: Set the DeepL API key. Without it, translation runs in mock mode and returns the original text unchanged.
export DEEPL_API_KEY="your-key"
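
The mock-mode behaviour described above amounts to a pass-through when no key is present. The sketch below illustrates that idea only; `translate` and its `backend` parameter are hypothetical and do not reproduce the project's translator or the DeepL client API.

```python
import os
from typing import Callable, Optional


def translate(
    text: str,
    target_lang: str,
    api_key: Optional[str] = None,
    backend: Optional[Callable[[str, str, str], str]] = None,
) -> str:
    """Translate text, or return it unchanged when no API key is configured."""
    api_key = api_key or os.environ.get("DEEPL_API_KEY")
    if not api_key or backend is None:
        # Mock mode: no key (or no backend wired in), so pass the text through.
        return text
    # backend is any callable taking (text, target_lang, api_key).
    return backend(text, target_lang, api_key)
```

If search results for a Chinese query contain no English or Russian matches, an unset key and this kind of silent pass-through is the first thing to rule out.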

Testing
Test Health
curl http://localhost:8000/admin/health

Test Configuration
curl http://localhost:8000/admin/config

Test Index Stats
curl http://localhost:8000/admin/stats

Test Search
# Chinese query (auto-translates to English/Russian)
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "消防套", "size": 5}'

# English query
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "fire control set", "size": 5}'

# Russian query
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "Наборы для пожаротушения", "size": 5}'

What's Next?

Customize Configuration: Edit config/schema/customer1_config.yaml
Add More Data: Ingest your own product data
Tune Ranking: Adjust the ranking expression in the config
Add Rewrite Rules: Update them through the /admin/rewrite-rules API endpoint
Monitor Performance: Check the /admin/stats endpoint
Scale: Deploy to production on a production-grade Elasticsearch cluster

Architecture Quick Reference
Query Flow:
User Query → QueryParser (normalize, rewrite, translate, embed)
         → Searcher (boolean parse, build ES query)
         → Elasticsearch (BM25 + KNN)
         → RankingEngine (custom scoring)
         → Results
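
To make the "boolean parse, build ES query" step concrete, here is a small sketch that turns an expression such as `toy AND (barbie OR doll)` into an Elasticsearch `bool` query. It is not the project's Searcher: the function name, the target field `name`, and the rule that AND binds tighter than OR are all assumptions, and the KNN side of the query is left out.

```python
import re
from typing import Any, Dict, List

TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")
RESERVED = ("", "(", ")", "AND", "OR")


def parse_boolean_query(text: str, field: str = "name") -> Dict[str, Any]:
    """Translate 'a AND (b OR c)' into an Elasticsearch bool query dict."""
    tokens: List[str] = TOKEN_RE.findall(text)
    pos = 0

    def peek() -> str:
        return tokens[pos] if pos < len(tokens) else ""

    def take() -> str:
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or() -> Dict[str, Any]:
        clauses = [parse_and()]
        while peek() == "OR":
            take()
            clauses.append(parse_and())
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"should": clauses, "minimum_should_match": 1}}

    def parse_and() -> Dict[str, Any]:
        clauses = [parse_atom()]
        while peek() == "AND":
            take()
            clauses.append(parse_atom())
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"must": clauses}}

    def parse_atom() -> Dict[str, Any]:
        token = take()
        if token == "(":
            inner = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return inner
        if token in RESERVED:
            raise ValueError(f"unexpected token: {token!r}")
        # Adjacent plain words are kept together as one match clause.
        words = [token]
        while peek() not in RESERVED:
            words.append(take())
        return {"match": {field: " ".join(words)}}

    query = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {peek()!r}")
    return query
```

AND maps to `must` and OR maps to `should` with `minimum_should_match: 1`, so at least one alternative has to match; a query without operators collapses to a single `match` clause.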

Indexing Flow:
CSV Data → DataTransformer (field mapping, embeddings)
        → BulkIndexer (batch processing)
        → Elasticsearch
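
The batch-processing step can be pictured as grouping transformed rows and rendering each group in the newline-delimited format that the Elasticsearch `_bulk` API accepts. The helpers below are a sketch under stated assumptions: `batched`, `to_bulk_body`, the `id` field, and the index name used in the example are hypothetical, not the project's BulkIndexer.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def batched(rows: Iterable[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    batch: List[Dict[str, Any]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_bulk_body(index: str, docs: List[Dict[str, Any]], id_field: str = "id") -> str:
    """Render docs as the newline-delimited JSON expected by the _bulk API."""
    lines: List[str] = []
    for doc in docs:
        # Action line first, then the document source on the following line.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc[id_field])}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

Each body would be sent with `Content-Type: application/x-ndjson`. Smaller batches lower peak memory at the cost of more round trips, which is the trade-off behind the `--batch-size` flag.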

Support
For issues or questions, refer to:


README.md: Comprehensive documentation
IMPLEMENTATION_SUMMARY.md: Technical details
CLAUDE.md: Development guidelines
API Docs: http://localhost:8000/docs