Quick Start Guide

Prerequisites

  1. Python 3.8+
  2. Elasticsearch 8.x (running on localhost:9200 or on a remote host)
  3. Optional: CUDA-enabled GPU for faster embeddings

Installation

1. Install Dependencies

cd /data/tw/SearchEngine
pip install -r requirements.txt

2. Set Environment Variables (Optional)

# Elasticsearch
export ES_HOST="http://localhost:9200"

# DeepL API (for translation)
export DEEPL_API_KEY="your-api-key-here"

# Customer ID
export CUSTOMER_ID="customer1"
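
For reference, here is a minimal sketch of how a Python process could read these three variables. The `Settings` class and `load_settings` function are hypothetical names, not part of the project, and the fallback values are assumptions based on the examples above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three environment variables above."""
    es_host: str
    deepl_api_key: Optional[str]
    customer_id: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables from `env`, falling back to assumed defaults."""
    return Settings(
        es_host=env.get("ES_HOST", "http://localhost:9200"),
        # A missing or empty key is treated as "no key configured".
        deepl_api_key=env.get("DEEPL_API_KEY") or None,
        customer_id=env.get("CUSTOMER_ID", "customer1"),
    )
```

Passing the environment as a parameter keeps the function easy to test, for example `load_settings({"CUSTOMER_ID": "demo"})`.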

Running the System

Option 1: Quick Test (Without Full Data)

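# 1. Start Elasticsearch (if not running)
# Elasticsearch 8.x images enable TLS and authentication by default, so security
# is switched off here to make the plain-HTTP URL below work (local testing only).
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.11.0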

# 2. Ingest sample data (100 documents for quick test)
cd data/customer1
python ingest_customer1.py \
  --limit 100 \
  --recreate-index \
  --es-host http://localhost:9200 \
  --skip-embeddings

# 3. Start API service
cd ../..
python -m api.app --host 0.0.0.0 --port 8000

# 4. Test search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "消防", "size": 5}'
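
The same search request can also be sent from Python. The sketch below uses only the standard library and mirrors the curl call above. `build_search_request` and `search` are hypothetical helper names, and the optional `filters` argument assumes the request shape used by the filtered search shown in Option 2.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_search_request(
    query: str,
    size: int = 10,
    filters: Optional[Dict[str, Any]] = None,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST /search/ request that the curl command sends."""
    payload: Dict[str, Any] = {"query": query, "size": size}
    if filters:
        payload["filters"] = filters
    return urllib.request.Request(
        url=f"{base_url}/search/",
        # ensure_ascii=False keeps Chinese and Russian queries readable in the body.
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, **kwargs: Any) -> Any:
    """Send the request and decode the JSON response (needs a running API service)."""
    request = build_search_request(query, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API service running, `search("消防", size=5)` should return the decoded JSON response. Its exact structure depends on the service and is not assumed here.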

Option 2: Full System with Embeddings

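# 1. Start Elasticsearch
# Security is disabled so the plain-HTTP URL used below works (local testing only).
docker run -d -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" elasticsearch:8.11.0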

# 2. Ingest full dataset with embeddings (requires GPU, takes ~10-30 min)
cd data/customer1
python ingest_customer1.py \
  --csv goods_with_pic.5years_congku.csv.shuf.1w \
  --recreate-index \
  --batch-size 100 \
  --es-host http://localhost:9200

# 3. Start API service
cd ../..
python -m api.app \
  --host 0.0.0.0 \
  --port 8000 \
  --customer customer1 \
  --es-host http://localhost:9200

# 4. Test various searches
# Simple search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "芭比娃娃", "size": 10}'

# Boolean search
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "toy AND (barbie OR doll)", "size": 10}'

# With filters
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "娃娃", "size": 10, "filters": {"categoryName_keyword": "芭比"}}'
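
To illustrate how a boolean query such as `toy AND (barbie OR doll)` can map onto an Elasticsearch `bool` query, here is a small recursive-descent parser. It is an illustrative sketch, not the project's parser. The precedence (AND binds tighter than OR), the uppercase-only operators, and the `name` field are all assumptions.

```python
import re
from typing import Any, Dict, List, Optional

# Parentheses are their own tokens; everything else is split on whitespace.
TOKEN_RE = re.compile(r"[()]|[^\s()]+")
RESERVED = {"AND", "OR", "(", ")"}


class BooleanQueryParser:
    """Recursive descent: OR binds loosest, AND tighter, parentheses group."""

    def __init__(self, query: str, field: str = "name") -> None:
        self.tokens: List[str] = TOKEN_RE.findall(query)
        self.pos = 0
        self.field = field  # hypothetical text field name

    def _peek(self) -> Optional[str]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def _next(self) -> str:
        token = self._peek()
        if token is None:
            raise ValueError("unexpected end of query")
        self.pos += 1
        return token

    def parse(self) -> Dict[str, Any]:
        node = self._parse_or()
        if self._peek() is not None:
            raise ValueError(f"unexpected token {self._peek()!r}")
        return node

    def _parse_or(self) -> Dict[str, Any]:
        clauses = [self._parse_and()]
        while self._peek() == "OR":
            self._next()
            clauses.append(self._parse_and())
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"should": clauses, "minimum_should_match": 1}}

    def _parse_and(self) -> Dict[str, Any]:
        clauses = [self._parse_atom()]
        while self._peek() == "AND":
            self._next()
            clauses.append(self._parse_atom())
        if len(clauses) == 1:
            return clauses[0]
        return {"bool": {"must": clauses}}

    def _parse_atom(self) -> Dict[str, Any]:
        token = self._next()
        if token == "(":
            node = self._parse_or()
            if self._next() != ")":
                raise ValueError("expected ')'")
            return node
        if token in RESERVED:
            raise ValueError(f"unexpected token {token!r}")
        # Adjacent plain words are joined into one full-text match clause.
        words = [token]
        while self._peek() is not None and self._peek() not in RESERVED:
            words.append(self._next())
        return {"match": {self.field: " ".join(words)}}


def parse_boolean_query(query: str, field: str = "name") -> Dict[str, Any]:
    """Convenience wrapper around BooleanQueryParser."""
    return BooleanQueryParser(query, field).parse()
```

For the example query, the result is a `must` list holding a `match` on `toy` and a nested `should` over `barbie` and `doll`. A query without operators, such as `fire control set`, becomes a single `match` clause.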

API Documentation

Once the service is running, visit http://localhost:8000/docs for the API documentation.

Common Issues

Issue: Elasticsearch connection failed

Solution: Ensure Elasticsearch is running and reachable at the configured host:

curl http://localhost:9200
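
The same check can be scripted, for example before starting ingestion. This is a minimal sketch using only the standard library; `es_is_reachable` is a hypothetical helper name and the default URL is the one assumed throughout this guide.

```python
import urllib.error
import urllib.request


def es_is_reachable(es_host: str = "http://localhost:9200", timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET on the Elasticsearch root URL answers with 200."""
    try:
        with urllib.request.urlopen(es_host, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Covers connection refused, DNS failure, timeout, malformed URLs, and
        # HTTP error statuses such as 401 when security is enabled.
        return False
```

A `False` result only says the plain request failed. Running the curl command above usually shows the underlying reason.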

Issue: Model download fails

Solution: Check your internet connection. Models are downloaded from Hugging Face/ModelScope.

# Pre-download models (optional)
python -c "from embeddings import BgeEncoder; BgeEncoder()"
python -c "from embeddings import CLIPImageEncoder; CLIPImageEncoder()"

Issue: Out of memory during embedding generation

Solution: Reduce the batch size (--batch-size) or skip embeddings initially:

python ingest_customer1.py --skip-embeddings --limit 1000
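
A smaller batch size lowers peak memory, assuming the encoder processes one batch at a time. The sketch below shows that batching pattern in isolation. `batched` is a hypothetical helper and not necessarily how the ingest script is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls at most batch_size items, so only one batch is in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

An embedding loop would then call the encoder once per batch, for example `for texts in batched(all_texts, 32): ...`, where `all_texts` stands for whatever iterable of strings is being embedded.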

Issue: Translation not working

Solution: Set the DeepL API key. Without it, translation runs in mock mode and returns the original text.

export DEEPL_API_KEY="your-key"
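
The mock-mode behaviour described above can be pictured as a simple fallback. The following is a sketch under assumptions: `make_translator` and its parameters are hypothetical, and the real DeepL call is passed in as a function rather than shown.

```python
from typing import Callable, Optional

# A translator takes (text, target_language) and returns translated text.
Translator = Callable[[str, str], str]


def make_translator(api_key: Optional[str], real_translate: Translator) -> Translator:
    """Return the real translator when a key is set, else a mock that echoes the input."""
    if api_key:
        return real_translate

    def mock_translate(text: str, target_lang: str) -> str:
        # Mock mode: no API call is made and the original text comes back unchanged.
        return text

    return mock_translate
```

If Chinese queries return the same results whether or not translation is expected, this kind of silent fallback is a likely cause, so check that `DEEPL_API_KEY` is set in the environment of the API process.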

Testing

Test Health

curl http://localhost:8000/admin/health

Test Configuration

curl http://localhost:8000/admin/config

Test Index Stats

curl http://localhost:8000/admin/stats

Test Search

# Chinese query (auto-translates to English/Russian)
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "消防套", "size": 5}'

# English query
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "fire control set", "size": 5}'

# Russian query
curl -X POST http://localhost:8000/search/ \
  -H "Content-Type: application/json" \
  -d '{"query": "Наборы для пожаротушения", "size": 5}'

What's Next?

  1. Customize Configuration: Edit config/schema/customer1_config.yaml
  2. Add More Data: Ingest your own product data
  3. Tune Ranking: Adjust the ranking expression in the config
  4. Add Rewrite Rules: Update them via the /admin/rewrite-rules API endpoint
  5. Monitor Performance: Check the /admin/stats endpoint
  6. Scale: Deploy to production with a properly sized Elasticsearch cluster

Architecture Quick Reference

Query Flow:
User Query → QueryParser (normalize, rewrite, translate, embed)
         → Searcher (boolean parse, build ES query)
         → Elasticsearch (BM25 + KNN)
         → RankingEngine (custom scoring)
         → Results
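
As an illustration of the "Elasticsearch (BM25 + KNN)" step, the sketch below builds a search body that combines a full-text `match` query with the top-level `knn` option available in Elasticsearch 8.x. The field names `name` and `name_embedding` are hypothetical, and the project's Searcher may build its query differently.

```python
from typing import Any, Dict, List


def build_hybrid_query(
    text: str,
    query_vector: List[float],
    text_field: str = "name",              # hypothetical text field
    vector_field: str = "name_embedding",  # hypothetical dense_vector field
    size: int = 10,
    k: int = 10,
    num_candidates: int = 100,
) -> Dict[str, Any]:
    """Build an Elasticsearch 8.x search body with a BM25 match plus approximate kNN."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    return {
        "size": size,
        # Lexical part, scored with BM25 by default.
        "query": {"match": {text_field: text}},
        # Vector part: nearest neighbours of the query embedding.
        "knn": {
            "field": vector_field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
    }
```

When both parts are present, Elasticsearch returns documents matched by either one and combines their scores, which the RankingEngine step can then adjust.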

Indexing Flow:
CSV Data → DataTransformer (field mapping, embeddings)
        → BulkIndexer (batch processing)
        → Elasticsearch
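
The indexing flow can be sketched in a few lines: read CSV rows, map each row to a document, and wrap it in a bulk action. The column names, the `transform_row` mapping, and the index name below are hypothetical. The action layout (`_index`, `_id`, `_source`) is the one accepted by `elasticsearch.helpers.bulk`.

```python
import csv
import io
from typing import Any, Dict, Iterable, Iterator


def transform_row(row: Dict[str, str]) -> Dict[str, Any]:
    """Map one CSV row to an index document (hypothetical columns)."""
    return {"id": row["id"], "name": row["name"].strip()}


def to_bulk_actions(rows: Iterable[Dict[str, str]], index_name: str) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per row, ready to be passed to a bulk helper."""
    for row in rows:
        doc = transform_row(row)
        yield {"_index": index_name, "_id": doc["id"], "_source": doc}


# Tiny in-memory CSV standing in for the real data file.
SAMPLE_CSV = "id,name\n1, 芭比娃娃 \n2,fire control set\n"
actions = list(to_bulk_actions(csv.DictReader(io.StringIO(SAMPLE_CSV)), "customer1"))
```

Because `to_bulk_actions` is a generator, a large file can be streamed through it without loading every row into memory at once.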

Support

For issues or questions, refer to:

  • README.md: Comprehensive documentation
  • IMPLEMENTATION_SUMMARY.md: Technical details
  • CLAUDE.md: Development guidelines
  • API Docs: http://localhost:8000/docs