
Search Evaluation Framework

This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.

It is designed around a few core rules:

  • Annotation should be built offline first.
  • Single-query evaluation should then map recalled spu_id values to the cached annotation set.
  • Recalled items without cached labels are treated as Irrelevant during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
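
In code terms, the mapping rule looks roughly like this (a minimal sketch; the function name and dict-based cache are illustrative, not the actual eval_framework.py API):

```python
def evaluate_recall(recalled_spu_ids, cached_labels):
    """Map each recalled spu_id to its cached label, defaulting to Irrelevant.

    cached_labels: dict mapping spu_id -> "Exact" | "Partial" | "Irrelevant".
    Returns per-item labels plus a coverage tip when labels are missing.
    """
    labels = [cached_labels.get(spu, "Irrelevant") for spu in recalled_spu_ids]
    missing = [spu for spu in recalled_spu_ids if spu not in cached_labels]
    tip = None
    if missing:
        tip = f"{len(missing)} recalled items have no cached label and were scored as Irrelevant"
    return labels, tip

labels, tip = evaluate_recall(
    ["spu1", "spu2", "spu3"],
    {"spu1": "Exact", "spu3": "Partial"},
)
# labels == ["Exact", "Irrelevant", "Partial"]; tip flags the uncovered item
```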

Goals

The framework supports four related tasks:

  1. Build an annotation set for a fixed query set.
  2. Evaluate a live search result list against that annotation set.
  3. Run batch evaluation and store historical reports with config snapshots.
  4. Tune fusion parameters reproducibly.

Files

  • eval_framework.py Search evaluation core implementation, CLI, FastAPI app, SQLite store, audit logic, and report generation.
  • build_annotation_set.py Thin CLI entrypoint into eval_framework.py.
  • serve_eval_web.py Thin web entrypoint into eval_framework.py.
  • tune_fusion.py Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
  • fusion_experiments_shortlist.json A compact experiment set for practical tuning.
  • fusion_experiments_round1.json A broader first-round experiment set.
  • queries/queries.txt The canonical evaluation query set.
  • README_Requirement.md Requirement reference document.
  • quick_start_eval.sh Optional wrapper: batch (fill missing labels only), batch-rebuild (force full re-label), or serve (web UI).
  • ../start_eval_web.sh Same as serve, but loads activate.sh first. Used by ./scripts/service_ctl.sh start eval-web (port EVAL_WEB_PORT, default 6010); ./run.sh all starts eval-web alongside the other core services.

Quick start (from repo root)

Set the tenant if needed (export TENANT_ID=163). Requires the live search API, a working backend, and a DashScope key whenever the batch step needs new LLM labels.

# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
./scripts/evaluation/quick_start_eval.sh serve
# Or: ./scripts/service_ctl.sh start eval-web

Equivalent explicit commands:

# Safe default: no --force-refresh-labels
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple \
  --force-refresh-labels

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010

Batch behavior: There is no “skip queries already processed”. Each run walks the full queries file. With --force-refresh-labels, for every query the runner issues a live search and sends all top_k returned spu_ids through the LLM again (SQLite rows are upserted). Omit --force-refresh-labels if you only want to fill in labels that are missing for the current recall window.
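
The label-selection rule can be sketched as follows (illustrative only; select_pairs_to_label is a hypothetical helper, not part of the real CLI):

```python
def select_pairs_to_label(query, recalled_spu_ids, cached, force_refresh=False):
    """Decide which (query, spu_id) pairs must be sent to the LLM.

    cached: set of spu_ids that already have a label for this query.
    With force_refresh=True every recalled spu_id is re-labeled (rows are
    later upserted); otherwise only spu_ids missing a cached label are sent.
    """
    if force_refresh:
        return [(query, spu) for spu in recalled_spu_ids]
    return [(query, spu) for spu in recalled_spu_ids if spu not in cached]

pairs = select_pairs_to_label("red dress", ["a", "b", "c"], cached={"b"})
# -> [("red dress", "a"), ("red dress", "c")]
```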

Storage Layout

All generated artifacts are under:

  • /data/saas-search/artifacts/search_evaluation

Important subpaths:

  • /data/saas-search/artifacts/search_evaluation/search_eval.sqlite3 Main cache and annotation store.
  • /data/saas-search/artifacts/search_evaluation/query_builds Per-query pooled annotation-set build artifacts.
  • /data/saas-search/artifacts/search_evaluation/batch_reports Batch evaluation JSON, Markdown reports, and config snapshots.
  • /data/saas-search/artifacts/search_evaluation/audits Audit summaries for label quality checks.
  • /data/saas-search/artifacts/search_evaluation/tuning_runs Fusion experiment summaries and per-experiment config snapshots.

SQLite Schema Summary

The main tables in search_eval.sqlite3 are:

  • corpus_docs Cached product corpus for the tenant.
  • rerank_scores Cached full-corpus reranker scores keyed by (tenant_id, query_text, spu_id).
  • relevance_labels Cached LLM relevance labels keyed by (tenant_id, query_text, spu_id).
  • query_profiles Structured query-intent profiles extracted before labeling.
  • build_runs Per-query pooled-build records.
  • batch_runs Batch evaluation history.
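
The documented (tenant_id, query_text, spu_id) key and upsert behavior can be illustrated with a minimal sqlite3 sketch (any column beyond that key is an assumption, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real store lives in search_eval.sqlite3
conn.execute("""
    CREATE TABLE IF NOT EXISTS relevance_labels (
        tenant_id  TEXT NOT NULL,
        query_text TEXT NOT NULL,
        spu_id     TEXT NOT NULL,
        label      TEXT NOT NULL,   -- Exact | Partial | Irrelevant
        PRIMARY KEY (tenant_id, query_text, spu_id)
    )
""")
# Upsert: a re-label overwrites the previous row instead of duplicating it
conn.execute(
    "INSERT INTO relevance_labels VALUES (?, ?, ?, ?) "
    "ON CONFLICT(tenant_id, query_text, spu_id) DO UPDATE SET label = excluded.label",
    ("163", "red dress", "spu1", "Partial"),
)
```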

Label Semantics

Three labels are used throughout:

  • Exact Fully matches the intended product type and all explicit required attributes.
  • Partial Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
  • Irrelevant Product type mismatches, or explicit required attributes conflict.

The framework always uses:

  • LLM-based batched relevance classification
  • caching and retry logic for robust offline labeling

There are now two labeler modes:

  • simple Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
  • complex Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
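
Both modes share the caching and retry behavior. A retry wrapper along these lines (names illustrative, not the actual implementation) is the core of robust offline labeling:

```python
import time

def label_batch_with_retry(llm_call, batch, max_attempts=3, backoff_s=1.0):
    """Call the LLM once per batch, retrying transient failures with linear backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return llm_call(batch)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)
```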

Offline-First Workflow

1. Refresh labels for the evaluation query set

For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (P@5, P@10, P@20, P@50, MAP_3, MAP_2_3), a top_k=50 cached label set is sufficient.
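
For reference, P@k over cached labels is straightforward. The sketch below treats both Exact and Partial as relevant, which is an assumption for illustration, not necessarily how the framework defines its metrics:

```python
def precision_at_k(labels, k, relevant=("Exact", "Partial")):
    """P@k over the first k recalled labels.

    Which labels count as relevant is configurable here because the
    framework's exact definition is not restated in this README.
    """
    top = labels[:k]
    return sum(1 for label in top if label in relevant) / k

p5 = precision_at_k(["Exact", "Partial", "Irrelevant", "Exact", "Irrelevant"], 5)
# -> 0.6
```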

Example (fills missing labels only; recommended default):

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

To rebuild every label for the current top_k recall window (all queries, all hits re-sent to the LLM), add --force-refresh-labels or run ./scripts/evaluation/quick_start_eval.sh batch-rebuild.

This command does two things:

  • runs every query in the file against the live backend (no skip list)
  • with --force-refresh-labels, re-labels all top_k hits per query via the LLM and upserts SQLite; without the flag, only spu_ids lacking a cached label are sent to the LLM

After this step, single-query evaluation can run in cached mode without calling the LLM again.

2. Optional pooled build

The framework also supports a heavier pooled build that combines:

  • top search results
  • top full-corpus reranker results

Example:

./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 1000 \
  --rerank-depth 10000 \
  --annotate-search-top-k 100 \
  --annotate-rerank-top-k 120 \
  --language en

This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
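
Conceptually the pooled build is a de-duplicated union of the two candidate slices; a sketch (function name and signature are illustrative):

```python
def build_pool(search_hits, rerank_hits, annotate_search_top_k, annotate_rerank_top_k):
    """Union the top slices of both candidate sources, preserving first-seen order."""
    pool = []
    seen = set()
    for spu in search_hits[:annotate_search_top_k] + rerank_hits[:annotate_rerank_top_k]:
        if spu not in seen:
            seen.add(spu)
            pool.append(spu)
    return pool
```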

Why Single-Query Evaluation Was Slow

If single-query evaluation is slow, the usual reason is that it is still running with auto_annotate=true, which means:

  • perform live search
  • detect recalled but unlabeled products
  • call the LLM to label them

That is not the intended steady-state evaluation path.

The UI/API is now configured to prefer cached evaluation:

  • default single-query evaluation uses auto_annotate=false
  • unlabeled recalled results are treated as Irrelevant
  • the response includes tips explaining that coverage gap

If you want stable, fast evaluation:

  1. prebuild labels offline
  2. use cached single-query evaluation

Web UI

Start the evaluation UI:

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010

The UI provides:

  • query list loaded from queries.txt
  • single-query evaluation
  • batch evaluation
  • history of batch reports
  • top recalled results
  • missed Exact and Partial products that were not recalled
  • tips about unlabeled hits treated as Irrelevant

Single-query response behavior

For a single query:

  1. live search returns recalled spu_id values
  2. the framework looks up cached labels by (query, spu_id)
  3. unlabeled recalled items are counted as Irrelevant
  4. cached Exact and Partial products that were not recalled are listed under Missed Exact / Partial

This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
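
Step 4 above is a simple set difference over the cached labels, sketched here with an illustrative helper name:

```python
def missed_relevant(cached_labels, recalled_spu_ids):
    """Cached Exact/Partial products that the live search did not recall."""
    recalled = set(recalled_spu_ids)
    return {
        spu: label
        for spu, label in cached_labels.items()
        if label in ("Exact", "Partial") and spu not in recalled
    }
```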

CLI Commands

Build pooled annotation artifacts

./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...

Run batch evaluation

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

Use --force-refresh-labels if you want to rebuild the offline label cache for the recalled window first.

Audit annotation quality

./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

This checks cached labels against current guardrails and reports suspicious cases.

Batch Reports

Each batch run stores:

  • aggregate metrics
  • per-query metrics
  • label distribution
  • timestamp
  • config snapshot from /admin/config

Reports are written as:

  • Markdown for easy reading
  • JSON for downstream processing
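
Writing one run in both formats is straightforward; the sketch below assumes a report dict with timestamp and aggregate_metrics keys, which is an assumption about the real report shape:

```python
import json

def write_reports(report, json_path, md_path):
    """Persist one batch run as JSON (machine-readable) and Markdown (human-readable)."""
    with open(json_path, "w") as f:
        json.dump(report, f, indent=2)
    lines = [f"# Batch report {report['timestamp']}", ""]
    for name, value in report["aggregate_metrics"].items():
        lines.append(f"- {name}: {value:.4f}")
    with open(md_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```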

Fusion Tuning

The tuning runner applies experiment configs sequentially and records the outcome.

Example:

./.venv/bin/python scripts/evaluation/tune_fusion.py \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
  --score-metric MAP_3 \
  --apply-best

What it does:

  1. writes an experiment config into config/config.yaml
  2. restarts backend
  3. runs batch evaluation
  4. stores the per-experiment result
  5. optionally applies the best experiment at the end
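
The loop can be sketched as follows; the three callables stand in for the real config-write, restart, and batch-evaluation steps and are illustrative only:

```python
def run_experiments(experiments, apply_config, restart_backend, run_batch_eval,
                    score_metric="MAP_3"):
    """Apply each experiment config, re-evaluate, and track the best score."""
    best = None
    results = []
    for exp in experiments:
        apply_config(exp["config"])   # stand-in for writing config/config.yaml
        restart_backend()             # stand-in for the backend restart step
        metrics = run_batch_eval()    # stand-in for one batch evaluation
        results.append({"name": exp["name"], "metrics": metrics})
        if best is None or metrics[score_metric] > best["metrics"][score_metric]:
            best = results[-1]
    return results, best
```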

Current Practical Recommendation

For day-to-day evaluation:

  1. refresh the offline labels for the fixed query set with batch --force-refresh-labels
  2. run the web UI or normal batch evaluation in cached mode
  3. only force-refresh labels again when:
    • the query set changes
    • the product corpus changes materially
    • the labeling logic changes

Caveats

  • The current label cache is query-specific; it does not cover the full products × queries matrix.
  • Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
  • The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
  • Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.

Related Documents

  • README_Requirement.md
  • README_Requirement_zh.md

These documents describe the original problem statement. This README.md describes the implemented framework and the current recommended workflow.