# Search Evaluation Framework
This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
It is designed around one core rule:

- Annotation should be built offline first.
- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
## Goals
The framework supports four related tasks:
- Build an annotation set for a fixed query set.
- Evaluate a live search result list against that annotation set.
- Run batch evaluation and store historical reports with config snapshots.
- Tune fusion parameters reproducibly.
## Files

- `eval_framework.py`: Search evaluation core implementation, CLI, FastAPI app, SQLite store, audit logic, and report generation.
- `build_annotation_set.py`: Thin CLI entrypoint into `eval_framework.py`.
- `serve_eval_web.py`: Thin web entrypoint into `eval_framework.py`.
- `tune_fusion.py`: Fusion experiment runner. It applies config variants, restarts the backend, runs batch evaluation, and stores experiment reports.
- `fusion_experiments_shortlist.json`: A compact experiment set for practical tuning.
- `fusion_experiments_round1.json`: A broader first-round experiment set.
- `queries/queries.txt`: The canonical evaluation query set.
- `README_Requirement.md`: Requirement reference document.
- `quick_start_eval.sh`: Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
## Quick start (from repo root)
Set the tenant if needed (`export TENANT_ID=163`). This requires the live search API, a DashScope key when the batch step needs new LLM labels, and a working backend.
```bash
# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# 2) Evaluation UI on http://127.0.0.1:6010/
./scripts/evaluation/quick_start_eval.sh serve
```
Equivalent explicit commands:
```bash
# Safe default: no --force-refresh-labels
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple \
  --force-refresh-labels

# Evaluation web UI
./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```
**Batch behavior:** there is no "skip queries already processed" step. Each run walks the full queries file. With `--force-refresh-labels`, for every query the runner issues a live search and sends all `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
## Storage Layout
All generated artifacts are under:

```
/data/saas-search/artifacts/search_evaluation
```

Important subpaths:

- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`: Main cache and annotation store.
- `/data/saas-search/artifacts/search_evaluation/query_builds`: Per-query pooled annotation-set build artifacts.
- `/data/saas-search/artifacts/search_evaluation/batch_reports`: Batch evaluation JSON, Markdown reports, and config snapshots.
- `/data/saas-search/artifacts/search_evaluation/audits`: Audit summaries for label quality checks.
- `/data/saas-search/artifacts/search_evaluation/tuning_runs`: Fusion experiment summaries and per-experiment config snapshots.
## SQLite Schema Summary
The main tables in `search_eval.sqlite3` are:

- `corpus_docs`: Cached product corpus for the tenant.
- `rerank_scores`: Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
- `relevance_labels`: Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
- `query_profiles`: Structured query-intent profiles extracted before labeling.
- `build_runs`: Per-query pooled-build records.
- `batch_runs`: Batch evaluation history.
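As a sketch of how the `(tenant_id, query_text, spu_id)` key is used, a cached-label lookup for one query's recalled ids might look like the following (the column layout here is an assumption for illustration, not the framework's exact DDL):

```python
import sqlite3

# Illustrative table shape; the real relevance_labels DDL may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relevance_labels (tenant_id, query_text, spu_id, label)")
conn.executemany(
    "INSERT INTO relevance_labels VALUES (?, ?, ?, ?)",
    [(163, "red dress", "A", "Exact"), (163, "red dress", "B", "Partial")],
)

def cached_labels(conn, tenant_id, query_text, spu_ids):
    """Return {spu_id: label} for the recalled ids that have a cached label."""
    placeholders = ",".join("?" for _ in spu_ids)
    rows = conn.execute(
        "SELECT spu_id, label FROM relevance_labels "
        "WHERE tenant_id = ? AND query_text = ? "
        f"AND spu_id IN ({placeholders})",
        (tenant_id, query_text, *spu_ids),
    )
    return dict(rows)

labels = cached_labels(conn, 163, "red dress", ["A", "B", "C"])  # "C" is uncached
```

Ids without a cached row simply do not appear in the result, which is what lets evaluation default them to `Irrelevant`.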
## Label Semantics
Three labels are used throughout:
- `Exact`: Fully matches the intended product type and all explicit required attributes.
- `Partial`: Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
- `Irrelevant`: Product type mismatches, or explicit required attributes conflict.
The framework always uses:
- LLM-based batched relevance classification
- caching and retry logic for robust offline labeling
There are now two labeler modes:
- `simple`: Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
- `complex`: Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
## Offline-First Workflow
### 1. Refresh labels for the evaluation query set
For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
Example (fills missing labels only; recommended default):
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```
To rebuild every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
This command does two things:

- runs every query in the file against the live backend (no skip list)
- with `--force-refresh-labels`, re-labels all `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
After this step, single-query evaluation can run in cached mode without calling the LLM again.
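With labels cached, the ranking metrics reduce to pure list arithmetic. As a sketch (the exact label-to-relevance mapping behind `MAP_3` and `MAP_2_3` is not spelled out in this README, so the `relevant` set below is a caller-supplied assumption):

```python
def precision_at_k(labels, k, relevant=("Exact", "Partial")):
    """P@k over an ordered label list; which labels count as relevant is a
    caller choice (stand-in for whatever MAP_3 / MAP_2_3 actually use)."""
    return sum(1 for label in labels[:k] if label in relevant) / k

def average_precision(labels, relevant=("Exact", "Partial")):
    """AP over one ranked list: mean of P@i taken at each relevant rank i."""
    hits, total = 0, 0.0
    for i, label in enumerate(labels, start=1):
        if label in relevant:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

ranked = ["Exact", "Irrelevant", "Partial", "Irrelevant", "Exact"]
p5 = precision_at_k(ranked, 5)   # 3/5 = 0.6
ap = average_precision(ranked)   # (1/1 + 2/3 + 3/5) / 3
```

Averaging `average_precision` over every query in `queries.txt` gives a MAP-style aggregate.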
### 2. Optional pooled build
The framework also supports a heavier pooled build that combines:
- top search results
- top full-corpus reranker results
Example:
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 1000 \
  --rerank-depth 10000 \
  --annotate-search-top-k 100 \
  --annotate-rerank-top-k 120 \
  --language en
```
This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
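Conceptually, the pooled build is a union of two ranked candidate lists, deduplicated by `spu_id`. A minimal sketch (function name and list shapes are illustrative, not the framework's API):

```python
def pool_candidates(search_hits, rerank_hits, search_top_k, rerank_top_k):
    """Union the top search hits and top full-corpus reranker hits into one
    annotation pool, preserving first-seen order and dropping duplicates."""
    pool, seen = [], set()
    for spu_id in search_hits[:search_top_k] + rerank_hits[:rerank_top_k]:
        if spu_id not in seen:
            seen.add(spu_id)
            pool.append(spu_id)
    return pool

# "B" appears in both lists but is annotated only once.
pool = pool_candidates(["A", "B", "C"], ["B", "D"], search_top_k=2, rerank_top_k=2)
```

The pooled ids are then what gets sent for annotation, which is why this path is heavier than the live recall window.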
## Why Single-Query Evaluation Was Slow
If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
- perform live search
- detect recalled but unlabeled products
- call the LLM to label them
That is not the intended steady-state evaluation path.
The UI/API is now configured to prefer cached evaluation:

- default single-query evaluation uses `auto_annotate=false`
- unlabeled recalled results are treated as `Irrelevant`
- the response includes a tip explaining the coverage gap
If you want stable, fast evaluation:
- prebuild labels offline
- use cached single-query evaluation
## Web UI
Start the evaluation UI:
```bash
./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```
The UI provides:

- the query list loaded from `queries.txt`
- single-query evaluation
- batch evaluation
- history of batch reports
- top recalled results
- missed `Exact` and `Partial` products that were not recalled
- tips about unlabeled hits treated as `Irrelevant`
### Single-query response behavior
For a single query:

- live search returns recalled `spu_id` values
- the framework looks up cached labels by `(query, spu_id)`
- unlabeled recalled items are counted as `Irrelevant`
- cached `Exact` and `Partial` products that were not recalled are listed under **Missed Exact / Partial**
This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
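The steps above can be sketched in a few lines (purely illustrative; `evaluate_single_query` and its data shapes are not the framework's real API):

```python
def evaluate_single_query(recalled, cached):
    """recalled: ordered spu_ids from live search; cached: {spu_id: label}.

    Unlabeled recalled items default to Irrelevant; cached Exact/Partial
    products that were not recalled are reported as missed.
    """
    scored = [(spu_id, cached.get(spu_id, "Irrelevant")) for spu_id in recalled]
    unlabeled = [s for s in recalled if s not in cached]
    missed = [s for s, label in cached.items()
              if label in ("Exact", "Partial") and s not in recalled]
    return {"scored": scored, "missed": missed, "unlabeled_hits": unlabeled}

result = evaluate_single_query(
    recalled=["A", "B", "X"],            # "X" has no cached label
    cached={"A": "Exact", "B": "Partial", "C": "Exact"},  # "C" was not recalled
)
```

Here `X` is scored as `Irrelevant` and flagged as an unlabeled hit (the coverage tip), while `C` surfaces under Missed Exact / Partial.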
## CLI Commands
### Build pooled annotation artifacts
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
```
### Run batch evaluation
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```
Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
### Audit annotation quality
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```
This checks cached labels against current guardrails and reports suspicious cases.
## Batch Reports
Each batch run stores:
- aggregate metrics
- per-query metrics
- label distribution
- timestamp
- config snapshot from `/admin/config`
Reports are written as:
- Markdown for easy reading
- JSON for downstream processing
## Fusion Tuning
The tuning runner applies experiment configs sequentially and records the outcome.
Example:
```bash
./.venv/bin/python scripts/evaluation/tune_fusion.py \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
  --score-metric MAP_3 \
  --apply-best
```
What it does:

- writes an experiment config into `config/config.yaml`
- restarts the backend
- runs batch evaluation
- stores the per-experiment result
- optionally applies the best experiment at the end
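The loop above can be sketched as follows. Every callable here is a hypothetical stand-in (config writing, backend restart, and batch evaluation are environment-specific), so this only shows the control flow, not `tune_fusion.py`'s real internals:

```python
def run_experiments(experiments, apply_config, restart_backend, run_batch_eval,
                    score_metric="MAP_3", apply_best=False):
    """Apply each experiment config, evaluate, and track the best score.

    All four callables are illustrative stand-ins for the runner's real steps.
    """
    results = []
    for exp in experiments:
        apply_config(exp["config"])   # write the variant into config/config.yaml
        restart_backend()             # pick up the new fusion parameters
        report = run_batch_eval()     # batch evaluation over the query set
        results.append({"name": exp["name"], "score": report[score_metric]})
    best = max(results, key=lambda r: r["score"])
    if apply_best:                    # re-apply the winner, like --apply-best
        best_exp = next(e for e in experiments if e["name"] == best["name"])
        apply_config(best_exp["config"])
        restart_backend()
    return results, best

# Demo with stubbed steps and made-up scores, just to exercise the loop.
scores = {"baseline": 0.61, "boost_translation": 0.67}
applied = []
results, best = run_experiments(
    experiments=[{"name": n, "config": {"variant": n}} for n in scores],
    apply_config=lambda cfg: applied.append(cfg["variant"]),
    restart_backend=lambda: None,
    run_batch_eval=lambda: {"MAP_3": scores[applied[-1]]},
    apply_best=True,
)
```

Sequential application matters here: each experiment must fully restart and evaluate before the next config is written, since they all share one `config/config.yaml`.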
## Current Practical Recommendation
For day-to-day evaluation:
- refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
- run the web UI or normal batch evaluation in cached mode
- only force-refresh labels again when:
  - the query set changes
  - the product corpus changes materially
  - the labeling logic changes
## Caveats
- The current label cache is query-specific, not a full all-products all-queries matrix.
- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
## Related Requirement Docs
- `README_Requirement.md`
- `README_Requirement_zh.md`
These documents describe the original problem statement. This README.md describes the implemented framework and the current recommended workflow.