
Search Evaluation Framework

This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.

It is designed around a few core rules:

  • Annotation should be built offline first.
  • Single-query evaluation should then map recalled spu_id values to the cached annotation set.
  • Recalled items without cached labels are treated as Irrelevant during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
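
In code terms, the mapping rule looks roughly like this (a minimal sketch; the function name and dict-based cache are illustrative, not the actual eval_framework.py API):

```python
def evaluate_recall(recalled_spu_ids, cached_labels):
    """Map each recalled spu_id to its cached label, defaulting to Irrelevant.

    cached_labels: dict mapping spu_id -> "Exact" | "Partial" | "Irrelevant".
    Returns per-item labels plus a coverage tip when labels are missing.
    """
    labels = [cached_labels.get(spu, "Irrelevant") for spu in recalled_spu_ids]
    missing = [spu for spu in recalled_spu_ids if spu not in cached_labels]
    tip = None
    if missing:
        tip = f"{len(missing)} recalled items have no cached label and were scored as Irrelevant"
    return labels, tip

labels, tip = evaluate_recall(
    ["spu1", "spu2", "spu3"],
    {"spu1": "Exact", "spu3": "Partial"},
)
# labels == ["Exact", "Irrelevant", "Partial"]; tip flags the uncovered item
```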

Goals

The framework supports four related tasks:

  1. Build an annotation set for a fixed query set.
  2. Evaluate a live search result list against that annotation set.
  3. Run batch evaluation and store historical reports with config snapshots.
  4. Tune fusion parameters reproducibly.

Files

  • eval_framework.py Search evaluation core implementation, CLI, FastAPI app, SQLite store, audit logic, and report generation.
  • build_annotation_set.py Thin CLI entrypoint into eval_framework.py.
  • serve_eval_web.py Thin web entrypoint into eval_framework.py.
  • tune_fusion.py Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
  • fusion_experiments_shortlist.json A compact experiment set for practical tuning.
  • fusion_experiments_round1.json A broader first-round experiment set.
  • queries/queries.txt The canonical evaluation query set.
  • README_Requirement.md Requirement reference document.
  • quick_start_eval.sh Optional wrapper: batch (fill missing labels only), batch-rebuild (force full re-label), or serve (web UI).
  • ../start_eval_web.sh Same as serve, but loads activate.sh first. Used by ./scripts/service_ctl.sh start eval-web (port EVAL_WEB_PORT, default 6010); ./run.sh all starts eval-web alongside the other core services.

Quick start (from repo root)

Set the tenant if needed (export TENANT_ID=163). Requires the live search API, a working backend, and a DashScope key whenever the batch step needs new LLM labels.

# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
./scripts/evaluation/quick_start_eval.sh serve
# Or: ./scripts/service_ctl.sh start eval-web

Equivalent explicit commands:

# Safe default: no --force-refresh-labels
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple \
  --force-refresh-labels

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010

Batch behavior: There is no “skip queries already processed”. Each run walks the full queries file. With --force-refresh-labels, for every query the runner issues a live search and sends all top_k returned spu_ids through the LLM again (SQLite rows are upserted). Omit --force-refresh-labels if you only want to fill in labels that are missing for the current recall window.
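
The label-selection rule can be sketched as follows (illustrative only; select_pairs_to_label is a hypothetical helper, not part of the real CLI):

```python
def select_pairs_to_label(query, recalled_spu_ids, cached, force_refresh=False):
    """Decide which (query, spu_id) pairs must be sent to the LLM.

    cached: set of spu_ids that already have a label for this query.
    With force_refresh=True every recalled spu_id is re-labeled (rows are
    later upserted); otherwise only spu_ids missing a cached label are sent.
    """
    if force_refresh:
        return [(query, spu) for spu in recalled_spu_ids]
    return [(query, spu) for spu in recalled_spu_ids if spu not in cached]

pairs = select_pairs_to_label("red dress", ["a", "b", "c"], cached={"b"})
# -> [("red dress", "a"), ("red dress", "c")]
```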

Storage Layout

All generated artifacts are under:

  • /data/saas-search/artifacts/search_evaluation

Important subpaths:

  • /data/saas-search/artifacts/search_evaluation/search_eval.sqlite3 Main cache and annotation store.
  • /data/saas-search/artifacts/search_evaluation/query_builds Per-query pooled annotation-set build artifacts.
  • /data/saas-search/artifacts/search_evaluation/batch_reports Batch evaluation JSON, Markdown reports, and config snapshots.
  • /data/saas-search/artifacts/search_evaluation/audits Audit summaries for label quality checks.
  • /data/saas-search/artifacts/search_evaluation/tuning_runs Fusion experiment summaries and per-experiment config snapshots.

SQLite Schema Summary

The main tables in search_eval.sqlite3 are:

  • corpus_docs Cached product corpus for the tenant.
  • rerank_scores Cached full-corpus reranker scores keyed by (tenant_id, query_text, spu_id).
  • relevance_labels Cached LLM relevance labels keyed by (tenant_id, query_text, spu_id).
  • query_profiles Structured query-intent profiles extracted before labeling.
  • build_runs Per-query pooled-build records.
  • batch_runs Batch evaluation history.
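
The documented (tenant_id, query_text, spu_id) key and upsert behavior can be illustrated with a minimal sqlite3 sketch (any column beyond that key is an assumption, not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real store lives in search_eval.sqlite3
conn.execute("""
    CREATE TABLE IF NOT EXISTS relevance_labels (
        tenant_id  TEXT NOT NULL,
        query_text TEXT NOT NULL,
        spu_id     TEXT NOT NULL,
        label      TEXT NOT NULL,   -- Exact | Partial | Irrelevant
        PRIMARY KEY (tenant_id, query_text, spu_id)
    )
""")
# Upsert: a re-label overwrites the previous row instead of duplicating it
conn.execute(
    "INSERT INTO relevance_labels VALUES (?, ?, ?, ?) "
    "ON CONFLICT(tenant_id, query_text, spu_id) DO UPDATE SET label = excluded.label",
    ("163", "red dress", "spu1", "Partial"),
)
```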

Label Semantics

Three labels are used throughout:

  • Exact Fully matches the intended product type and all explicit required attributes.
  • Partial Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
  • Irrelevant Product type mismatches, or explicit required attributes conflict.

The framework always uses:

  • LLM-based batched relevance classification
  • caching and retry logic for robust offline labeling

There are now two labeler modes:

  • simple Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
  • complex Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
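
Both modes share the caching and retry behavior. A retry wrapper along these lines (names illustrative, not the actual implementation) is the core of robust offline labeling:

```python
import time

def label_batch_with_retry(llm_call, batch, max_attempts=3, backoff_s=1.0):
    """Call the LLM once per batch, retrying transient failures with linear backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return llm_call(batch)
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * attempt)
```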

Offline-First Workflow

1. Refresh labels for the evaluation query set

For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (P@5, P@10, P@20, P@50, MAP_3, MAP_2_3), a top_k=50 cached label set is sufficient.
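
For reference, P@k over cached labels is straightforward. The sketch below treats both Exact and Partial as relevant, which is an assumption for illustration, not necessarily how the framework defines its metrics:

```python
def precision_at_k(labels, k, relevant=("Exact", "Partial")):
    """P@k over the first k recalled labels.

    Which labels count as relevant is configurable here because the
    framework's exact definition is not restated in this README.
    """
    top = labels[:k]
    return sum(1 for label in top if label in relevant) / k

p5 = precision_at_k(["Exact", "Partial", "Irrelevant", "Exact", "Irrelevant"], 5)
# -> 0.6
```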

Example (fills missing labels only; recommended default):

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

To rebuild every label for the current top_k recall window (all queries, all hits re-sent to the LLM), add --force-refresh-labels or run ./scripts/evaluation/quick_start_eval.sh batch-rebuild.

This command does two things:

  • runs every query in the file against the live backend (no skip list)
  • with --force-refresh-labels, re-labels all top_k hits per query via the LLM and upserts SQLite; without the flag, only spu_ids lacking a cached label are sent to the LLM

After this step, single-query evaluation can run in cached mode without calling the LLM again.

2. Optional pooled build

The framework also supports a heavier pooled build that combines:

  • top search results
  • top full-corpus reranker results

Example:

./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 1000 \
  --rerank-depth 10000 \
  --annotate-search-top-k 100 \
  --annotate-rerank-top-k 120 \
  --language en

This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
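
Conceptually the pooled build is a de-duplicated union of the two candidate slices; a sketch (function name and signature are illustrative):

```python
def build_pool(search_hits, rerank_hits, annotate_search_top_k, annotate_rerank_top_k):
    """Union the top slices of both candidate sources, preserving first-seen order."""
    pool = []
    seen = set()
    for spu in search_hits[:annotate_search_top_k] + rerank_hits[:annotate_rerank_top_k]:
        if spu not in seen:
            seen.add(spu)
            pool.append(spu)
    return pool
```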

Why Single-Query Evaluation Was Slow

If single-query evaluation is slow, the usual reason is that it is still running with auto_annotate=true, which means:

  • perform live search
  • detect recalled but unlabeled products
  • call the LLM to label them

That is not the intended steady-state evaluation path.

The UI/API is now configured to prefer cached evaluation:

  • default single-query evaluation uses auto_annotate=false
  • unlabeled recalled results are treated as Irrelevant
  • the response includes tips explaining that coverage gap

If you want stable, fast evaluation:

  1. prebuild labels offline
  2. use cached single-query evaluation

Web UI

Start the evaluation UI:

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010

The UI provides:

  • query list loaded from queries.txt
  • single-query evaluation
  • batch evaluation
  • history of batch reports
  • top recalled results
  • missed Exact and Partial products that were not recalled
  • tips about unlabeled hits treated as Irrelevant

Single-query response behavior

For a single query:

  1. live search returns recalled spu_id values
  2. the framework looks up cached labels by (query, spu_id)
  3. unlabeled recalled items are counted as Irrelevant
  4. cached Exact and Partial products that were not recalled are listed under Missed Exact / Partial

This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
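
Step 4 above is a simple set difference over the cached labels, sketched here with an illustrative helper name:

```python
def missed_relevant(cached_labels, recalled_spu_ids):
    """Cached Exact/Partial products that the live search did not recall."""
    recalled = set(recalled_spu_ids)
    return {
        spu: label
        for spu, label in cached_labels.items()
        if label in ("Exact", "Partial") and spu not in recalled
    }
```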

CLI Commands

Build pooled annotation artifacts

./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...

Run batch evaluation

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

Use --force-refresh-labels if you want to rebuild the offline label cache for the recalled window first.

Audit annotation quality

./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

This checks cached labels against current guardrails and reports suspicious cases.

Batch Reports

Each batch run stores:

  • aggregate metrics
  • per-query metrics
  • label distribution
  • timestamp
  • config snapshot from /admin/config

Reports are written as:

  • Markdown for easy reading
  • JSON for downstream processing
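
Writing one run in both formats is straightforward; the sketch below assumes a report dict with timestamp and aggregate_metrics keys, which is an assumption about the real report shape:

```python
import json

def write_reports(report, json_path, md_path):
    """Persist one batch run as JSON (machine-readable) and Markdown (human-readable)."""
    with open(json_path, "w") as f:
        json.dump(report, f, indent=2)
    lines = [f"# Batch report {report['timestamp']}", ""]
    for name, value in report["aggregate_metrics"].items():
        lines.append(f"- {name}: {value:.4f}")
    with open(md_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```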

Fusion Tuning

The tuning runner applies experiment configs sequentially and records the outcome.

Example:

./.venv/bin/python scripts/evaluation/tune_fusion.py \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
  --score-metric MAP_3 \
  --apply-best

What it does:

  1. writes an experiment config into config/config.yaml
  2. restarts backend
  3. runs batch evaluation
  4. stores the per-experiment result
  5. optionally applies the best experiment at the end
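
The loop can be sketched as follows; the three callables stand in for the real config-write, restart, and batch-evaluation steps and are illustrative only:

```python
def run_experiments(experiments, apply_config, restart_backend, run_batch_eval,
                    score_metric="MAP_3"):
    """Apply each experiment config, re-evaluate, and track the best score."""
    best = None
    results = []
    for exp in experiments:
        apply_config(exp["config"])   # stand-in for writing config/config.yaml
        restart_backend()             # stand-in for the backend restart step
        metrics = run_batch_eval()    # stand-in for one batch evaluation
        results.append({"name": exp["name"], "metrics": metrics})
        if best is None or metrics[score_metric] > best["metrics"][score_metric]:
            best = results[-1]
    return results, best
```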

Current Practical Recommendation

For day-to-day evaluation:

  1. refresh the offline labels for the fixed query set with batch --force-refresh-labels
  2. run the web UI or normal batch evaluation in cached mode
  3. only force-refresh labels again when:
    • the query set changes
    • the product corpus changes materially
    • the labeling logic changes

Caveats

  • The current label cache is query-specific; it does not cover the full products × queries matrix.
  • Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
  • The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
  • Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.

Related Documents

  • README_Requirement.md
  • README_Requirement_zh.md

These documents describe the original problem statement. This README.md describes the implemented framework and the current recommended workflow.