
Search Evaluation Framework

This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

Design: Build labels offline for one or more named evaluation datasets. Each dataset has a stable dataset_id backed by a query file registered in config.yaml -> search_evaluation.datasets. Single-query and batch evaluation map recalled spu_id values to the shared SQLite cache. Items without cached labels are scored as Irrelevant, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system with a multi-metric primary scorecard instead of a single headline metric.
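The cached-label lookup described above can be sketched as follows. This is an illustrative snippet, not the package's API: the function name, the in-memory `label_cache`, and the grade mapping are assumptions mirroring the tier names used in this README; the real store is the shared SQLite cache.

```python
# Hypothetical sketch: score live hits against the shared label cache.
# Any (query, spu_id) pair without a cached label defaults to Irrelevant,
# and coverage is reported so the UI can surface a tip when it is incomplete.
GRADES = {"Fully Relevant": 3, "Mostly Relevant": 2, "Weakly Relevant": 1, "Irrelevant": 0}

def grade_hits(query, spu_ids, label_cache):
    """label_cache maps (query, spu_id) -> tier name (stand-in for SQLite)."""
    labels = [label_cache.get((query, spu), "Irrelevant") for spu in spu_ids]
    coverage = sum((query, spu) in label_cache for spu in spu_ids) / max(len(spu_ids), 1)
    return [GRADES[l] for l in labels], coverage

cache = {("red dress", "spu1"): "Fully Relevant", ("red dress", "spu3"): "Weakly Relevant"}
grades, cov = grade_hits("red dress", ["spu1", "spu2", "spu3"], cache)
# grades == [3, 0, 1]; cov ~ 0.67 (spu2 is unjudged and counted as Irrelevant)
```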

What it does

  1. Build an annotation set for a named evaluation dataset.
  2. Evaluate live search results against cached labels.
  3. Run batch evaluation and keep historical reports with config snapshots.
  4. Tune fusion parameters in a reproducible loop.

Layout

  • eval_framework/ - package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (static/), CLI
  • build_annotation_set.py - CLI entry (build / batch / audit)
  • serve_eval_web.py - web server for the evaluation UI
  • tune_fusion.py - applies config variants, restarts the backend, runs batch eval, stores experiment reports
  • fusion_experiments_shortlist.json - compact experiment set for tuning
  • fusion_experiments_round1.json - broader first-round experiments
  • queries/queries.txt - legacy core query set (dataset_id=core_queries)
  • queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered - expanded clothing dataset (dataset_id=clothing_top771)
  • README_Requirement.md - product/requirements reference
  • start_eval.sh - wrapper: batch, batch-rebuild (deep build + --force-refresh-labels), or serve
  • ../start_eval_web.sh - same as serve, with activate.sh sourced; prefer ./scripts/service_ctl.sh start eval-web (default port 6010, override with EVAL_WEB_PORT). ./run.sh all also starts eval-web.

Quick start (repo root)

Set the tenant if needed (export TENANT_ID=163). To switch datasets, export REPO_EVAL_DATASET_ID or pass --dataset-id. Prerequisites: a live search API, a running backend, and DashScope credentials whenever new LLM labels must be generated.

# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/start_eval.sh batch

# switch to the 771-query clothing dataset
REPO_EVAL_DATASET_ID=clothing_top771 ./scripts/evaluation/start_eval.sh batch

# Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive)
./scripts/evaluation/start_eval.sh batch-rebuild

# UI: http://127.0.0.1:6010/
./scripts/evaluation/start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web

Explicit equivalents:

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --dataset-id core_queries \
  --top-k 50 \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id "${TENANT_ID:-163}" \
  --dataset-id core_queries \
  --search-depth 500 \
  --rerank-depth 10000 \
  --force-refresh-rerank \
  --force-refresh-labels \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --dataset-id core_queries \
  --host 127.0.0.1 \
  --port 6010

Each batch run walks the full queries file and writes a batch report under batch_reports/. With batch --force-refresh-labels, every live top-k hit is re-judged by the LLM; this still touches only those hits and is not the deep rebuild pipeline.

start_eval.sh batch-rebuild (deep annotation rebuild)

This runs build_annotation_set.py build with --force-refresh-labels and --force-refresh-rerank (see the explicit build command under "Explicit equivalents" above). It does not run the batch subcommand, so there is no aggregate batch report for this step; outputs are per-query JSON under query_builds/ plus updates in search_eval.sqlite3.

For each query in queries.txt, in order:

  1. Search recall — Call the live search API with size = max(--search-depth, --search-recall-top-k) (the wrapper uses --search-depth 500). The first --search-recall-top-k hits (default 200, see eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K) form the recall pool; they are treated as rerank score 1 and are not sent to the reranker.
  2. Full corpus — Load the tenant’s product corpus from Elasticsearch (same tenant as TENANT_ID / --tenant-id, default 163), via corpus_docs() (cached in SQLite after the first load).
  3. Rerank outside pool — Every corpus document whose spu_id is not in the pool is scored by the reranker API, 80 documents per request. With --force-refresh-rerank, all those scores are recomputed and written to the rerank_scores table in search_eval.sqlite3. Without that flag, existing (tenant_id, query, spu_id) scores are reused and only missing rows hit the API.
  4. Skip “too easy” queries — If more than 1000 non-pool documents have rerank score > 0.5, that query is skipped (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
  5. Global sort — Order to label: pool in search rank order, then all remaining corpus docs in descending rerank score (dedupe by spu_id, pool wins).
  6. LLM labeling — Walk that list from the head in batches of 50 by default (--rebuild-llm-batch-size). Each batch log includes exact_ratio, irrelevant_ratio, low_ratio, and irrelevant_plus_low_ratio.
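The global sort in step 5 can be sketched as below. The function and variable names are illustrative, not the package's own; only the ordering rule (pool in search rank order first, remaining corpus docs by descending rerank score, dedupe by spu_id with the pool winning) comes from the description above.

```python
# Sketch of the global label order (step 5): pool hits keep their search rank
# and win on spu_id collisions; everything else follows descending rerank score.
def global_label_order(pool_spus, rerank_scores):
    """pool_spus: spu_ids in search rank order; rerank_scores: spu_id -> score."""
    seen = set(pool_spus)  # pool wins on duplicates
    tail = sorted((s for s in rerank_scores if s not in seen),
                  key=lambda s: rerank_scores[s], reverse=True)
    return list(pool_spus) + tail

order = global_label_order(["a", "b"], {"b": 0.99, "c": 0.4, "d": 0.9})
# order == ["a", "b", "d", "c"]  ("b" is deduped into the pool)
```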

Early stop (defaults in eval_framework.constants; overridable via CLI):

  • Run at least --rebuild-min-batches batches (10 by default) before any early stop is allowed.
  • After that, a bad batch is one where both are true (strict >):
    • Irrelevant proportion > 93.9% (--rebuild-irrelevant-stop-ratio, default 0.939), and
    • (Irrelevant + Weakly Relevant) proportion > 95.9% (--rebuild-irrel-low-combined-stop-ratio, default 0.959).
      (“Weakly Relevant” is the weak tier; Mostly Relevant and Exact do not enter this sum.)
  • Count consecutive bad batches. Reset the count to 0 on any batch that is not bad.
  • Stop when the consecutive bad count reaches --rebuild-irrelevant-stop-streak (3 by default), or when --rebuild-max-batches (40) is reached, whichever comes first (at the default batch size, at most 2000 docs per query).

So labeling follows best-first order but stops early once three consecutive batches exceed both the Irrelevant and the combined Irrelevant + Weakly Relevant thresholds; the tail may never be judged.
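The early-stop rule can be sketched as follows, under one plausible reading of the defaults quoted above (the real logic lives in build_annotation_set.py / eval_framework; the function name and the choice to let the streak accumulate before --rebuild-min-batches are assumptions of this sketch).

```python
# Sketch of the early-stop rule with the documented defaults:
# min 10 batches, strict > thresholds 0.939 / 0.959, stop streak 3, max 40 batches.
def should_stop(batch_stats, min_batches=10, irr=0.939, irr_low=0.959,
                streak_len=3, max_batches=40):
    """batch_stats: list of (irrelevant_ratio, irrelevant_plus_low_ratio) per batch."""
    streak = 0
    for i, (r_irr, r_comb) in enumerate(batch_stats, start=1):
        bad = r_irr > irr and r_comb > irr_low  # both conditions, strict >
        streak = streak + 1 if bad else 0       # reset on any non-bad batch
        if i >= min_batches and streak >= streak_len:
            return i                            # early stop after this batch
        if i >= max_batches:
            return i                            # hard cap
    return None                                 # labeled everything offered

stats = [(0.5, 0.6)] * 9 + [(0.95, 0.97)] * 3
# should_stop(stats) == 12: three consecutive bad batches at positions 10-12
```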

Incremental pool (no full rebuild): build_annotation_set.py build without --force-refresh-labels uses the older windowed pool (--annotate-search-top-k, --annotate-rerank-top-k) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.

Tuning the rebuild path: --search-recall-top-k, --rerank-high-threshold, --rerank-high-skip-count, --rebuild-llm-batch-size, --rebuild-min-batches, --rebuild-max-batches, --rebuild-irrelevant-stop-ratio, --rebuild-irrel-low-combined-stop-ratio, --rebuild-irrelevant-stop-streak on build (see eval_framework/cli.py). Rerank API chunk size is 80 docs per request in code (full_corpus_rerank_outside_exclude).

Artifacts

Default root: artifacts/search_evaluation/

  • search_eval.sqlite3 — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
  • datasets/<dataset_id>/query_builds/ — per-query pooled build outputs
  • datasets/<dataset_id>/batch_reports/<batch_id>/ — batch JSON, Markdown, config snapshot, dataset snapshot, query snapshot
  • datasets/<dataset_id>/audits/ — label-quality audit summaries
  • tuning_runs/ — fusion experiment outputs and config snapshots

Labels

  • Fully Relevant — Matches intended product type and all explicit required attributes.
  • Mostly Relevant — Main intent matches and is a strong substitute, but some attributes are missing, weaker, or slightly off.
  • Weakly Relevant — Only a weak substitute; may share scenario, style, or broad category but is no longer a strong match.
  • Irrelevant — Type mismatch or important conflicts make it a poor search result.

Metric design

This framework now follows graded ranking evaluation, closer to e-commerce best practice, instead of collapsing everything into binary relevance.

  • Primary scorecard: NDCG@20, NDCG@50, ERR@10, Strong_Precision@10, Strong_Precision@20, Useful_Precision@50, Avg_Grade@10, Gain_Recall@20.
  • Composite tuning score (Primary_Metric_Score): for experiment ranking we take the mean of the primary scorecard after normalizing Avg_Grade@10 by the max grade (3).
  • Gain scheme: Fully Relevant=3, Mostly Relevant=2, Weakly Relevant=1, Irrelevant=0. The rel grades stay 3/2/1/0, and the current implementation uses the grade values directly as gains, so the gap between Exact and the lower tiers is less aggressive than an exponential (2^grade - 1) scheme would be.
  • Why this is better: NDCG differentiates "exact", "strong substitute", and "weak substitute", so swapping a Fully Relevant item with a Weakly Relevant one is penalized more than swapping Mostly Relevant with Weakly Relevant.
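A minimal sketch of NDCG and ERR under the linear gain scheme above. These are the standard textbook formulas, not the package's implementation; for brevity the ideal DCG here is taken over the returned list itself rather than the full judged pool.

```python
import math

# Graded ranking metrics with linear gains (gain = grade, i.e. 3/2/1/0).
def ndcg(grades, k):
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))
    ideal = sorted(grades, reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

def err(grades, k, max_grade=3):
    # Expected Reciprocal Rank: stop probability R = (2^g - 1) / 2^max_grade
    p_continue, score = 1.0, 0.0
    for i, g in enumerate(grades[:k], start=1):
        r = (2 ** g - 1) / (2 ** max_grade)
        score += p_continue * r / i
        p_continue *= 1 - r
    return score

assert ndcg([3, 3, 2], 3) == 1.0  # a perfectly ordered list scores 1
```

Under this scheme, demoting a Fully Relevant item (gain 3) below a Weakly Relevant one (gain 1) costs more NDCG than swapping Mostly Relevant (gain 2) with Weakly Relevant, which is exactly the behavior the scorecard is designed to reward.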

The reported metrics are:

  • Primary_Metric_Score — Mean score over the primary scorecard (with Avg_Grade@10 divided by 3).
  • NDCG@20, NDCG@50, ERR@10, Strong_Precision@10, Strong_Precision@20, Useful_Precision@50, Avg_Grade@10, Gain_Recall@20 — Primary scorecard for optimization decisions.
  • NDCG@5, NDCG@10, ERR@5, ERR@20, ERR@50 — Secondary graded ranking quality slices.
  • Exact_Precision@K — Strict top-slot quality when only Fully Relevant counts.
  • Strong_Precision@K — Business-facing top-slot quality where Fully Relevant + Mostly Relevant count as strong positives.
  • Useful_Precision@K — Broader usefulness where any non-irrelevant result counts.
  • Gain_Recall@K — Gain captured in the returned list versus the judged label pool for the query.
  • Exact_Success@K / Strong_Success@K — Whether at least one exact or strong result appears in the first K.
  • MRR_Exact@10 / MRR_Strong@10 — How early the first exact or strong result appears.
  • Avg_Grade@10 — Average relevance grade of the visible first page.
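The composite score described above reduces to a small aggregation. The metric keys mirror the names in this README; the function itself and the dict layout are illustrative assumptions, not the package's API.

```python
# Sketch of Primary_Metric_Score: mean of the primary scorecard,
# with Avg_Grade@10 mapped onto [0, 1] by dividing by the max grade (3).
PRIMARY = ["NDCG@20", "NDCG@50", "ERR@10", "Strong_Precision@10",
           "Strong_Precision@20", "Useful_Precision@50", "Avg_Grade@10", "Gain_Recall@20"]

def primary_metric_score(metrics):
    vals = [metrics[k] / 3 if k == "Avg_Grade@10" else metrics[k] for k in PRIMARY]
    return sum(vals) / len(vals)

m = {k: 1.0 for k in PRIMARY}
m["Avg_Grade@10"] = 3.0  # a perfect first page averages grade 3
# primary_metric_score(m) == 1.0
```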

Labeler modes:

  • simple (default) — one judging pass per batch with the standard relevance prompt.
  • complex — query-profile extraction plus extra guardrails (for structured experiments).

Flows

Standard: Run batch without --force-refresh-labels to extend coverage, then use the UI or run batch in cached mode. Single-query evaluation defaults to no auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as Irrelevant.

Rebuild vs incremental build: Deep rebuild is documented in the batch-rebuild subsection above. Incremental build (without --force-refresh-labels) uses --annotate-search-top-k / --annotate-rerank-top-k windows instead.

Fusion tuning: tune_fusion.py writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see --experiments-file, --score-metric, --apply-best).

Audit

./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

Web UI

Features: dataset selector, dataset-scoped query list, single-query and batch evaluation, dataset-scoped batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits.

Batch reports

Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an /admin/config snapshot, as Markdown and JSON under batch_reports/.

To make later case analysis reproducible without digging through backend logs, each per-query record in the batch JSON now also includes:

  • request_id — the exact X-Request-ID sent by the evaluator for that live search call
  • top_label_sequence_top10 / top_label_sequence_top20 — compact label sequence strings such as 1:L3 | 2:L1 | 3:L2
  • top_results — a lightweight top-20 snapshot with rank, spu_id, label, title fields, and relevance_score
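The compact label-sequence string can be produced with a one-liner like the following. The helper name is hypothetical, and the assumption that L<n> encodes the numeric relevance grade is inferred from the example string above.

```python
# Sketch: format "1:L3 | 2:L1 | 3:L2"-style sequences from per-rank grades.
def label_sequence(grades, k=10):
    return " | ".join(f"{i}:L{g}" for i, g in enumerate(grades[:k], start=1))

seq = label_sequence([3, 1, 2, 0], k=3)
# seq == "1:L3 | 2:L1 | 3:L2"
```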

The Markdown report now surfaces the same case context in a lighter human-readable form:

  • request id
  • top-10 / top-20 label sequence
  • top 5 result snapshot for quick scanning

This means a bad case can usually be reconstructed directly from the batch artifact itself, without replaying logs or joining SQLite tables by hand.

The web history endpoint intentionally returns a compact summary only (aggregate metrics plus query count), so adding richer per-query snapshots to the batch payload does not bloat the history list UI.

Ranking debug and LTR prep

debug_info now exposes two extra layers that are useful for tuning and future learning-to-rank work:

  • retrieval_plan — the effective text/image KNN plan for the query (k, num_candidates, boost, and whether the long-query branch was used).
  • ltr_summary — query-level summary over the visible top results: how many docs came from translation matches, text KNN, image KNN, fallback-to-ES cases, plus mean signal values.

At the document level, debug_info.per_result[*] and ranking_funnel.*.ltr_features include stable features such as:

  • es_score, text_score, knn_score, rerank_score, fine_score
  • source_score, translation_score, text_knn_score, image_knn_score
  • has_translation_match, has_text_knn, has_image_knn, text_score_fallback_to_es

This makes it easier to inspect bad cases and also gives us a near-direct feature inventory for downstream LTR experiments.
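Turning that inventory into a feature matrix for offline LTR experiments could look like the sketch below. The field names come from the list above; the function, the ordering, and the choice to default missing values to 0 are illustrative assumptions rather than part of the implementation.

```python
# Sketch: assemble an LTR feature matrix from debug_info.per_result entries.
FEATURES = ["es_score", "text_score", "knn_score", "rerank_score", "fine_score",
            "source_score", "translation_score", "text_knn_score", "image_knn_score",
            "has_translation_match", "has_text_knn", "has_image_knn",
            "text_score_fallback_to_es"]

def feature_rows(per_result):
    # Booleans become 1.0/0.0; absent or None fields default to 0.0 (a choice
    # made here for illustration, not taken from the backend).
    return [[float(doc.get(f, 0) or 0) for f in FEATURES] for doc in per_result]

rows = feature_rows([{"es_score": 1.2, "has_text_knn": True}])
# rows[0][0] == 1.2 and rows[0][10] == 1.0; every other entry is 0.0
```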

Caveats

  • Labels are keyed by (tenant_id, query, spu_id), not a full corpus×query matrix.
  • Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
  • Backend restarts in automated tuning may need a short settle time before requests.
  • README_Requirement.md and README_Requirement_zh.md cover the requirements background; this file documents the implemented stack and how to run it.