
Search Evaluation Framework

This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

Design: Build labels offline for a fixed query set (queries/queries.txt). Single-query and batch evaluation map recalled spu_id values to the SQLite cache. Items without cached labels are scored as Irrelevant, and the UI/API surfaces tips when coverage is incomplete.
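A minimal sketch of this cache-backed scoring, assuming an illustrative labels table (the real schema inside search_eval.sqlite3 may differ):

```python
import sqlite3

def score_hits(conn, tenant_id, query, spu_ids):
    """Map recalled spu_ids to cached labels.

    Items without a cached label score as 'Irrelevant', matching the
    framework's behavior. Table and column names are illustrative only.
    """
    labels = {}
    for spu_id in spu_ids:
        row = conn.execute(
            "SELECT label FROM labels WHERE tenant_id=? AND query=? AND spu_id=?",
            (tenant_id, query, spu_id),
        ).fetchone()
        labels[spu_id] = row[0] if row else "Irrelevant"  # uncached -> Irrelevant
    return labels
```

When coverage is incomplete, the count of `"Irrelevant"` fallbacks is exactly what drives the UI/API coverage tips.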

What it does

  1. Build an annotation set for a fixed query set.
  2. Evaluate live search results against cached labels.
  3. Run batch evaluation and keep historical reports with config snapshots.
  4. Tune fusion parameters in a reproducible loop.

Layout

  • eval_framework/ — Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (static/), CLI
  • build_annotation_set.py — CLI entry (build / batch / audit)
  • serve_eval_web.py — Web server for the evaluation UI
  • tune_fusion.py — Applies config variants, restarts the backend, runs batch eval, stores experiment reports
  • fusion_experiments_shortlist.json — Compact experiment set for tuning
  • fusion_experiments_round1.json — Broader first-round experiments
  • queries/queries.txt — Canonical evaluation queries
  • README_Requirement.md — Product/requirements reference
  • quick_start_eval.sh — Wrapper for batch, batch-rebuild (deep build + --force-refresh-labels), or serve
  • ../start_eval_web.sh — Same as serve but sources activate.sh; prefer ./scripts/service_ctl.sh start eval-web (default port 6010, override with EVAL_WEB_PORT). ./run.sh all includes eval-web.

Quick start (repo root)

Set the tenant if needed (export TENANT_ID=163). You need a live search API, a running backend, and DashScope credentials whenever new LLM labels must be generated.

# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# UI: http://127.0.0.1:6010/
./scripts/evaluation/quick_start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web

Explicit equivalents:

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 500 \
  --rerank-depth 10000 \
  --force-refresh-rerank \
  --force-refresh-labels \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010

Each batch run walks the full queries file. With batch --force-refresh-labels, every live top-k hit is re-judged by the LLM.

Rebuild (build --force-refresh-labels), per query:

  1. Take search top 500 as the recall pool; pool SKUs are treated as rerank score 1 and are not sent to the reranker.
  2. Rerank the rest of the tenant corpus. If more than 1000 non-pool docs have rerank score > 0.5, skip the query (logged as too easy / tail too relevant).
  3. Otherwise merge pool (search order) + non-pool (rerank score descending).
  4. LLM-judge in batches of 50, logging exact_ratio and irrelevant_ratio per batch. Stop after 3 consecutive batches with irrelevant_ratio > 92%, but only after at least 15 batches and at most 40 batches.
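The batched early-stop judging can be sketched as follows; the constants mirror the thresholds stated above, and judge_batch is a placeholder for the real LLM call:

```python
BATCH = 50
MIN_BATCHES, MAX_BATCHES = 15, 40
STOP_STREAK, STOP_RATIO = 3, 0.92

def judge_with_early_stop(candidates, judge_batch):
    """Label merged candidates in batches of 50.

    Stops after 3 consecutive batches whose irrelevant_ratio exceeds 92%,
    but only between the 15-batch floor and 40-batch ceiling.
    judge_batch(batch) -> list of 'Exact'/'Partial'/'Irrelevant' labels.
    """
    labels, streak = {}, 0
    for i in range(MAX_BATCHES):
        batch = candidates[i * BATCH:(i + 1) * BATCH]
        if not batch:
            break  # ran out of candidates
        judged = judge_batch(batch)
        labels.update(zip(batch, judged))
        irrelevant_ratio = judged.count("Irrelevant") / len(judged)
        streak = streak + 1 if irrelevant_ratio > STOP_RATIO else 0
        if i + 1 >= MIN_BATCHES and streak >= STOP_STREAK:
            break  # tail is almost all irrelevant; stop spending LLM calls
    return labels
```

With an all-irrelevant tail the loop stops at exactly the 15-batch floor; with a relevant tail it runs to the 40-batch ceiling.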

Artifacts

Default root: artifacts/search_evaluation/

  • search_eval.sqlite3 — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
  • query_builds/ — per-query pooled build outputs
  • batch_reports/ — batch JSON, Markdown, config snapshots
  • audits/ — label-quality audit summaries
  • tuning_runs/ — fusion experiment outputs and config snapshots

Labels

  • Exact — Matches intended product type and all explicit required attributes.
  • Partial — Main intent matches; attributes missing, approximate, or weaker.
  • Irrelevant — Type mismatch or conflicting required attributes.

Labeler modes:

  • simple (default) — one judging pass per batch with the standard relevance prompt.
  • complex — query-profile extraction plus extra guardrails (for structured experiments).

Flows

Standard: Run batch without --force-refresh-labels to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to no auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as Irrelevant.
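The cached-only scoring above can be illustrated with a toy metric; the metric name and shape are illustrative, and the real reports may aggregate differently:

```python
def relevance_at_k(labels_in_rank_order, k=10):
    """Fraction of top-k hits labeled Exact or Partial.

    Unlabeled hits have already been mapped to 'Irrelevant' by the cache
    lookup, so incomplete coverage drags this score down. Illustrative
    metric, not necessarily what the batch reports compute.
    """
    top = labels_in_rank_order[:k]
    return sum(label in ("Exact", "Partial") for label in top) / max(len(top), 1)
```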

Incremental pool (no full rebuild): build_annotation_set.py build without --force-refresh-labels merges search and full-corpus rerank windows before labeling (CLI --search-depth, --rerank-depth, --annotate-*-top-k). Full rebuild uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via --search-recall-top-k, --rerank-high-threshold, --rerank-high-skip-count, --rebuild-* flags on build.
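A hedged sketch of the incremental window merge, search-first and deduplicated; the merge order is an assumption, and the parameter names only mirror the --annotate-*-top-k flags:

```python
def merge_windows(search_hits, rerank_hits,
                  annotate_search_top_k=50, annotate_rerank_top_k=50):
    """Merge the search window with the full-corpus rerank window before
    labeling, keeping first occurrence of each spu_id (search wins ties)."""
    merged, seen = [], set()
    for spu_id in (search_hits[:annotate_search_top_k]
                   + rerank_hits[:annotate_rerank_top_k]):
        if spu_id not in seen:
            seen.add(spu_id)
            merged.append(spu_id)
    return merged
```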

Fusion tuning: tune_fusion.py writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see --experiments-file, --score-metric, --apply-best).
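The tuning loop can be sketched as below; every callable here is a placeholder for what tune_fusion.py actually does (writing configs, restarting the backend, running batch eval):

```python
def run_experiments(variants, apply_config, run_batch_eval, score_metric="ndcg"):
    """For each named config variant: apply it (which in the real runner
    includes a backend restart), run a batch evaluation, and track the
    best-scoring variant. Returns (variant_name, score)."""
    best = None
    for name, config in variants.items():
        apply_config(config)       # placeholder: write config + restart backend
        report = run_batch_eval()  # placeholder: returns a metrics dict
        score = report[score_metric]
        if best is None or score > best[1]:
            best = (name, score)
    return best
```

With --apply-best, the real runner would then re-apply the winning variant's config; here that is left to the caller.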

Audit

./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

Web UI

Features: query list from queries.txt, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.

Batch reports

Each run stores aggregate and per-query metrics, the label distribution, a timestamp, and an /admin/config snapshot as Markdown and JSON under batch_reports/.

Caveats

  • Labels are keyed by (tenant_id, query, spu_id), not a full corpus×query matrix.
  • Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
  • Backend restarts in automated tuning may need a short settle time before requests.
  • README_Requirement.md and README_Requirement_zh.md cover the requirements background; this README describes the implemented stack and how to run it.