# Search Evaluation Framework

This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.

## What it does

1. Build an annotation set for a fixed query set.
2. Evaluate live search results against cached labels.
3. Run batch evaluation and keep historical reports with config snapshots.
4. Tune fusion parameters in a reproducible loop.

## Layout

| Path | Role |
|------|------|
| `eval_framework/` | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (`static/`), CLI |
| `build_annotation_set.py` | CLI entry (build / batch / audit) |
| `serve_eval_web.py` | Web server for the evaluation UI |
| `tune_fusion.py` | Applies config variants, restarts the backend, runs batch eval, stores experiment reports |
| `fusion_experiments_shortlist.json` | Compact experiment set for tuning |
| `fusion_experiments_round1.json` | Broader first-round experiments |
| `queries/queries.txt` | Canonical evaluation queries |
| `README_Requirement.md` | Product/requirements reference |
| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild`, or `serve` |
| `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |

## Quick start (repo root)

Set the tenant if needed (`export TENANT_ID=163`). You need a live search API, DashScope when new LLM labels are required, and a running backend.
```bash
# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Full re-label of current top_k recall (expensive)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# UI: http://127.0.0.1:6010/
./scripts/evaluation/quick_start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web
```

Explicit equivalents:

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  ... same args ... \
  --force-refresh-labels

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```

Each batch run walks the full queries file. With `--force-refresh-labels`, every recalled `spu_id` in the window is re-sent to the LLM and upserted. Without it, only missing labels are filled.

## Artifacts

Default root: `artifacts/search_evaluation/`

- `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
- `query_builds/` — per-query pooled build outputs
- `batch_reports/` — batch JSON, Markdown, config snapshots
- `audits/` — label-quality audit summaries
- `tuning_runs/` — fusion experiment outputs and config snapshots

## Labels

- **Exact** — Matches the intended product type and all explicit required attributes.
- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
- **Irrelevant** — Type mismatch or conflicting required attributes.

**Labeler modes:**

- `simple` (default): one judging pass per batch with the standard relevance prompt.
- `complex`: query-profile extraction plus extra guardrails (for structured experiments).
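The scoring rule implied by the labels above (unlabeled hits count as `Irrelevant`) can be sketched in a few lines. This is an illustrative sketch only: the function name `score_recall` and the gain weights are assumptions, not the framework's actual metric implementation.

```python
# Illustrative gain weights per label -- not the framework's real metric.
LABEL_GAIN = {"Exact": 1.0, "Partial": 0.5, "Irrelevant": 0.0}

def score_recall(recalled_spu_ids, cached_labels):
    """Score one recall list against cached labels.

    `cached_labels` maps spu_id -> label for a single (tenant_id, query)
    pair. Hits without a cached label are scored as Irrelevant, mirroring
    the framework's behaviour when coverage is incomplete.
    """
    n = max(len(recalled_spu_ids), 1)
    labels = [cached_labels.get(spu, "Irrelevant") for spu in recalled_spu_ids]
    coverage = sum(spu in cached_labels for spu in recalled_spu_ids) / n
    score = sum(LABEL_GAIN[label] for label in labels) / n
    return {"coverage": coverage, "score": score, "labels": labels}
```

Low `coverage` is the signal the UI surfaces as a tip: the score is only trustworthy once most recalled items have cached labels.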
## Flows

**Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.

**Deeper pool:** `build_annotation_set.py build` merges deep search and full-corpus rerank windows before labeling (see the CLI flags `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`).

**Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).

### Audit

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

## Web UI

Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.

## Batch reports

Each run stores aggregate and per-query metrics, the label distribution, a timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.

## Caveats

- Labels are keyed by `(tenant_id, query, spu_id)`, not a full corpus×query matrix.
- Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
- Backend restarts in automated tuning may need a short settle time before requests.

## Related docs

- `README_Requirement.md`, `README_Requirement_zh.md` — requirements background; this file describes the implemented stack and how to run it.