  # Search Evaluation Framework
  
  This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
  
  It is designed around one core rule:
  
  - Annotation should be built offline first.
  - Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
  - Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
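
The cache-mapping rule above can be sketched as a plain lookup with an `Irrelevant` fallback. This is a minimal illustration: the dict-based store below stands in for the real SQLite label cache keyed by `(query, spu_id)`.

```python
# Map recalled spu_ids to cached labels; anything unlabeled counts as Irrelevant.
# `cached_labels` stands in for the SQLite label store keyed by (query, spu_id).
def evaluate_recall(query, recalled_spu_ids, cached_labels):
    labels = []
    uncovered = []
    for spu_id in recalled_spu_ids:
        label = cached_labels.get((query, spu_id))
        if label is None:
            label = "Irrelevant"      # offline-first rule: no cached label => Irrelevant
            uncovered.append(spu_id)  # surfaced to the operator as a coverage tip
        labels.append(label)
    return labels, uncovered

labels, uncovered = evaluate_recall(
    "red dress",
    ["s1", "s2", "s3"],
    {("red dress", "s1"): "Exact", ("red dress", "s3"): "Partial"},
)
# labels == ["Exact", "Irrelevant", "Partial"], uncovered == ["s2"]
```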
  
  ## Goals
  
  The framework supports four related tasks:
  
  1. Build an annotation set for a fixed query set.
  2. Evaluate a live search result list against that annotation set.
  3. Run batch evaluation and store historical reports with config snapshots.
  4. Tune fusion parameters reproducibly.
  
  ## Files
  
  - `eval_framework.py`
    Search evaluation core implementation, CLI, FastAPI app, SQLite store, audit logic, and report generation.
  - `build_annotation_set.py`
    Thin CLI entrypoint into `eval_framework.py`.
  - `serve_eval_web.py`
    Thin web entrypoint into `eval_framework.py`.
  - `tune_fusion.py`
    Fusion experiment runner. It applies config variants, restarts backend, runs batch evaluation, and stores experiment reports.
  - `fusion_experiments_shortlist.json`
    A compact experiment set for practical tuning.
  - `fusion_experiments_round1.json`
    A broader first-round experiment set.
  - `queries/queries.txt`
    The canonical evaluation query set.
  - `README_Requirement.md`
    Requirement reference document.
  - `quick_start_eval.sh`
    Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
  - `../start_eval_web.sh`
    Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of core services.
  
  ## Quick start (from repo root)
  
  Set the tenant if needed (`export TENANT_ID=163`). Requires the live search API, a working backend, and a DashScope key whenever the batch step needs new LLM labels.
  
  ```bash
  # 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
  ./scripts/evaluation/quick_start_eval.sh batch
  
  # Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
  ./scripts/evaluation/quick_start_eval.sh batch-rebuild
  
  # 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
  ./scripts/evaluation/quick_start_eval.sh serve
  # Or: ./scripts/service_ctl.sh start eval-web
  ```
  
  Equivalent explicit commands:
  
  ```bash
  # Safe default: no --force-refresh-labels
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  
  # Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple \
    --force-refresh-labels
  
  ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
    --tenant-id "${TENANT_ID:-163}" \
    --queries-file scripts/evaluation/queries/queries.txt \
    --host 127.0.0.1 \
    --port 6010
  ```
  
  **Batch behavior:** There is no “skip queries already processed”. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
  
  ## Storage Layout
  
  All generated artifacts are under:
  
  - `/data/saas-search/artifacts/search_evaluation`
  
  Important subpaths:
  
  - `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
    Main cache and annotation store.
  - `/data/saas-search/artifacts/search_evaluation/query_builds`
    Per-query pooled annotation-set build artifacts.
  - `/data/saas-search/artifacts/search_evaluation/batch_reports`
    Batch evaluation JSON, Markdown reports, and config snapshots.
  - `/data/saas-search/artifacts/search_evaluation/audits`
    Audit summaries for label quality checks.
  - `/data/saas-search/artifacts/search_evaluation/tuning_runs`
    Fusion experiment summaries and per-experiment config snapshots.
  
  ## SQLite Schema Summary
  
  The main tables in `search_eval.sqlite3` are:
  
  - `corpus_docs`
    Cached product corpus for the tenant.
  - `rerank_scores`
    Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
  - `relevance_labels`
    Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
  - `query_profiles`
    Structured query-intent profiles extracted before labeling.
  - `build_runs`
    Per-query pooled-build records.
  - `batch_runs`
    Batch evaluation history.
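
The upsert behavior of the label cache can be sketched with the standard library `sqlite3` module. The exact column names here are assumptions for illustration; only the composite key `(tenant_id, query_text, spu_id)` is taken from the schema summary above.

```python
import sqlite3

# Minimal sketch of the relevance_labels store; real column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE relevance_labels (
    tenant_id  TEXT NOT NULL,
    query_text TEXT NOT NULL,
    spu_id     TEXT NOT NULL,
    label      TEXT NOT NULL,
    PRIMARY KEY (tenant_id, query_text, spu_id)
)""")

def upsert_label(tenant_id, query_text, spu_id, label):
    # Upsert keyed by (tenant_id, query_text, spu_id), as batch runs do.
    conn.execute(
        "INSERT INTO relevance_labels VALUES (?, ?, ?, ?) "
        "ON CONFLICT(tenant_id, query_text, spu_id) DO UPDATE SET label = excluded.label",
        (tenant_id, query_text, spu_id, label),
    )

upsert_label("163", "red dress", "s1", "Partial")
upsert_label("163", "red dress", "s1", "Exact")  # re-labeling overwrites the row
row = conn.execute("SELECT label FROM relevance_labels").fetchone()
# row == ("Exact",) and the table still holds a single row
```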
  
  ## Label Semantics
  
  Three labels are used throughout:
  
  - `Exact`
    Fully matches the intended product type and all explicit required attributes.
  - `Partial`
    Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
  - `Irrelevant`
    Product type mismatches, or explicit required attributes conflict.
  
  The framework always uses:
  
  - LLM-based batched relevance classification
  - caching and retry logic for robust offline labeling
  
  There are now two labeler modes:
  
  - `simple`
    Default. A single, self-contained LLM judging pass per batch, using the standard relevance prompt.
  - `complex`
    Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
  
  ## Offline-First Workflow
  
  ### 1. Refresh labels for the evaluation query set
  
  For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
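
A sketch of how such metrics are computed over a ranked label list. Assumption: `MAP_3` treats only `Exact` as relevant and `MAP_2_3` treats both `Partial` and `Exact` as relevant; the authoritative definitions live in `eval_framework.py`.

```python
# P@k and average-precision variants over a ranked list of labels.
def precision_at_k(labels, k, relevant={"Exact", "Partial"}):
    return sum(1 for l in labels[:k] if l in relevant) / k

def average_precision(labels, relevant):
    hits, score = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label in relevant:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0

ranked = ["Exact", "Irrelevant", "Partial", "Exact", "Irrelevant"]
p_at_5 = precision_at_k(ranked, 5)            # 3/5 = 0.6
map_3 = average_precision(ranked, {"Exact"})  # (1/1 + 2/4) / 2 = 0.75
map_2_3 = average_precision(ranked, {"Exact", "Partial"})
```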
  
  Example (fills missing labels only; recommended default):
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  ```
  
  To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
  
  This command does two things:
  
  - runs **every** query in the file against the live backend (no skip list)
  - with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
  
  After this step, single-query evaluation can run in cached mode without calling the LLM again.
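
The two labeling modes reduce to a simple filter over recalled hits. An illustrative sketch, with a plain set standing in for the SQLite cache:

```python
# Which (query, spu_id) pairs get sent to the LLM on a batch run. With
# force_refresh every recalled hit is re-sent; otherwise only uncached pairs.
def pairs_to_label(query, recalled_spu_ids, cached, force_refresh=False):
    if force_refresh:
        return list(recalled_spu_ids)
    return [s for s in recalled_spu_ids if (query, s) not in cached]

cached = {("red dress", "s1"), ("red dress", "s2")}
assert pairs_to_label("red dress", ["s1", "s2", "s3"], cached) == ["s3"]
assert pairs_to_label("red dress", ["s1", "s2", "s3"], cached, force_refresh=True) == ["s1", "s2", "s3"]
```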
  
  ### 2. Optional pooled build
  
  The framework also supports a heavier pooled build that combines:
  
  - top search results
  - top full-corpus reranker results
  
  Example:
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --search-depth 1000 \
    --rerank-depth 10000 \
    --annotate-search-top-k 100 \
    --annotate-rerank-top-k 120 \
    --language en
  ```
  
  This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
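
The pooling step amounts to an order-preserving union of the two candidate lists. A sketch under the assumption that deduplication keeps the first occurrence:

```python
# Annotation pool = search top-k plus full-corpus reranker top-k,
# deduplicated while preserving first-seen order.
def build_pool(search_top, rerank_top, annotate_search_top_k, annotate_rerank_top_k):
    pool, seen = [], set()
    for spu_id in search_top[:annotate_search_top_k] + rerank_top[:annotate_rerank_top_k]:
        if spu_id not in seen:
            seen.add(spu_id)
            pool.append(spu_id)
    return pool

pool = build_pool(["a", "b", "c"], ["b", "d"], 2, 2)
# pool == ["a", "b", "d"]
```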
  
  ## Why Single-Query Evaluation Was Slow
  
  If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
  
  - perform live search
  - detect recalled but unlabeled products
  - call the LLM to label them
  
  That is not the intended steady-state evaluation path.
  
  The UI/API is now configured to prefer cached evaluation:
  
  - default single-query evaluation uses `auto_annotate=false`
  - unlabeled recalled results are treated as `Irrelevant`
  - the response includes tips explaining that coverage gap
  
  If you want stable, fast evaluation:
  
  1. prebuild labels offline
  2. use cached single-query evaluation
  
  ## Web UI
  
  Start the evaluation UI:
  
  ```bash
  ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --host 127.0.0.1 \
    --port 6010
  ```
  
  The UI provides:
  
  - query list loaded from `queries.txt`
  - single-query evaluation
  - batch evaluation
  - history of batch reports
  - top recalled results
  - missed `Exact` and `Partial` products that were not recalled
  - tips about unlabeled hits treated as `Irrelevant`
  
  ### Single-query response behavior
  
  For a single query:
  
  1. live search returns recalled `spu_id` values
  2. the framework looks up cached labels by `(query, spu_id)`
  3. unlabeled recalled items are counted as `Irrelevant`
  4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`
  
  This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
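
Step 4 above is a set difference between the cached relevant products and the recalled list. A minimal sketch, with `cached_labels` as a per-query `spu_id -> label` map:

```python
# Missed Exact/Partial: cached relevant products for the query that the live
# search did not recall.
def missed_relevant(recalled_spu_ids, cached_labels):
    recalled = set(recalled_spu_ids)
    return sorted(
        spu_id
        for spu_id, label in cached_labels.items()
        if label in {"Exact", "Partial"} and spu_id not in recalled
    )

missed = missed_relevant(
    ["s1", "s4"],
    {"s1": "Exact", "s2": "Partial", "s3": "Irrelevant", "s4": "Exact", "s5": "Exact"},
)
# missed == ["s2", "s5"]
```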
  
  ## CLI Commands
  
  ### Build pooled annotation artifacts
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
  ```
  
  ### Run batch evaluation
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  ```
  
  Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
  
  ### Audit annotation quality
  
  ```bash
  ./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --labeler-mode simple
  ```
  
  This checks cached labels against current guardrails and reports suspicious cases.
  
  ## Batch Reports
  
  Each batch run stores:
  
  - aggregate metrics
  - per-query metrics
  - label distribution
  - timestamp
  - config snapshot from `/admin/config`
  
  Reports are written as:
  
  - Markdown for easy reading
  - JSON for downstream processing
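
A sketch of emitting the paired JSON/Markdown outputs. The field names here are illustrative, not the exact schema `eval_framework.py` writes:

```python
import json
from datetime import datetime, timezone

# Render one batch report as (json_text, markdown_text); field names assumed.
def render_reports(aggregate, per_query, label_counts, config_snapshot):
    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "aggregate": aggregate,
        "per_query": per_query,
        "label_distribution": label_counts,
        "config_snapshot": config_snapshot,
    }
    md_lines = ["# Batch Evaluation Report", ""]
    md_lines += [f"- {name}: {value:.4f}" for name, value in aggregate.items()]
    return json.dumps(report, indent=2), "\n".join(md_lines)

json_text, md_text = render_reports(
    {"P@5": 0.6, "MAP_3": 0.75},
    [{"query": "red dress", "P@5": 0.6}],
    {"Exact": 3, "Partial": 4, "Irrelevant": 43},
    {"fusion": {"translation_weight": 0.2}},
)
```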
  
  ## Fusion Tuning
  
  The tuning runner applies experiment configs sequentially and records the outcome.
  
  Example:
  
  ```bash
  ./.venv/bin/python scripts/evaluation/tune_fusion.py \
    --tenant-id 163 \
    --queries-file scripts/evaluation/queries/queries.txt \
    --top-k 50 \
    --language en \
    --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
    --score-metric MAP_3 \
    --apply-best
  ```
  
  What it does:
  
  1. writes an experiment config into `config/config.yaml`
  2. restarts backend
  3. runs batch evaluation
  4. stores the per-experiment result
  5. optionally applies the best experiment at the end
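
The five steps above can be sketched as an experiment loop. `apply_config`, `restart_backend`, and `run_batch_eval` are hypothetical stand-ins injected as callables, not the real functions in `tune_fusion.py`:

```python
# Sequentially apply each experiment config, evaluate, and track the best score.
def tune(experiments, run_batch_eval, apply_config, restart_backend,
         score_metric="MAP_3", apply_best=False):
    results = []
    for exp in experiments:
        apply_config(exp["config"])   # 1) write variant into config/config.yaml
        restart_backend()             # 2) restart so the config takes effect
        metrics = run_batch_eval()    # 3) batch evaluation against cached labels
        results.append({"name": exp["name"], "score": metrics[score_metric]})
    best = max(results, key=lambda r: r["score"])
    if apply_best:                    # 5) optionally re-apply the winner
        apply_config(next(e["config"] for e in experiments if e["name"] == best["name"]))
        restart_backend()
    return best, results

scores = {"a": {"MAP_3": 0.70}, "b": {"MAP_3": 0.75}}
state = {"cfg": None}
best, results = tune(
    [{"name": "a", "config": "cfg-a"}, {"name": "b", "config": "cfg-b"}],
    run_batch_eval=lambda: scores[state["cfg"][-1]],
    apply_config=lambda c: state.update(cfg=c),
    restart_backend=lambda: None,
    apply_best=True,
)
# best["name"] == "b" and the winning config is left applied
```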
  
  ## Current Practical Recommendation
  
  For day-to-day evaluation:
  
  1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
  2. run the web UI or normal batch evaluation in cached mode
  3. only force-refresh labels again when:
     - the query set changes
     - the product corpus changes materially
     - the labeling logic changes
  
  ## Caveats
  
  - The current label cache is query-specific, not a full all-products all-queries matrix.
  - Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
  - The backend can be briefly unstable immediately after a restart; scripts should wait a few seconds before issuing requests.
  - Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
  
  ## Related Requirement Docs
  
  - `README_Requirement.md`
  - `README_Requirement_zh.md`
  
  These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.