# Search Evaluation Framework

This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.

It is designed around one core rule:

- Annotation should be built offline first.
- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.

## Goals

The framework supports four related tasks:

1. Build an annotation set for a fixed query set.
2. Evaluate a live search result list against that annotation set.
3. Run batch evaluation and store historical reports with config snapshots.
4. Tune fusion parameters reproducibly.

## Files

- `eval_framework.py`
  Search evaluation core: implementation, CLI, FastAPI app, SQLite store, audit logic, and report generation.
- `build_annotation_set.py`
  Thin CLI entrypoint into `eval_framework.py`.
- `serve_eval_web.py`
  Thin web entrypoint into `eval_framework.py`.
- `tune_fusion.py`
  Fusion experiment runner. It applies config variants, restarts the backend, runs batch evaluation, and stores experiment reports.
- `fusion_experiments_shortlist.json`
  A compact experiment set for practical tuning.
- `fusion_experiments_round1.json`
  A broader first-round experiment set.
- `queries/queries.txt`
  The canonical evaluation query set.
- `README_Requirement.md`
  Requirement reference document.
- `quick_start_eval.sh`
  Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
- `../start_eval_web.sh`
  Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of the core services.

## Quick start (from repo root)

Set the tenant if needed (`export TENANT_ID=163`).
Requires a live search API, a DashScope key when the batch step needs new LLM labels, and a working backend.

```bash
# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
./scripts/evaluation/quick_start_eval.sh serve
# Or: ./scripts/service_ctl.sh start eval-web
```

Equivalent explicit commands:

```bash
# Safe default: no --force-refresh-labels
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple \
  --force-refresh-labels

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```

**Batch behavior:** There is no “skip queries already processed” mode. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
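The fill-missing behavior above can be sketched roughly as follows. This is a minimal illustration, not the actual `eval_framework.py` code: the `relevance_labels` table and its `(tenant_id, query_text, spu_id)` key come from the schema summary below, but the exact column set, the function name, and the demo schema are assumptions.

```python
import sqlite3


def missing_label_ids(conn, tenant_id, query_text, recalled_spu_ids, force_refresh=False):
    """Return the spu_ids that still need an LLM label for this query.

    With force_refresh=True every recalled id is re-labeled (batch-rebuild);
    otherwise only ids without a cached relevance_labels row go to the LLM.
    (Illustrative helper; not the real eval_framework.py API.)
    """
    if force_refresh:
        return list(recalled_spu_ids)
    rows = conn.execute(
        "SELECT spu_id FROM relevance_labels WHERE tenant_id = ? AND query_text = ?",
        (tenant_id, query_text),
    ).fetchall()
    cached = {r[0] for r in rows}
    return [s for s in recalled_spu_ids if s not in cached]


# Tiny in-memory demo with an assumed, minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relevance_labels (tenant_id, query_text, spu_id, label)")
conn.execute("INSERT INTO relevance_labels VALUES (163, 'red dress', 'A', 'Exact')")

print(missing_label_ids(conn, 163, "red dress", ["A", "B", "C"]))  # → ['B', 'C']
print(missing_label_ids(conn, 163, "red dress", ["A", "B"], force_refresh=True))  # → ['A', 'B']
```

This is why the default (no `--force-refresh-labels`) run is cheap on a warm cache: the LLM only sees recalled ids that have no cached row yet.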
## Storage Layout

All generated artifacts are under:

- `/data/saas-search/artifacts/search_evaluation`

Important subpaths:

- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
  Main cache and annotation store.
- `/data/saas-search/artifacts/search_evaluation/query_builds`
  Per-query pooled annotation-set build artifacts.
- `/data/saas-search/artifacts/search_evaluation/batch_reports`
  Batch evaluation JSON, Markdown reports, and config snapshots.
- `/data/saas-search/artifacts/search_evaluation/audits`
  Audit summaries for label quality checks.
- `/data/saas-search/artifacts/search_evaluation/tuning_runs`
  Fusion experiment summaries and per-experiment config snapshots.

## SQLite Schema Summary

The main tables in `search_eval.sqlite3` are:

- `corpus_docs`
  Cached product corpus for the tenant.
- `rerank_scores`
  Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
- `relevance_labels`
  Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
- `query_profiles`
  Structured query-intent profiles extracted before labeling.
- `build_runs`
  Per-query pooled-build records.
- `batch_runs`
  Batch evaluation history.

## Label Semantics

Three labels are used throughout:

- `Exact`
  Fully matches the intended product type and all explicit required attributes.
- `Partial`
  Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
- `Irrelevant`
  Product type mismatches, or explicit required attributes conflict.

The framework always uses:

- LLM-based batched relevance classification
- caching and retry logic for robust offline labeling

There are two labeler modes:

- `simple`
  Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
- `complex`
  Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.

## Offline-First Workflow

### 1. Refresh labels for the evaluation query set

For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.

Example (fills missing labels only; recommended default):

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.

This command does two things:

- runs **every** query in the file against the live backend (no skip list)
- with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM

After this step, single-query evaluation can run in cached mode without calling the LLM again.

### 2. Optional pooled build

The framework also supports a heavier pooled build that combines:

- top search results
- top full-corpus reranker results

Example:

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 1000 \
  --rerank-depth 10000 \
  --annotate-search-top-k 100 \
  --annotate-rerank-top-k 120 \
  --language en
```

This is slower, but useful when you want a richer pooled annotation set that goes beyond the current live recall window.

## Why Single-Query Evaluation Can Be Slow

If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means it will:

- perform a live search
- detect recalled but unlabeled products
- call the LLM to label them

That is not the intended steady-state evaluation path.
The UI/API is now configured to prefer cached evaluation:

- default single-query evaluation uses `auto_annotate=false`
- unlabeled recalled results are treated as `Irrelevant`
- the response includes tips explaining the coverage gap

If you want stable, fast evaluation:

1. prebuild labels offline
2. use cached single-query evaluation

## Web UI

Start the evaluation UI:

```bash
./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```

The UI provides:

- query list loaded from `queries.txt`
- single-query evaluation
- batch evaluation
- history of batch reports
- top recalled results
- missed `Exact` and `Partial` products that were not recalled
- tips about unlabeled hits treated as `Irrelevant`

### Single-query response behavior

For a single query:

1. live search returns recalled `spu_id` values
2. the framework looks up cached labels by `(query, spu_id)`
3. unlabeled recalled items are counted as `Irrelevant`
4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`

This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.

## CLI Commands

### Build pooled annotation artifacts

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
```

### Run batch evaluation

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
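The cached single-query semantics described above (unlabeled recalls counted as `Irrelevant`, cached `Exact`/`Partial` products that were not recalled listed as missed) can be sketched roughly like this. The function name and return shape are illustrative, not the actual `eval_framework.py` API, and counting both `Exact` and `Partial` as relevant for `P@k` is an assumption about how the metrics are defined.

```python
def evaluate_single_query(recalled_spu_ids, cached_labels):
    """Cached-mode evaluation for one query (illustrative, not the real API).

    cached_labels: {spu_id: "Exact" | "Partial" | "Irrelevant"} for this
    (tenant, query). Recalled ids without a cached label count as Irrelevant.
    """
    labeled = [(spu, cached_labels.get(spu, "Irrelevant")) for spu in recalled_spu_ids]
    unlabeled = [spu for spu in recalled_spu_ids if spu not in cached_labels]
    recalled = set(recalled_spu_ids)
    # Cached relevant products the live search failed to recall.
    missed = [
        spu for spu, label in cached_labels.items()
        if label in ("Exact", "Partial") and spu not in recalled
    ]

    def precision_at(k):
        # Assumption: both Exact and Partial count as relevant hits.
        top = [label for _, label in labeled[:k]]
        return sum(label in ("Exact", "Partial") for label in top) / k

    return {
        "results": labeled,
        "unlabeled_hits": unlabeled,          # surfaced as a coverage tip in the UI
        "missed_exact_partial": missed,
        "P@5": precision_at(5),
    }


cached = {"A": "Exact", "B": "Partial", "C": "Irrelevant", "Z": "Exact"}
report = evaluate_single_query(["A", "B", "C", "D", "E"], cached)
print(report["missed_exact_partial"])  # → ['Z']  (relevant but not recalled)
print(report["P@5"])                   # → 0.4    (A and B of the top 5)
```

Note how `D` and `E` drag `P@5` down even though they were never labeled: that is exactly the coverage gap the UI tips warn about, and why prebuilding labels offline matters.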
### Audit annotation quality

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

This checks cached labels against the current guardrails and reports suspicious cases.

## Batch Reports

Each batch run stores:

- aggregate metrics
- per-query metrics
- label distribution
- timestamp
- config snapshot from `/admin/config`

Reports are written as:

- Markdown for easy reading
- JSON for downstream processing

## Fusion Tuning

The tuning runner applies experiment configs sequentially and records the outcome of each one.

Example:

```bash
./.venv/bin/python scripts/evaluation/tune_fusion.py \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
  --score-metric MAP_3 \
  --apply-best
```

What it does:

1. writes an experiment config into `config/config.yaml`
2. restarts the backend
3. runs batch evaluation
4. stores the per-experiment result
5. optionally applies the best experiment at the end

## Current Practical Recommendation

For day-to-day evaluation:

1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
2. run the web UI or normal batch evaluation in cached mode
3. only force-refresh labels again when:
   - the query set changes
   - the product corpus changes materially
   - the labeling logic changes

## Caveats

- The current label cache is query-specific, not a full all-products-by-all-queries matrix.
- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary when scripting.
- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around the translation weight matters.

## Related Requirement Docs

- `README_Requirement.md`
- `README_Requirement_zh.md`

These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.
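The experiment-selection step of the tuning runner described under "Fusion Tuning" (score each experiment's batch report by `--score-metric`, then optionally apply the best) can be sketched as follows. The report dictionary shape and the function name are assumptions for illustration; the real per-experiment summaries under `tuning_runs/` may be structured differently.

```python
def pick_best_experiment(reports, score_metric="MAP_3"):
    """Pick the experiment whose batch report scores highest on score_metric.

    `reports` is an assumed list of {"name": ..., "metrics": {...}} dicts,
    one per experiment config applied by the tuning runner (illustrative,
    not the actual tune_fusion.py data model).
    """
    scored = [r for r in reports if score_metric in r.get("metrics", {})]
    if not scored:
        raise ValueError(f"no experiment reported {score_metric}")
    return max(scored, key=lambda r: r["metrics"][score_metric])


# Hypothetical experiment names and metric values, for illustration only.
reports = [
    {"name": "baseline", "metrics": {"MAP_3": 0.41, "P@10": 0.52}},
    {"name": "translation_weight_low", "metrics": {"MAP_3": 0.46, "P@10": 0.55}},
]
best = pick_best_experiment(reports, "MAP_3")
print(best["name"])  # → translation_weight_low
```

With `--apply-best`, the config of the winning experiment is what ends up written back at the end of the run; without it, the per-experiment reports are stored for manual comparison.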