# Search Evaluation Framework

This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.

It is designed around one core rule:

- Annotation should be built offline first.
- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.

## Goals

The framework supports four related tasks:

1. Build an annotation set for a fixed query set.
2. Evaluate a live search result list against that annotation set.
3. Run batch evaluation and store historical reports with config snapshots.
4. Tune fusion parameters reproducibly.

## Files

- `eval_framework.py`
  Search evaluation core: implementation, CLI, FastAPI app, SQLite store, audit logic, and report generation.
- `build_annotation_set.py`
  Thin CLI entrypoint into `eval_framework.py`.
- `serve_eval_web.py`
  Thin web entrypoint into `eval_framework.py`.
- `tune_fusion.py`
  Fusion experiment runner. It applies config variants, restarts the backend, runs batch evaluation, and stores experiment reports.
- `fusion_experiments_shortlist.json`
  A compact experiment set for practical tuning.
- `fusion_experiments_round1.json`
  A broader first-round experiment set.
- `queries/queries.txt`
  The canonical evaluation query set.
- `README_Requirement.md`
  Requirement reference document.
- `quick_start_eval.sh`
  Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
- `../start_eval_web.sh`
  Same as `serve` but loads `activate.sh`; use with `./scripts/service_ctl.sh start eval-web` (port **`EVAL_WEB_PORT`**, default **6010**). `./run.sh all` starts **eval-web** with the rest of the core services.

## Quick start (from repo root)

Set the tenant if needed (`export TENANT_ID=163`).
Requires a live search API, a DashScope key when the batch step needs new LLM labels, and a working backend.

```bash
# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# 2) Evaluation UI on http://127.0.0.1:6010/ (override with EVAL_WEB_PORT / EVAL_WEB_HOST)
./scripts/evaluation/quick_start_eval.sh serve
# Or: ./scripts/service_ctl.sh start eval-web
```

Equivalent explicit commands:

```bash
# Safe default: no --force-refresh-labels
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple \
  --force-refresh-labels

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```

**Batch behavior:** There is no “skip queries already processed” mode. Each run walks the full queries file. With `--force-refresh-labels`, for **every** query the runner issues a live search and sends **all** `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
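The fill-missing behavior above can be sketched roughly as follows. This is a minimal illustration, not the actual `eval_framework.py` code: the `relevance_labels` table and its `(tenant_id, query_text, spu_id)` key come from the schema summary below, but the exact column set, the function name, and the demo schema are assumptions.

```python
import sqlite3


def missing_label_ids(conn, tenant_id, query_text, recalled_spu_ids, force_refresh=False):
    """Return the spu_ids that still need an LLM label for this query.

    With force_refresh=True every recalled id is re-labeled (batch-rebuild);
    otherwise only ids without a cached relevance_labels row go to the LLM.
    (Illustrative helper; not the real eval_framework.py API.)
    """
    if force_refresh:
        return list(recalled_spu_ids)
    rows = conn.execute(
        "SELECT spu_id FROM relevance_labels WHERE tenant_id = ? AND query_text = ?",
        (tenant_id, query_text),
    ).fetchall()
    cached = {r[0] for r in rows}
    return [s for s in recalled_spu_ids if s not in cached]


# Tiny in-memory demo with an assumed, minimal schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relevance_labels (tenant_id, query_text, spu_id, label)")
conn.execute("INSERT INTO relevance_labels VALUES (163, 'red dress', 'A', 'Exact')")

print(missing_label_ids(conn, 163, "red dress", ["A", "B", "C"]))  # → ['B', 'C']
print(missing_label_ids(conn, 163, "red dress", ["A", "B"], force_refresh=True))  # → ['A', 'B']
```

This is why the default (no `--force-refresh-labels`) run is cheap on a warm cache: the LLM only sees recalled ids that have no cached row yet.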
## Storage Layout

All generated artifacts are under:

- `/data/saas-search/artifacts/search_evaluation`

Important subpaths:

- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`
  Main cache and annotation store.
- `/data/saas-search/artifacts/search_evaluation/query_builds`
  Per-query pooled annotation-set build artifacts.
- `/data/saas-search/artifacts/search_evaluation/batch_reports`
  Batch evaluation JSON, Markdown reports, and config snapshots.
- `/data/saas-search/artifacts/search_evaluation/audits`
  Audit summaries for label quality checks.
- `/data/saas-search/artifacts/search_evaluation/tuning_runs`
  Fusion experiment summaries and per-experiment config snapshots.

## SQLite Schema Summary

The main tables in `search_eval.sqlite3` are:

- `corpus_docs`
  Cached product corpus for the tenant.
- `rerank_scores`
  Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
- `relevance_labels`
  Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
- `query_profiles`
  Structured query-intent profiles extracted before labeling.
- `build_runs`
  Per-query pooled-build records.
- `batch_runs`
  Batch evaluation history.

## Label Semantics

Three labels are used throughout:

- `Exact`
  Fully matches the intended product type and all explicit required attributes.
- `Partial`
  Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
- `Irrelevant`
  Product type mismatches, or explicit required attributes conflict.

The framework always uses:

- LLM-based batched relevance classification
- caching and retry logic for robust offline labeling

There are two labeler modes:

- `simple`
  Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
- `complex`
  Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.

## Offline-First Workflow

### 1. Refresh labels for the evaluation query set

For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.

Example (fills missing labels only; recommended default):

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

To **rebuild** every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.

This command does two things:

- runs **every** query in the file against the live backend (no skip list)
- with `--force-refresh-labels`, re-labels **all** `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM

After this step, single-query evaluation can run in cached mode without calling the LLM again.

### 2. Optional pooled build

The framework also supports a heavier pooled build that combines:

- top search results
- top full-corpus reranker results

Example:

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 1000 \
  --rerank-depth 10000 \
  --annotate-search-top-k 100 \
  --annotate-rerank-top-k 120 \
  --language en
```

This is slower, but useful when you want a richer pooled annotation set that goes beyond the current live recall window.

## Why Single-Query Evaluation Can Be Slow

If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means it will:

- perform a live search
- detect recalled but unlabeled products
- call the LLM to label them

That is not the intended steady-state evaluation path.
The UI/API is now configured to prefer cached evaluation:

- default single-query evaluation uses `auto_annotate=false`
- unlabeled recalled results are treated as `Irrelevant`
- the response includes tips explaining the coverage gap

If you want stable, fast evaluation:

1. prebuild labels offline
2. use cached single-query evaluation

## Web UI

Start the evaluation UI:

```bash
./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```

The UI provides:

- query list loaded from `queries.txt`
- single-query evaluation
- batch evaluation
- history of batch reports
- top recalled results
- missed `Exact` and `Partial` products that were not recalled
- tips about unlabeled hits treated as `Irrelevant`

### Single-query response behavior

For a single query:

1. live search returns recalled `spu_id` values
2. the framework looks up cached labels by `(query, spu_id)`
3. unlabeled recalled items are counted as `Irrelevant`
4. cached `Exact` and `Partial` products that were not recalled are listed under `Missed Exact / Partial`

This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.

## CLI Commands

### Build pooled annotation artifacts

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
```

### Run batch evaluation

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
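The cached single-query semantics described above (unlabeled recalls counted as `Irrelevant`, cached `Exact`/`Partial` products that were not recalled listed as missed) can be sketched roughly like this. The function name and return shape are illustrative, not the actual `eval_framework.py` API, and counting both `Exact` and `Partial` as relevant for `P@k` is an assumption about how the metrics are defined.

```python
def evaluate_single_query(recalled_spu_ids, cached_labels):
    """Cached-mode evaluation for one query (illustrative, not the real API).

    cached_labels: {spu_id: "Exact" | "Partial" | "Irrelevant"} for this
    (tenant, query). Recalled ids without a cached label count as Irrelevant.
    """
    labeled = [(spu, cached_labels.get(spu, "Irrelevant")) for spu in recalled_spu_ids]
    unlabeled = [spu for spu in recalled_spu_ids if spu not in cached_labels]
    recalled = set(recalled_spu_ids)
    # Cached relevant products the live search failed to recall.
    missed = [
        spu for spu, label in cached_labels.items()
        if label in ("Exact", "Partial") and spu not in recalled
    ]

    def precision_at(k):
        # Assumption: both Exact and Partial count as relevant hits.
        top = [label for _, label in labeled[:k]]
        return sum(label in ("Exact", "Partial") for label in top) / k

    return {
        "results": labeled,
        "unlabeled_hits": unlabeled,          # surfaced as a coverage tip in the UI
        "missed_exact_partial": missed,
        "P@5": precision_at(5),
    }


cached = {"A": "Exact", "B": "Partial", "C": "Irrelevant", "Z": "Exact"}
report = evaluate_single_query(["A", "B", "C", "D", "E"], cached)
print(report["missed_exact_partial"])  # → ['Z']  (relevant but not recalled)
print(report["P@5"])                   # → 0.4    (A and B of the top 5)
```

Note how `D` and `E` drag `P@5` down even though they were never labeled: that is exactly the coverage gap the UI tips warn about, and why prebuilding labels offline matters.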
### Audit annotation quality

```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```

This checks cached labels against the current guardrails and reports suspicious cases.

## Batch Reports

Each batch run stores:

- aggregate metrics
- per-query metrics
- label distribution
- timestamp
- config snapshot from `/admin/config`

Reports are written as:

- Markdown for easy reading
- JSON for downstream processing

## Fusion Tuning

The tuning runner applies experiment configs sequentially and records the outcome of each one.

Example:

```bash
./.venv/bin/python scripts/evaluation/tune_fusion.py \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
  --score-metric MAP_3 \
  --apply-best
```

What it does:

1. writes an experiment config into `config/config.yaml`
2. restarts the backend
3. runs batch evaluation
4. stores the per-experiment result
5. optionally applies the best experiment at the end

## Current Practical Recommendation

For day-to-day evaluation:

1. refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
2. run the web UI or normal batch evaluation in cached mode
3. only force-refresh labels again when:
   - the query set changes
   - the product corpus changes materially
   - the labeling logic changes

## Caveats

- The current label cache is query-specific, not a full all-products-by-all-queries matrix.
- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary when scripting.
- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around the translation weight matters.

## Related Requirement Docs

- `README_Requirement.md`
- `README_Requirement_zh.md`

These documents describe the original problem statement. This `README.md` describes the implemented framework and the current recommended workflow.
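The experiment-selection step of the tuning runner described under "Fusion Tuning" (score each experiment's batch report by `--score-metric`, then optionally apply the best) can be sketched as follows. The report dictionary shape and the function name are assumptions for illustration; the real per-experiment summaries under `tuning_runs/` may be structured differently.

```python
def pick_best_experiment(reports, score_metric="MAP_3"):
    """Pick the experiment whose batch report scores highest on score_metric.

    `reports` is an assumed list of {"name": ..., "metrics": {...}} dicts,
    one per experiment config applied by the tuning runner (illustrative,
    not the actual tune_fusion.py data model).
    """
    scored = [r for r in reports if score_metric in r.get("metrics", {})]
    if not scored:
        raise ValueError(f"no experiment reported {score_metric}")
    return max(scored, key=lambda r: r["metrics"][score_metric])


# Hypothetical experiment names and metric values, for illustration only.
reports = [
    {"name": "baseline", "metrics": {"MAP_3": 0.41, "P@10": 0.52}},
    {"name": "translation_weight_low", "metrics": {"MAP_3": 0.46, "P@10": 0.55}},
]
best = pick_best_experiment(reports, "MAP_3")
print(best["name"])  # → translation_weight_low
```

With `--apply-best`, the config of the winning experiment is what ends up written back at the end of the run; without it, the per-experiment reports are stored for manual comparison.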