Search Evaluation Framework

This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.

Design: labels are built offline for a fixed query set (queries/queries.txt). Single-query and batch evaluation map recalled spu_id values to the SQLite cache; items without cached labels are scored as Irrelevant, and the UI/API surfaces tips when coverage is incomplete.
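
A minimal sketch of that Irrelevant fallback, assuming a labels table keyed by (tenant_id, query, spu_id) as noted under Caveats (the table and column names here are illustrative, not the actual schema):

import sqlite3

def label_for_hit(db: sqlite3.Connection, tenant_id: int, query: str, spu_id: str) -> str:
    # "relevance_labels" is a hypothetical name; the real schema lives in
    # eval_framework's SQLite store (search_eval.sqlite3).
    row = db.execute(
        "SELECT label FROM relevance_labels"
        " WHERE tenant_id = ? AND query = ? AND spu_id = ?",
        (tenant_id, query, spu_id),
    ).fetchone()
    return row[0] if row else "Irrelevant"  # uncached hits score as Irrelevant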

What it does

  1. Build an annotation set for a fixed query set.
  2. Evaluate live search results against cached labels.
  3. Run batch evaluation and keep historical reports with config snapshots.
  4. Tune fusion parameters in a reproducible loop.

Layout

| Path | Role |
| --- | --- |
| eval_framework/ | Package: orchestration, SQLite store, search/rerank/LLM clients, prompts, metrics, reports, web UI (static/), CLI |
| build_annotation_set.py | CLI entry (build / batch / audit) |
| serve_eval_web.py | Web server for the evaluation UI |
| tune_fusion.py | Applies config variants, restarts the backend, runs batch eval, stores experiment reports |
| fusion_experiments_shortlist.json | Compact experiment set for tuning |
| fusion_experiments_round1.json | Broader first-round experiments |
| queries/queries.txt | Canonical evaluation queries |
| README_Requirement.md | Product/requirements reference |
| start_eval.sh | Wrapper: batch, batch-rebuild (deep build + --force-refresh-labels), or serve |
| ../start_eval_web.sh | Same as serve, with activate.sh sourced; or use ./scripts/service_ctl.sh start eval-web (default port 6010, override with EVAL_WEB_PORT). ./run.sh all includes eval-web. |

Quick start (repo root)

Set the tenant if needed (export TENANT_ID=163). You need a live search API, a running backend, and DashScope credentials whenever new LLM labels must be produced.

# Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
./scripts/evaluation/start_eval.sh batch

# Deep rebuild: per-query full corpus rerank (outside search top-500 pool) + LLM in 50-doc batches along global sort order (early stop; expensive)
./scripts/evaluation/start_eval.sh batch-rebuild

# UI: http://127.0.0.1:6010/
./scripts/evaluation/start_eval.sh serve
# or: ./scripts/service_ctl.sh start eval-web

Explicit equivalents:

./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 500 \
  --rerank-depth 10000 \
  --force-refresh-rerank \
  --force-refresh-labels \
  --language en \
  --labeler-mode simple

./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010

Each batch run walks the full queries file and writes a batch report under batch_reports/. With batch --force-refresh-labels, every live top-k hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
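
The refresh semantics, sketched (client and helper names are hypothetical, not the eval_framework API):

def judge_live_hits(db, llm, tenant_id, query, hits, force_refresh=False):
    """Label live top-k hits; only uncached (tenant_id, query, spu_id) pairs hit the LLM."""
    for hit in hits:
        if not force_refresh and cached_label(db, tenant_id, query, hit.spu_id) is not None:
            continue  # cached judgment is reused, no LLM call
        label = llm.judge(query, hit)  # one judging call per uncached hit
        store_label(db, tenant_id, query, hit.spu_id, label)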

start_eval.sh batch-rebuild (deep annotation rebuild)

This runs build_annotation_set.py build with --force-refresh-labels and --force-refresh-rerank (see the explicit command block above). It does not run the batch subcommand, so there is no aggregate batch report for this step; outputs are per-query JSON under query_builds/ plus updates in search_eval.sqlite3.

For each query in queries.txt, in order:

  1. Search recall — Call the live search API with size = max(--search-depth, --search-recall-top-k) (the wrapper uses --search-depth 500). The first 500 hits form the recall pool; they are treated as rerank score 1 and are not sent to the reranker.
  2. Full corpus — Load the tenant’s product corpus from Elasticsearch (same tenant as TENANT_ID / --tenant-id, default 163), via corpus_docs() (cached in SQLite after the first load).
  3. Rerank outside pool — Every corpus document whose spu_id is not in the pool is scored by the reranker API, 80 documents per request. With --force-refresh-rerank, all those scores are recomputed and written to the rerank_scores table in search_eval.sqlite3. Without that flag, existing (tenant_id, query, spu_id) scores are reused and only missing rows hit the API.
  4. Skip “too easy” queries — If more than 1000 non-pool documents have rerank score > 0.5, that query is skipped (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
  5. Global sort — Order to label: pool in search rank order, then all remaining corpus docs in descending rerank score (dedupe by spu_id, pool wins).
  6. LLM labeling — Walk that list from the head in batches of 50 (not “take top-K then label only K”): each batch logs exact_ratio and irrelevant_ratio. After at least 20 batches, stop when 3 consecutive batches have irrelevant_ratio > 92%; never more than 40 batches (2000 docs max per query). So labeling follows the best-first order but stops early, and the tail of the sorted list may never be judged (see the sketch below).
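
A condensed sketch of steps 5–6, with the thresholds above hard-coded as defaults; the real loop lives in eval_framework, and all names here are illustrative:

def rebuild_labels(pool, non_pool_docs, judge_batch, batch_size=50,
                   min_batches=20, max_batches=40,
                   stop_ratio=0.92, stop_streak=3):
    """Label docs best-first in fixed-size batches; stop early on a long irrelevant tail."""
    # Global sort: pool in search rank order, then the remaining corpus docs in
    # descending rerank score. Dedupe by spu_id with the pool winning.
    pool_ids = {d.spu_id for d in pool}
    tail = sorted((d for d in non_pool_docs if d.spu_id not in pool_ids),
                  key=lambda d: d.rerank_score, reverse=True)
    ordered = list(pool) + tail

    streak = 0
    total_batches = min(max_batches, (len(ordered) + batch_size - 1) // batch_size)
    for i in range(total_batches):
        batch = ordered[i * batch_size:(i + 1) * batch_size]
        labels = judge_batch(batch)  # one LLM pass; exact_ratio/irrelevant_ratio are logged
        irrelevant_ratio = sum(l == "Irrelevant" for l in labels) / len(labels)
        streak = streak + 1 if irrelevant_ratio > stop_ratio else 0
        if i + 1 >= min_batches and streak >= stop_streak:
            break  # three mostly-irrelevant batches in a row: stop; the tail stays unjudged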

Incremental pool (no full rebuild): build_annotation_set.py build without --force-refresh-labels uses the older windowed pool (--annotate-search-top-k, --annotate-rerank-top-k) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.

Tuning the rebuild path (all flags on build; see eval_framework/cli.py): --search-recall-top-k, --rerank-high-threshold, --rerank-high-skip-count, --rebuild-llm-batch-size, --rebuild-min-batches, --rebuild-max-batches, --rebuild-irrelevant-stop-ratio, --rebuild-irrelevant-stop-streak. The rerank API chunk size is fixed at 80 docs per request in code (full_corpus_rerank_outside_exclude); a sketch of that chunking follows.
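
A sketch of step 3's chunked scoring with cache reuse (client and helper names are hypothetical; only the 80-doc chunk size and the (tenant_id, query, spu_id) reuse rule come from this README):

def rerank_outside_pool(db, reranker, tenant_id, query, corpus, pool_ids,
                        chunk_size=80, force_refresh=False):
    """Score every non-pool corpus doc, 80 docs per rerank request."""
    todo = []
    for doc in corpus:
        if doc.spu_id in pool_ids:
            continue  # pool docs are pinned to score 1 and never sent to the reranker
        if not force_refresh and cached_score(db, tenant_id, query, doc.spu_id) is not None:
            continue  # existing (tenant_id, query, spu_id) rows are reused
        todo.append(doc)
    for i in range(0, len(todo), chunk_size):
        chunk = todo[i:i + chunk_size]
        scores = reranker.score(query, chunk)  # one API request per chunk
        save_scores(db, tenant_id, query, chunk, scores)  # persists into rerank_scores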

Artifacts

Default root: artifacts/search_evaluation/

  • search_eval.sqlite3 — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata
  • query_builds/ — per-query pooled build outputs
  • batch_reports/ — batch JSON, Markdown, config snapshots
  • audits/ — label-quality audit summaries
  • tuning_runs/ — fusion experiment outputs and config snapshots
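
To inspect the cache without guessing at the schema, enumerate it with the standard sqlite3 module:

import sqlite3

db = sqlite3.connect("artifacts/search_evaluation/search_eval.sqlite3")
# List every table the framework has created (corpus cache, rerank_scores,
# labels, run metadata, ...) together with its row count.
for (name,) in db.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    count = db.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    print(f"{name}: {count} rows")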

Labels

  • Exact — Matches intended product type and all explicit required attributes.
  • Partial — Main intent matches; attributes missing, approximate, or weaker.
  • Irrelevant — Type mismatch or conflicting required attributes.

Labeler modes:

  • simple (default): one judging pass per batch with the standard relevance prompt.
  • complex: query-profile extraction plus extra guardrails (for structured experiments).

Flows

Standard: Run batch without --force-refresh-labels to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to no auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as Irrelevant.

Rebuild vs incremental build: Deep rebuild is documented in the batch-rebuild subsection above. Incremental build (without --force-refresh-labels) uses --annotate-search-top-k / --annotate-rerank-top-k windows instead.

Fusion tuning: tune_fusion.py writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see --experiments-file, --score-metric, --apply-best).
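
The tuning loop in outline (function names are illustrative; see tune_fusion.py for the actual flags and report layout):

def tune(experiments, score_metric, apply_best=False):
    """Apply each config variant, restart the backend, batch-evaluate, keep the best."""
    results = []
    for exp in experiments:              # loaded from --experiments-file
        apply_config(exp)                # write this fusion variant
        restart_backend()                # allow a short settle time (see Caveats)
        report = run_batch_eval()        # same pipeline as start_eval.sh batch
        save_experiment_report(exp, report)  # lands under tuning_runs/
        results.append((report[score_metric], exp))
    best_score, best_exp = max(results, key=lambda r: r[0])  # assumes higher is better
    if apply_best:                       # --apply-best
        apply_config(best_exp)
        restart_backend()
    return best_exp, best_score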

Audit

./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

Web UI

Features: query list from queries.txt, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.

Batch reports

Each run stores aggregate and per-query metrics, label distribution, timestamp, and an /admin/config snapshot, as Markdown and JSON under batch_reports/.
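
To pull the latest run without the UI (only the documented top-level contents are assumed; adjust key names to the actual JSON):

import json
from pathlib import Path

reports = sorted(Path("artifacts/search_evaluation/batch_reports").glob("*.json"))
latest = json.loads(reports[-1].read_text())
# Expect aggregate and per-query metrics, a label distribution, a timestamp,
# and an /admin/config snapshot among the top-level keys.
print(sorted(latest.keys()))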

Caveats

  • Labels are keyed by (tenant_id, query, spu_id), not a full corpus×query matrix.
  • Single-query evaluation still needs live search for recall; LLM calls are avoided when labels exist.
  • Backend restarts in automated tuning may need a short settle time before requests (see the readiness sketch after this list).
  • README_Requirement.md, README_Requirement_zh.md — requirements background; this file describes the implemented stack and how to run it.
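
A minimal readiness poll for that settle time, reusing the /admin/config endpoint the batch reports already snapshot (the base URL and port are placeholders, not documented defaults):

import time
import urllib.request

def wait_for_backend(url="http://127.0.0.1:8000/admin/config", timeout=30.0):
    """Poll until the backend answers again after a restart."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0):
                return True              # backend is serving requests
        except OSError:
            time.sleep(0.5)              # not up yet; retry shortly
    return False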