# Search Evaluation Framework
This directory contains the offline annotation set builder, the online evaluation UI/API, the audit tooling, and the fusion-tuning runner for retrieval quality evaluation.
It is designed around one core rule:

- Annotation should be built offline first.
- Single-query evaluation should then map recalled `spu_id` values to the cached annotation set.
- Recalled items without cached labels are treated as `Irrelevant` during evaluation, and the UI/API returns a tip so the operator knows coverage is incomplete.
## Goals
The framework supports four related tasks:
- Build an annotation set for a fixed query set.
- Evaluate a live search result list against that annotation set.
- Run batch evaluation and store historical reports with config snapshots.
- Tune fusion parameters reproducibly.
## Files

- `eval_framework.py`: Search evaluation core implementation, CLI, FastAPI app, SQLite store, audit logic, and report generation.
- `build_annotation_set.py`: Thin CLI entrypoint into `eval_framework.py`.
- `serve_eval_web.py`: Thin web entrypoint into `eval_framework.py`.
- `tune_fusion.py`: Fusion experiment runner. It applies config variants, restarts the backend, runs batch evaluation, and stores experiment reports.
- `fusion_experiments_shortlist.json`: A compact experiment set for practical tuning.
- `fusion_experiments_round1.json`: A broader first-round experiment set.
- `queries/queries.txt`: The canonical evaluation query set.
- `README_Requirement.md`: Requirement reference document.
- `quick_start_eval.sh`: Optional wrapper: `batch` (fill missing labels only), `batch-rebuild` (force full re-label), or `serve` (web UI).
## Quick start (from repo root)
Set the tenant if needed (`export TENANT_ID=163`). This requires the live search API, a DashScope key when the batch step needs new LLM labels, and a working backend.
```bash
# 1) Batch evaluation: every query in the file gets a live search; only uncached (query, spu_id) pairs call the LLM
./scripts/evaluation/quick_start_eval.sh batch

# Optional: full re-label of current top_k recall (expensive; use only when you intentionally rebuild the cache)
./scripts/evaluation/quick_start_eval.sh batch-rebuild

# 2) Evaluation UI on http://127.0.0.1:6010/
./scripts/evaluation/quick_start_eval.sh serve
```
Equivalent explicit commands:
```bash
# Safe default: no --force-refresh-labels
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple

# Rebuild all labels for recalled top_k (same as quick_start_eval.sh batch-rebuild)
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple \
  --force-refresh-labels

# Evaluation web UI
./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id "${TENANT_ID:-163}" \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```
**Batch behavior:** there is no "skip queries already processed" step. Each run walks the full queries file. With `--force-refresh-labels`, for every query the runner issues a live search and sends all `top_k` returned `spu_id`s through the LLM again (SQLite rows are upserted). Omit `--force-refresh-labels` if you only want to fill in labels that are missing for the current recall window.
## Storage Layout
All generated artifacts are under:

```
/data/saas-search/artifacts/search_evaluation
```

Important subpaths:

- `/data/saas-search/artifacts/search_evaluation/search_eval.sqlite3`: Main cache and annotation store.
- `/data/saas-search/artifacts/search_evaluation/query_builds`: Per-query pooled annotation-set build artifacts.
- `/data/saas-search/artifacts/search_evaluation/batch_reports`: Batch evaluation JSON, Markdown reports, and config snapshots.
- `/data/saas-search/artifacts/search_evaluation/audits`: Audit summaries for label quality checks.
- `/data/saas-search/artifacts/search_evaluation/tuning_runs`: Fusion experiment summaries and per-experiment config snapshots.
## SQLite Schema Summary
The main tables in `search_eval.sqlite3` are:

- `corpus_docs`: Cached product corpus for the tenant.
- `rerank_scores`: Cached full-corpus reranker scores keyed by `(tenant_id, query_text, spu_id)`.
- `relevance_labels`: Cached LLM relevance labels keyed by `(tenant_id, query_text, spu_id)`.
- `query_profiles`: Structured query-intent profiles extracted before labeling.
- `build_runs`: Per-query pooled-build records.
- `batch_runs`: Batch evaluation history.
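As a sketch of how the `(tenant_id, query_text, spu_id)` key is used, a cached-label lookup for one query's recalled ids might look like the following (the column layout here is an assumption for illustration, not the framework's exact DDL):

```python
import sqlite3

# Illustrative table shape; the real relevance_labels DDL may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relevance_labels (tenant_id, query_text, spu_id, label)")
conn.executemany(
    "INSERT INTO relevance_labels VALUES (?, ?, ?, ?)",
    [(163, "red dress", "A", "Exact"), (163, "red dress", "B", "Partial")],
)

def cached_labels(conn, tenant_id, query_text, spu_ids):
    """Return {spu_id: label} for the recalled ids that have a cached label."""
    placeholders = ",".join("?" for _ in spu_ids)
    rows = conn.execute(
        "SELECT spu_id, label FROM relevance_labels "
        "WHERE tenant_id = ? AND query_text = ? "
        f"AND spu_id IN ({placeholders})",
        (tenant_id, query_text, *spu_ids),
    )
    return dict(rows)

labels = cached_labels(conn, 163, "red dress", ["A", "B", "C"])  # "C" is uncached
```

Ids without a cached row simply do not appear in the result, which is what lets evaluation default them to `Irrelevant`.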
## Label Semantics
Three labels are used throughout:
- `Exact`: Fully matches the intended product type and all explicit required attributes.
- `Partial`: Main intent matches, but explicit attributes are missing, approximate, or weaker than requested.
- `Irrelevant`: Product type mismatches, or explicit required attributes conflict.
The framework always uses:
- LLM-based batched relevance classification
- caching and retry logic for robust offline labeling
There are now two labeler modes:
- `simple`: Default. A single low-coupling LLM judging pass per batch, using the standard relevance prompt.
- `complex`: Legacy structured mode. It extracts query profiles and applies extra guardrails. Kept for comparison, but no longer the default.
## Offline-First Workflow
### 1. Refresh labels for the evaluation query set
For practical evaluation, the most important offline step is to pre-label the result window you plan to score. For the current metrics (`P@5`, `P@10`, `P@20`, `P@50`, `MAP_3`, `MAP_2_3`), a `top_k=50` cached label set is sufficient.
Example (fills missing labels only; recommended default):
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```
To rebuild every label for the current `top_k` recall window (all queries, all hits re-sent to the LLM), add `--force-refresh-labels` or run `./scripts/evaluation/quick_start_eval.sh batch-rebuild`.
This command does two things:

- runs every query in the file against the live backend (no skip list)
- with `--force-refresh-labels`, re-labels all `top_k` hits per query via the LLM and upserts SQLite; without the flag, only `spu_id`s lacking a cached label are sent to the LLM
After this step, single-query evaluation can run in cached mode without calling the LLM again.
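With labels cached, the ranking metrics reduce to pure list arithmetic. As a sketch (the exact label-to-relevance mapping behind `MAP_3` and `MAP_2_3` is not spelled out in this README, so the `relevant` set below is a caller-supplied assumption):

```python
def precision_at_k(labels, k, relevant=("Exact", "Partial")):
    """P@k over an ordered label list; which labels count as relevant is a
    caller choice (stand-in for whatever MAP_3 / MAP_2_3 actually use)."""
    return sum(1 for label in labels[:k] if label in relevant) / k

def average_precision(labels, relevant=("Exact", "Partial")):
    """AP over one ranked list: mean of P@i taken at each relevant rank i."""
    hits, total = 0, 0.0
    for i, label in enumerate(labels, start=1):
        if label in relevant:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

ranked = ["Exact", "Irrelevant", "Partial", "Irrelevant", "Exact"]
p5 = precision_at_k(ranked, 5)   # 3/5 = 0.6
ap = average_precision(ranked)   # (1/1 + 2/3 + 3/5) / 3
```

Averaging `average_precision` over every query in `queries.txt` gives a MAP-style aggregate.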
### 2. Optional pooled build
The framework also supports a heavier pooled build that combines:
- top search results
- top full-corpus reranker results
Example:
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py build \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --search-depth 1000 \
  --rerank-depth 10000 \
  --annotate-search-top-k 100 \
  --annotate-rerank-top-k 120 \
  --language en
```
This is slower, but useful when you want a richer pooled annotation set beyond the current live recall window.
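Conceptually, the pooled build is a union of two ranked candidate lists, deduplicated by `spu_id`. A minimal sketch (function name and list shapes are illustrative, not the framework's API):

```python
def pool_candidates(search_hits, rerank_hits, search_top_k, rerank_top_k):
    """Union the top search hits and top full-corpus reranker hits into one
    annotation pool, preserving first-seen order and dropping duplicates."""
    pool, seen = [], set()
    for spu_id in search_hits[:search_top_k] + rerank_hits[:rerank_top_k]:
        if spu_id not in seen:
            seen.add(spu_id)
            pool.append(spu_id)
    return pool

# "B" appears in both lists but is annotated only once.
pool = pool_candidates(["A", "B", "C"], ["B", "D"], search_top_k=2, rerank_top_k=2)
```

The pooled ids are then what gets sent for annotation, which is why this path is heavier than the live recall window.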
## Why Single-Query Evaluation Was Slow
If single-query evaluation is slow, the usual reason is that it is still running with `auto_annotate=true`, which means:
- perform live search
- detect recalled but unlabeled products
- call the LLM to label them
That is not the intended steady-state evaluation path.
The UI/API is now configured to prefer cached evaluation:

- default single-query evaluation uses `auto_annotate=false`
- unlabeled recalled results are treated as `Irrelevant`
- the response includes a tip explaining the coverage gap
If you want stable, fast evaluation:
- prebuild labels offline
- use cached single-query evaluation
## Web UI
Start the evaluation UI:
```bash
./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --host 127.0.0.1 \
  --port 6010
```
The UI provides:

- the query list loaded from `queries.txt`
- single-query evaluation
- batch evaluation
- history of batch reports
- top recalled results
- missed `Exact` and `Partial` products that were not recalled
- tips about unlabeled hits treated as `Irrelevant`
### Single-query response behavior
For a single query:

- live search returns recalled `spu_id` values
- the framework looks up cached labels by `(query, spu_id)`
- unlabeled recalled items are counted as `Irrelevant`
- cached `Exact` and `Partial` products that were not recalled are listed under **Missed Exact / Partial**
This makes the page useful as a real retrieval-evaluation view rather than only a search-result viewer.
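The steps above can be sketched in a few lines (purely illustrative; `evaluate_single_query` and its data shapes are not the framework's real API):

```python
def evaluate_single_query(recalled, cached):
    """recalled: ordered spu_ids from live search; cached: {spu_id: label}.

    Unlabeled recalled items default to Irrelevant; cached Exact/Partial
    products that were not recalled are reported as missed.
    """
    scored = [(spu_id, cached.get(spu_id, "Irrelevant")) for spu_id in recalled]
    unlabeled = [s for s in recalled if s not in cached]
    missed = [s for s, label in cached.items()
              if label in ("Exact", "Partial") and s not in recalled]
    return {"scored": scored, "missed": missed, "unlabeled_hits": unlabeled}

result = evaluate_single_query(
    recalled=["A", "B", "X"],            # "X" has no cached label
    cached={"A": "Exact", "B": "Partial", "C": "Exact"},  # "C" was not recalled
)
```

Here `X` is scored as `Irrelevant` and flagged as an unlabeled hit (the coverage tip), while `C` surfaces under Missed Exact / Partial.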
## CLI Commands
### Build pooled annotation artifacts
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py build ...
```
### Run batch evaluation
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```
Use `--force-refresh-labels` if you want to rebuild the offline label cache for the recalled window first.
### Audit annotation quality
```bash
./.venv/bin/python scripts/evaluation/build_annotation_set.py audit \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --labeler-mode simple
```
This checks cached labels against current guardrails and reports suspicious cases.
## Batch Reports
Each batch run stores:
- aggregate metrics
- per-query metrics
- label distribution
- timestamp
- config snapshot from `/admin/config`
Reports are written as:
- Markdown for easy reading
- JSON for downstream processing
## Fusion Tuning
The tuning runner applies experiment configs sequentially and records the outcome.
Example:
```bash
./.venv/bin/python scripts/evaluation/tune_fusion.py \
  --tenant-id 163 \
  --queries-file scripts/evaluation/queries/queries.txt \
  --top-k 50 \
  --language en \
  --experiments-file scripts/evaluation/fusion_experiments_shortlist.json \
  --score-metric MAP_3 \
  --apply-best
```
What it does:

- writes an experiment config into `config/config.yaml`
- restarts the backend
- runs batch evaluation
- stores the per-experiment result
- optionally applies the best experiment at the end
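The loop above can be sketched as follows. Every callable here is a hypothetical stand-in (config writing, backend restart, and batch evaluation are environment-specific), so this only shows the control flow, not `tune_fusion.py`'s real internals:

```python
def run_experiments(experiments, apply_config, restart_backend, run_batch_eval,
                    score_metric="MAP_3", apply_best=False):
    """Apply each experiment config, evaluate, and track the best score.

    All four callables are illustrative stand-ins for the runner's real steps.
    """
    results = []
    for exp in experiments:
        apply_config(exp["config"])   # write the variant into config/config.yaml
        restart_backend()             # pick up the new fusion parameters
        report = run_batch_eval()     # batch evaluation over the query set
        results.append({"name": exp["name"], "score": report[score_metric]})
    best = max(results, key=lambda r: r["score"])
    if apply_best:                    # re-apply the winner, like --apply-best
        best_exp = next(e for e in experiments if e["name"] == best["name"])
        apply_config(best_exp["config"])
        restart_backend()
    return results, best

# Demo with stubbed steps and made-up scores, just to exercise the loop.
scores = {"baseline": 0.61, "boost_translation": 0.67}
applied = []
results, best = run_experiments(
    experiments=[{"name": n, "config": {"variant": n}} for n in scores],
    apply_config=lambda cfg: applied.append(cfg["variant"]),
    restart_backend=lambda: None,
    run_batch_eval=lambda: {"MAP_3": scores[applied[-1]]},
    apply_best=True,
)
```

Sequential application matters here: each experiment must fully restart and evaluate before the next config is written, since they all share one `config/config.yaml`.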
## Current Practical Recommendation
For day-to-day evaluation:
- refresh the offline labels for the fixed query set with `batch --force-refresh-labels`
- run the web UI or normal batch evaluation in cached mode
- only force-refresh labels again when:
  - the query set changes
  - the product corpus changes materially
  - the labeling logic changes
## Caveats
- The current label cache is query-specific, not a full all-products all-queries matrix.
- Single-query evaluation still depends on the live search API for recall, but not on the LLM if labels are already cached.
- The backend restart path in this environment can be briefly unstable immediately after startup; a short wait after restart is sometimes necessary for scripting.
- Some multilingual translation hints are noisy on long-tail fashion queries, which is one reason fusion tuning around translation weight matters.
## Related Requirement Docs
- `README_Requirement.md`
- `README_Requirement_zh.md`
These documents describe the original problem statement. This README.md describes the implemented framework and the current recommended workflow.