Commit 7ddd4cb3acf5e2e0b748467448c83348c87eff20

Authored by tangwang
1 parent 9df421ed

Evaluation system upgraded from three tiers to four: Exact Match / High Relevant / Low Relevant / Irrelevant
scripts/evaluation/README.md
@@ -2,7 +2,7 @@
 
 This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
 
-**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.
+**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system and ranking metrics centered on `NDCG`.
 
 ## What it does
 
@@ -112,9 +112,33 @@ Default root: `artifacts/search_evaluation/`
 
 ## Labels
 
-- **Exact** — Matches intended product type and all explicit required attributes.
-- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
-- **Irrelevant** — Type mismatch or conflicting required attributes.
+- **Exact Match** — Matches intended product type and all explicit required attributes.
+- **High Relevant** — Main intent matches and is a strong substitute, but some attributes are missing, weaker, or slightly off.
+- **Low Relevant** — Only a weak substitute; may share scenario, style, or broad category but is no longer a strong match.
+- **Irrelevant** — Type mismatch or important conflicts make it a poor search result.
+
+## Metric design
+
+This framework now follows graded ranking evaluation closer to e-commerce best practice instead of collapsing everything into binary relevance.
+
+- **Primary metric: `NDCG@10`**
+  Uses the four labels as graded gains and rewards both relevance and early placement.
+- **Gain scheme**
+  `Exact Match=7`, `High Relevant=3`, `Low Relevant=1`, `Irrelevant=0`
+  The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup.
+- **Why this is better**
+  `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping an `Exact Match` with a `Low Relevant` item is penalized more than swapping `High Relevant` with `Low Relevant`.
+
+The reported metrics are:
+
+- **`NDCG@5`, `NDCG@10`, `NDCG@20`, `NDCG@50`** — Primary graded ranking quality.
+- **`Exact_Precision@K`** — Strict top-slot quality when only `Exact Match` counts.
+- **`Strong_Precision@K`** — Business-facing top-slot quality where `Exact Match + High Relevant` count as strong positives.
+- **`Useful_Precision@K`** — Broader usefulness where any non-irrelevant result counts.
+- **`Gain_Recall@K`** — Gain captured in the returned list versus the judged label pool for the query.
+- **`Exact_Success@K` / `Strong_Success@K`** — Whether at least one exact or strong result appears in the first K.
+- **`MRR_Exact@10` / `MRR_Strong@10`** — How early the first exact or strong result appears.
+- **`Avg_Grade@10`** — Average relevance grade of the visible first page.
 
 **Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
 
@@ -139,11 +163,11 @@ Default root: `artifacts/search_evaluation/`
 
 ## Web UI
 
-Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.
+Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits.
 
 ## Batch reports
 
-Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
+Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
 
 ## Caveats
 
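The gain scheme and `NDCG` setup documented above can be spot-checked by hand. The sketch below re-declares the gains locally rather than importing anything from this repo, so the numbers (and the toy rankings) are illustrative only:

```python
import math

# Gain scheme from the README: gain = 2^rel - 1 over rel grades 3/2/1/0.
GAINS = {"Exact Match": 7, "High Relevant": 3, "Low Relevant": 1, "Irrelevant": 0}

def dcg(labels, k):
    # Log-position discount: the same gain contributes less in later slots.
    return sum(GAINS[l] / math.log2(i + 1) for i, l in enumerate(labels[:k], start=1))

def ndcg(labels, ideal, k=10):
    # Normalize against the best possible ordering of the judged pool.
    best = sorted(ideal, key=GAINS.get, reverse=True)
    denom = dcg(best, k)
    return dcg(labels, k) / denom if denom else 0.0

ranked = ["High Relevant", "Exact Match", "Irrelevant", "Low Relevant"]
ideal = ["Exact Match", "High Relevant", "Low Relevant", "Irrelevant"]
print(round(ndcg(ranked, ideal, k=10), 4))
```

Note how placing the `Exact Match` second instead of first already costs several points of `NDCG`, which is exactly the early-placement sensitivity the README describes.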
scripts/evaluation/eval_framework/constants.py
@@ -14,8 +14,22 @@ RELEVANCE_IRRELEVANT = "Irrelevant"
 
 VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT})
 
-# Precision / MAP "positive" set (all non-irrelevant tiers)
+# Useful label sets for binary diagnostic slices layered on top of graded ranking metrics.
 RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW})
+RELEVANCE_STRONG = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH})
+
+# Graded relevance for ranking evaluation.
+# We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics.
+RELEVANCE_GRADE_MAP = {
+    RELEVANCE_EXACT: 3,
+    RELEVANCE_HIGH: 2,
+    RELEVANCE_LOW: 1,
+    RELEVANCE_IRRELEVANT: 0,
+}
+RELEVANCE_GAIN_MAP = {
+    label: (2 ** grade) - 1
+    for label, grade in RELEVANCE_GRADE_MAP.items()
+}
 
 _LEGACY_LABEL_MAP = {
     "Exact": RELEVANCE_EXACT,
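As a sanity check on the comprehension above, re-declaring the two maps locally (a standalone snippet, not part of the commit) shows the derived gains:

```python
# Local re-declaration of the grade map, with label strings inlined,
# to spot-check the gains derived by the comprehension in constants.py.
RELEVANCE_GRADE_MAP = {"Exact Match": 3, "High Relevant": 2, "Low Relevant": 1, "Irrelevant": 0}
RELEVANCE_GAIN_MAP = {label: (2 ** grade) - 1 for label, grade in RELEVANCE_GRADE_MAP.items()}

# Yields 7/3/1/0, matching the gain scheme documented in the README.
print(RELEVANCE_GAIN_MAP)
```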
scripts/evaluation/eval_framework/framework.py
@@ -26,6 +26,7 @@ from .constants import (
     DEFAULT_RERANK_HIGH_THRESHOLD,
     DEFAULT_SEARCH_RECALL_TOP_K,
     RELEVANCE_EXACT,
+    RELEVANCE_GAIN_MAP,
     RELEVANCE_HIGH,
     RELEVANCE_IRRELEVANT,
     RELEVANCE_LOW,
@@ -50,6 +51,18 @@ from .utils import (
 _log = logging.getLogger("search_eval.framework")
 
 
+def _metric_context_payload() -> Dict[str, Any]:
+    return {
+        "primary_metric": "NDCG@10",
+        "gain_scheme": dict(RELEVANCE_GAIN_MAP),
+        "notes": [
+            "NDCG uses graded gains derived from the four relevance labels.",
+            "Strong metrics treat Exact Match and High Relevant as strong business positives.",
+            "Useful metrics treat any non-irrelevant item as useful recall coverage.",
+        ],
+    }
+
+
 def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]:
     """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``."""
     out: Dict[str, str] = {}
@@ -607,7 +620,7 @@ class SearchEvaluationFramework:
             item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
             for item in search_labeled_results[:100]
         ]
-        metrics = compute_query_metrics(top100_labels)
+        metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
         output_dir = ensure_dir(self.artifact_root / "query_builds")
         run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}"
         output_json_path = output_dir / f"{run_id}.json"
@@ -629,6 +642,7 @@ class SearchEvaluationFramework:
                 "pool_size": len(pool_docs),
             },
             "metrics_top100": metrics,
+            "metric_context": _metric_context_payload(),
             "search_results": search_labeled_results,
             "full_rerank_top": rerank_top_results,
         }
@@ -816,7 +830,7 @@ class SearchEvaluationFramework:
             item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
             for item in search_labeled_results[:100]
         ]
-        metrics = compute_query_metrics(top100_labels)
+        metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
         output_dir = ensure_dir(self.artifact_root / "query_builds")
         run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}"
         output_json_path = output_dir / f"{run_id}.json"
@@ -838,6 +852,7 @@ class SearchEvaluationFramework:
                 "ordered_union_size": pool_docs_count,
             },
            "metrics_top100": metrics,
+            "metric_context": _metric_context_payload(),
             "search_results": search_labeled_results,
             "full_rerank_top": rerank_top_results,
         }
@@ -897,6 +912,10 @@ class SearchEvaluationFramework:
             item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
             for item in labeled
         ]
+        ideal_labels = [
+            label if label in VALID_LABELS else RELEVANCE_IRRELEVANT
+            for label in labels.values()
+        ]
         label_stats = self.store.get_query_label_stats(self.tenant_id, query)
         rerank_scores = self.store.get_rerank_scores(self.tenant_id, query)
         relevant_missing_ids = [
@@ -947,12 +966,13 @@ class SearchEvaluationFramework:
         if unlabeled_hits:
             tips.append(f"{unlabeled_hits} recalled results were not in the annotation set and were counted as Irrelevant.")
         if not missing_relevant:
-            tips.append("No cached non-irrelevant products were missed by this recall set.")
+            tips.append("No cached judged useful products were missed by this recall set.")
         return {
             "query": query,
             "tenant_id": self.tenant_id,
             "top_k": top_k,
-            "metrics": compute_query_metrics(metric_labels),
+            "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels),
+            "metric_context": _metric_context_payload(),
             "results": labeled,
             "missing_relevant": missing_relevant,
             "label_stats": {
@@ -1004,12 +1024,12 @@ class SearchEvaluationFramework:
         )
         m = live["metrics"]
         _log.info(
-            "[batch-eval] (%s/%s) query=%r P@10=%s MAP_3=%s total_hits=%s",
+            "[batch-eval] (%s/%s) query=%r NDCG@10=%s Strong_Precision@10=%s total_hits=%s",
             q_index,
             total_q,
             query,
-            m.get("P@10"),
-            m.get("MAP_3"),
+            m.get("NDCG@10"),
+            m.get("Strong_Precision@10"),
             live.get("total"),
         )
         aggregate = aggregate_metrics([item["metrics"] for item in per_query])
@@ -1033,6 +1053,7 @@ class SearchEvaluationFramework:
             "queries": list(queries),
             "top_k": top_k,
             "aggregate_metrics": aggregate,
+            "metric_context": _metric_context_payload(),
             "aggregate_distribution": aggregate_distribution,
             "per_query": per_query,
             "config_snapshot_path": str(config_snapshot_path),
scripts/evaluation/eval_framework/metrics.py
-"""IR metrics for labeled result lists."""
+"""Ranking metrics for graded e-commerce relevance labels."""
 
 from __future__ import annotations
 
-from typing import Dict, Sequence
+import math
+from typing import Dict, Iterable, Sequence
 
-from .constants import RELEVANCE_EXACT, RELEVANCE_IRRELEVANT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_NON_IRRELEVANT
+from .constants import (
+    RELEVANCE_EXACT,
+    RELEVANCE_GAIN_MAP,
+    RELEVANCE_GRADE_MAP,
+    RELEVANCE_HIGH,
+    RELEVANCE_IRRELEVANT,
+    RELEVANCE_LOW,
+    RELEVANCE_NON_IRRELEVANT,
+    RELEVANCE_STRONG,
+)
 
 
-def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> float:
+def _normalize_label(label: str) -> str:
+    if label in RELEVANCE_GRADE_MAP:
+        return label
+    return RELEVANCE_IRRELEVANT
+
+
+def _gains_for_labels(labels: Sequence[str]) -> list[float]:
+    return [float(RELEVANCE_GAIN_MAP.get(_normalize_label(label), 0.0)) for label in labels]
+
+
+def _binary_hits(labels: Sequence[str], relevant: Iterable[str]) -> list[int]:
+    relevant_set = set(relevant)
+    return [1 if _normalize_label(label) in relevant_set else 0 for label in labels]
+
+
+def _precision_at_k_from_hits(hits: Sequence[int], k: int) -> float:
     if k <= 0:
         return 0.0
-    sliced = list(labels[:k])
+    sliced = list(hits[:k])
     if not sliced:
         return 0.0
-    rel = set(relevant)
-    hits = sum(1 for label in sliced if label in rel)
-    return hits / float(min(k, len(sliced)))
+    return sum(sliced) / float(len(sliced))
+
+
+def _success_at_k_from_hits(hits: Sequence[int], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    return 1.0 if any(hits[:k]) else 0.0
+
+
+def _reciprocal_rank_from_hits(hits: Sequence[int], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    for idx, hit in enumerate(hits[:k], start=1):
+        if hit:
+            return 1.0 / float(idx)
+    return 0.0
 
 
-def average_precision(labels: Sequence[str], relevant: Sequence[str]) -> float:
-    rel = set(relevant)
-    hit_count = 0
-    precision_sum = 0.0
-    for idx, label in enumerate(labels, start=1):
-        if label not in rel:
+def _dcg_at_k(gains: Sequence[float], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    total = 0.0
+    for idx, gain in enumerate(gains[:k], start=1):
+        if gain <= 0.0:
             continue
-        hit_count += 1
-        precision_sum += hit_count / idx
-    if hit_count == 0:
+        total += gain / math.log2(idx + 1.0)
+    return total
+
+
+def _ndcg_at_k(labels: Sequence[str], ideal_labels: Sequence[str], k: int) -> float:
+    actual_gains = _gains_for_labels(labels)
+    ideal_gains = sorted(_gains_for_labels(ideal_labels), reverse=True)
+    dcg = _dcg_at_k(actual_gains, k)
+    idcg = _dcg_at_k(ideal_gains, k)
+    if idcg <= 0.0:
+        return 0.0
+    return dcg / idcg
+
+
+def _gain_recall_at_k(labels: Sequence[str], ideal_labels: Sequence[str], k: int) -> float:
+    ideal_total_gain = sum(_gains_for_labels(ideal_labels))
+    if ideal_total_gain <= 0.0:
         return 0.0
-    return precision_sum / hit_count
+    actual_gain = sum(_gains_for_labels(labels[:k]))
+    return actual_gain / ideal_total_gain
 
 
-def compute_query_metrics(labels: Sequence[str]) -> Dict[str, float]:
-    """P@k / MAP_3: Exact Match only. P@k_2_3 / MAP_2_3: any non-irrelevant tier (legacy metric names)."""
+def _grade_avg_at_k(labels: Sequence[str], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    sliced = [_normalize_label(label) for label in labels[:k]]
+    if not sliced:
+        return 0.0
+    return sum(float(RELEVANCE_GRADE_MAP.get(label, 0)) for label in sliced) / float(len(sliced))
+
+
+def compute_query_metrics(
+    labels: Sequence[str],
+    *,
+    ideal_labels: Sequence[str] | None = None,
+) -> Dict[str, float]:
+    """Compute graded ranking metrics plus binary diagnostic slices.
+
+    `labels` are the ranked results returned by search.
+    `ideal_labels` is the judged label pool for the same query; when omitted we fall back
+    to the retrieved labels, which still keeps the metrics well-defined.
+    """
+
+    ideal = list(ideal_labels) if ideal_labels is not None else list(labels)
     metrics: Dict[str, float] = {}
-    non_irrel = list(RELEVANCE_NON_IRRELEVANT)
+
+    exact_hits = _binary_hits(labels, [RELEVANCE_EXACT])
+    strong_hits = _binary_hits(labels, RELEVANCE_STRONG)
+    useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT)
+
     for k in (5, 10, 20, 50):
-        metrics[f"P@{k}"] = round(precision_at_k(labels, k, [RELEVANCE_EXACT]), 6)
-        metrics[f"P@{k}_2_3"] = round(precision_at_k(labels, k, non_irrel), 6)
-    metrics["MAP_3"] = round(average_precision(labels, [RELEVANCE_EXACT]), 6)
-    metrics["MAP_2_3"] = round(average_precision(labels, non_irrel), 6)
+        metrics[f"NDCG@{k}"] = round(_ndcg_at_k(labels, ideal, k), 6)
+    for k in (5, 10, 20):
+        metrics[f"Exact_Precision@{k}"] = round(_precision_at_k_from_hits(exact_hits, k), 6)
+        metrics[f"Strong_Precision@{k}"] = round(_precision_at_k_from_hits(strong_hits, k), 6)
+    for k in (10, 20, 50):
+        metrics[f"Useful_Precision@{k}"] = round(_precision_at_k_from_hits(useful_hits, k), 6)
+        metrics[f"Gain_Recall@{k}"] = round(_gain_recall_at_k(labels, ideal, k), 6)
+    for k in (5, 10):
+        metrics[f"Exact_Success@{k}"] = round(_success_at_k_from_hits(exact_hits, k), 6)
+        metrics[f"Strong_Success@{k}"] = round(_success_at_k_from_hits(strong_hits, k), 6)
+    metrics["MRR_Exact@10"] = round(_reciprocal_rank_from_hits(exact_hits, 10), 6)
+    metrics["MRR_Strong@10"] = round(_reciprocal_rank_from_hits(strong_hits, 10), 6)
+    metrics["Avg_Grade@10"] = round(_grade_avg_at_k(labels, 10), 6)
     return metrics
 
 
 def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -> Dict[str, float]:
     if not metric_items:
         return {}
-    keys = sorted(metric_items[0].keys())
+    all_keys = sorted({key for item in metric_items for key in item.keys()})
     return {
         key: round(sum(float(item.get(key, 0.0)) for item in metric_items) / len(metric_items), 6)
-        for key in keys
+        for key in all_keys
     }
 
 
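The `aggregate_metrics` change (key union instead of `metric_items[0].keys()`) matters when per-query metric dicts differ, e.g. when older cached runs lack the new metric names. A standalone sketch of the new behavior with illustrative values:

```python
def aggregate_metrics(metric_items):
    # Average each metric across queries. Taking the key union means a metric
    # reported by only some queries is still averaged (missing entries count as 0.0),
    # instead of being silently dropped because the first item lacked it.
    if not metric_items:
        return {}
    all_keys = sorted({key for item in metric_items for key in item.keys()})
    return {
        key: round(sum(float(item.get(key, 0.0)) for item in metric_items) / len(metric_items), 6)
        for key in all_keys
    }

# "MRR_Exact@10" appears only in the second item; the old first-item key set
# would have dropped it from the aggregate entirely.
print(aggregate_metrics([{"NDCG@10": 0.5}, {"NDCG@10": 0.7, "MRR_Exact@10": 1.0}]))
```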
scripts/evaluation/eval_framework/reports.py
@@ -7,6 +7,19 @@ from typing import Any, Dict
 from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW
 
 
+def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -> None:
+    primary_keys = ("NDCG@5", "NDCG@10", "NDCG@20", "Exact_Precision@10", "Strong_Precision@10", "Gain_Recall@50")
+    included = set()
+    for key in primary_keys:
+        if key in metrics:
+            lines.append(f"- {key}: {metrics[key]}")
+            included.add(key)
+    for key, value in sorted(metrics.items()):
+        if key in included:
+            continue
+        lines.append(f"- {key}: {value}")
+
+
 def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
     lines = [
         "# Search Batch Evaluation",
@@ -20,8 +33,16 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
         "## Aggregate Metrics",
         "",
     ]
-    for key, value in sorted((payload.get("aggregate_metrics") or {}).items()):
-        lines.append(f"- {key}: {value}")
+    metric_context = payload.get("metric_context") or {}
+    if metric_context:
+        lines.extend(
+            [
+                f"- Primary metric: {metric_context.get('primary_metric', 'N/A')}",
+                f"- Gain scheme: {metric_context.get('gain_scheme', {})}",
+                "",
+            ]
+        )
+    _append_metric_block(lines, payload.get("aggregate_metrics") or {})
     distribution = payload.get("aggregate_distribution") or {}
     if distribution:
         lines.extend(
@@ -39,8 +60,7 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
     for item in payload.get("per_query") or []:
         lines.append(f"### {item['query']}")
         lines.append("")
-        for key, value in sorted((item.get("metrics") or {}).items()):
-            lines.append(f"- {key}: {value}")
+        _append_metric_block(lines, item.get("metrics") or {})
         distribution = item.get("distribution") or {}
         lines.append(f"- Exact Match: {distribution.get(RELEVANCE_EXACT, 0)}")
         lines.append(f"- High Relevant: {distribution.get(RELEVANCE_HIGH, 0)}")
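The ordering `_append_metric_block` produces (primary metrics first, the rest alphabetically) can be illustrated with a local re-implementation of the same logic, standalone for demonstration:

```python
def append_metric_block(lines, metrics):
    # Local sketch of the helper above: emit the designated primary metrics
    # in a fixed order first, then every remaining metric sorted by name.
    primary_keys = ("NDCG@5", "NDCG@10", "NDCG@20", "Exact_Precision@10", "Strong_Precision@10", "Gain_Recall@50")
    included = set()
    for key in primary_keys:
        if key in metrics:
            lines.append(f"- {key}: {metrics[key]}")
            included.add(key)
    for key, value in sorted(metrics.items()):
        if key not in included:
            lines.append(f"- {key}: {value}")

lines = []
append_metric_block(lines, {"Avg_Grade@10": 1.8, "NDCG@10": 0.91, "NDCG@5": 0.88})
print(lines)
```

This keeps the headline `NDCG` numbers at the top of every report section even though they sort after `Avg_Grade@10` alphabetically.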
scripts/evaluation/eval_framework/static/eval_web.css
@@ -6,7 +6,8 @@
   --line: #ddd4c6;
   --accent: #0f766e;
   --exact: #0f766e;
-  --partial: #b7791f;
+  --high: #b7791f;
+  --low: #3b82a0;
   --irrelevant: #b42318;
 }
 body { margin: 0; font-family: "IBM Plex Sans", "Segoe UI", sans-serif; color: var(--ink); background:
@@ -29,6 +30,12 @@
 button { border: 0; background: var(--accent); color: white; padding: 12px 16px; border-radius: 14px; cursor: pointer; font-weight: 600; }
 button.secondary { background: #d9e6e3; color: #12433d; }
 .grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(170px, 1fr)); gap: 12px; margin-bottom: 16px; }
+.metric-context { margin: 0 0 12px; line-height: 1.5; }
+.metric-section { margin-bottom: 18px; }
+.metric-section-head { display: flex; align-items: baseline; justify-content: space-between; gap: 12px; margin-bottom: 10px; }
+.metric-section-head h3 { margin: 0; font-size: 14px; color: #12433d; }
+.metric-section-head p { margin: 0; color: var(--muted); font-size: 12px; }
+.metric-grid { margin-bottom: 0; }
 .metric { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 14px; }
 .metric .label { font-size: 12px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.04em; }
 .metric .value { font-size: 24px; font-weight: 700; margin-top: 4px; }
@@ -36,8 +43,8 @@
 .result { display: grid; grid-template-columns: 110px 100px 1fr; gap: 14px; align-items: center; background: var(--panel); border: 1px solid var(--line); border-radius: 18px; padding: 12px; }
 .badge { display: inline-block; padding: 8px 10px; border-radius: 999px; color: white; font-weight: 700; text-align: center; }
 .label-exact-match { background: var(--exact); }
-.label-high-relevant { background: var(--partial); }
-.label-low-relevant { background: #6b5b95; }
+.label-high-relevant { background: var(--high); }
+.label-low-relevant { background: var(--low); }
 .label-irrelevant { background: var(--irrelevant); }
 .badge-unknown { background: #637381; }
 .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; }
@@ -91,3 +98,13 @@
 .report-modal-body.report-modal-loading, .report-modal-body.report-modal-error { color: var(--muted); font-style: italic; }
 .tips { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 14px; line-height: 1.6; }
 .tip { margin-bottom: 6px; color: var(--muted); }
+@media (max-width: 960px) {
+  .app { grid-template-columns: 1fr; }
+  .sidebar { border-right: 0; border-bottom: 1px solid var(--line); }
+  .metric-section-head { flex-direction: column; align-items: flex-start; }
+}
+@media (max-width: 640px) {
+  .main, .sidebar { padding: 16px; }
+  .result { grid-template-columns: 1fr; }
+  .thumb { width: 100%; max-width: 180px; height: auto; aspect-ratio: 1 / 1; }
+}
scripts/evaluation/eval_framework/static/eval_web.js
-async function fetchJSON(url, options) {
-  const res = await fetch(url, options);
-  if (!res.ok) throw new Error(await res.text());
-  return await res.json();
-}
-function renderMetrics(metrics) {
-  const root = document.getElementById('metrics');
-  root.innerHTML = '';
-  Object.entries(metrics || {}).forEach(([key, value]) => {
-    const card = document.createElement('div');
-    card.className = 'metric';
-    card.innerHTML = `<div class="label">${key}</div><div class="value">${value}</div>`;
-    root.appendChild(card);
-  });
-}
-function labelBadgeClass(label) {
-  if (!label || label === 'Unknown') return 'badge-unknown';
-  return 'label-' + String(label).toLowerCase().replace(/\s+/g, '-');
-}
-function renderResults(results, rootId='results', showRank=true) {
-  const mount = document.getElementById(rootId);
-  mount.innerHTML = '';
-  (results || []).forEach(item => {
-    const label = item.label || 'Unknown';
-    const box = document.createElement('div');
-    box.className = 'result';
-    box.innerHTML = `
-      <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || '-'}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : 'not recalled')}</div></div>
-      <img class="thumb" src="${item.image_url || ''}" alt="" />
-      <div>
-        <div class="title">${item.title || ''}</div>
-        ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''}
-        <div class="options">
-          <div>${(item.option_values || [])[0] || ''}</div>
-          <div>${(item.option_values || [])[1] || ''}</div>
-          <div>${(item.option_values || [])[2] || ''}</div>
-        </div>
-      </div>`;
-    mount.appendChild(box);
-  });
-  if (!(results || []).length) {
-    mount.innerHTML = '<div class="muted">None.</div>';
-  }
-}
-function renderTips(data) {
-  const root = document.getElementById('tips');
-  const tips = [...(data.tips || [])];
-  const stats = data.label_stats || {};
-  tips.unshift(`Cached labels for query: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed (non-irrelevant): ${stats.missing_relevant_count || 0} — Exact: ${stats.missing_exact_count || 0}, High: ${stats.missing_high_count || 0}, Low: ${stats.missing_low_count || 0}.`);
-  root.innerHTML = tips.map(text => `<div class="tip">${text}</div>`).join('');
-}
-async function loadQueries() {
-  const data = await fetchJSON('/api/queries');
-  const root = document.getElementById('queryList');
-  root.innerHTML = '';
-  data.queries.forEach(query => {
-    const btn = document.createElement('button');
-    btn.className = 'query-item';
-    btn.textContent = query;
-    btn.onclick = () => {
-      document.getElementById('queryInput').value = query;
-      runSingle();
-    };
-    root.appendChild(btn);
-  });
-}
-function fmtMetric(m, key, digits) {
-  const v = m && m[key];
-  if (v == null || Number.isNaN(Number(v))) return null;
-  const n = Number(v);
-  return n.toFixed(digits);
-}
-function historySummaryHtml(meta) {
-  const m = meta && meta.aggregate_metrics;
-  const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
-  const parts = [];
-  if (nq != null) parts.push(`<span>Queries</span> ${nq}`);
-  const p10 = fmtMetric(m, 'P@10', 3);
-  const p52 = fmtMetric(m, 'P@5_2_3', 3);
-  const map3 = fmtMetric(m, 'MAP_3', 3);
-  if (p10) parts.push(`<span>P@10</span> ${p10}`);
-  if (p52) parts.push(`<span>P@5_2_3</span> ${p52}`);
-  if (map3) parts.push(`<span>MAP_3</span> ${map3}`);
-  if (!parts.length) return '';
-  return `<div class="hstats">${parts.join(' · ')}</div>`;
-}
-async function loadHistory() {
-  const data = await fetchJSON('/api/history');
-  const root = document.getElementById('history');
-  root.classList.remove('muted');
-  const items = data.history || [];
-  if (!items.length) {
-    root.innerHTML = '<span class="muted">No history yet.</span>';
-    return;
-  }
-  root.innerHTML = `<div class="history-list"></div>`;
-  const list = root.querySelector('.history-list');
-  items.forEach(item => {
-    const btn = document.createElement('button');
-    btn.type = 'button';
-    btn.className = 'history-item';
-    btn.setAttribute('aria-label', `Open report ${item.batch_id}`);
-    const sum = historySummaryHtml(item.metadata);
-    btn.innerHTML = `<div class="hid">${item.batch_id}</div>
-      <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`;
-    btn.onclick = () => openBatchReport(item.batch_id);
-    list.appendChild(btn);
108 - });  
109 - }  
110 - let _lastReportPath = '';  
111 - function closeReportModal() {  
112 - const el = document.getElementById('reportModal');  
113 - el.classList.remove('is-open');  
114 - el.setAttribute('aria-hidden', 'true');  
115 - document.getElementById('reportModalBody').innerHTML = '';  
116 - document.getElementById('reportModalMeta').textContent = '';  
117 - }  
118 - async function openBatchReport(batchId) {  
119 - const el = document.getElementById('reportModal');  
120 - const body = document.getElementById('reportModalBody');  
121 - const metaEl = document.getElementById('reportModalMeta');  
122 - const titleEl = document.getElementById('reportModalTitle');  
123 - el.classList.add('is-open');  
124 - el.setAttribute('aria-hidden', 'false');  
125 - titleEl.textContent = batchId;  
126 - metaEl.textContent = '';  
127 - body.className = 'report-modal-body batch-report-md report-modal-loading';  
128 - body.textContent = 'Loading report…';  
129 - try {  
130 - const rep = await fetchJSON('/api/history/' + encodeURIComponent(batchId) + '/report');  
131 - _lastReportPath = rep.report_markdown_path || '';  
132 - metaEl.textContent = rep.report_markdown_path || '';  
133 - const raw = marked.parse(rep.markdown || '', { gfm: true });  
134 - const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } });  
135 - body.className = 'report-modal-body batch-report-md';  
136 - body.innerHTML = safe;  
137 - } catch (e) {  
138 - body.className = 'report-modal-body report-modal-error';  
139 - body.textContent = (e && e.message) ? e.message : String(e);  
140 - }  
141 - }  
142 - document.getElementById('reportModal').addEventListener('click', (ev) => {
143 -   if (ev.target && ev.target.getAttribute('data-close-report') === '1') closeReportModal();
1 +async function fetchJSON(url, options) {
  2 + const res = await fetch(url, options);
  3 + if (!res.ok) throw new Error(await res.text());
  4 + return await res.json();
  5 +}
  6 +
  7 +function fmtNumber(value, digits = 3) {
  8 + if (value == null || Number.isNaN(Number(value))) return "-";
  9 + return Number(value).toFixed(digits);
  10 +}
  11 +
  12 +function metricSections(metrics) {
  13 + const groups = [
  14 + {
  15 + title: "Primary Ranking",
  16 + keys: ["NDCG@5", "NDCG@10", "NDCG@20", "NDCG@50"],
  17 + description: "Graded ranking quality across the four relevance tiers.",
  18 + },
  19 + {
  20 + title: "Top Slot Quality",
  21 + keys: ["Exact_Precision@5", "Exact_Precision@10", "Strong_Precision@5", "Strong_Precision@10", "Strong_Precision@20"],
  22 + description: "How much of the visible top rank is exact or strong business relevance.",
  23 + },
  24 + {
  25 + title: "Recall Coverage",
  26 + keys: ["Useful_Precision@10", "Useful_Precision@20", "Useful_Precision@50", "Gain_Recall@10", "Gain_Recall@20", "Gain_Recall@50"],
  27 + description: "How much judged relevance is captured in the returned list.",
  28 + },
  29 + {
  30 + title: "First Good Result",
  31 + keys: ["Exact_Success@5", "Exact_Success@10", "Strong_Success@5", "Strong_Success@10", "MRR_Exact@10", "MRR_Strong@10", "Avg_Grade@10"],
  32 + description: "Whether users see a good result early and how good the top page feels overall.",
  33 + },
  34 + ];
  35 + const seen = new Set();
  36 + return groups
  37 + .map((group) => {
  38 + const items = group.keys
  39 + .filter((key) => metrics && Object.prototype.hasOwnProperty.call(metrics, key))
  40 + .map((key) => {
  41 + seen.add(key);
  42 + return [key, metrics[key]];
  43 + });
  44 + return { ...group, items };
  45 + })
  46 + .filter((group) => group.items.length)
  47 + .concat(
  48 + (() => {
  49 + const rest = Object.entries(metrics || {}).filter(([key]) => !seen.has(key));
  50 + return rest.length
  51 + ? [{ title: "Other Metrics", description: "", items: rest }]
  52 + : [];
  53 + })()
  54 + );
  55 +}
  56 +
  57 +function renderMetrics(metrics, metricContext) {
  58 + const root = document.getElementById("metrics");
  59 + root.innerHTML = "";
  60 + const ctx = document.getElementById("metricContext");
  61 + const gainScheme = metricContext && metricContext.gain_scheme;
  62 + const primary = metricContext && metricContext.primary_metric;
  63 + ctx.textContent = primary
  64 + ? `Primary metric: ${primary}. Gain scheme: ${Object.entries(gainScheme || {}).map(([label, gain]) => `${label}=${gain}`).join(", ")}.`
  65 + : "";
  66 +
  67 + metricSections(metrics || {}).forEach((section) => {
  68 + const wrap = document.createElement("section");
  69 + wrap.className = "metric-section";
  70 + wrap.innerHTML = `
  71 + <div class="metric-section-head">
  72 + <h3>${section.title}</h3>
  73 + ${section.description ? `<p>${section.description}</p>` : ""}
  74 + </div>
  75 + <div class="grid metric-grid"></div>
  76 + `;
  77 + const grid = wrap.querySelector(".metric-grid");
  78 + section.items.forEach(([key, value]) => {
  79 + const card = document.createElement("div");
  80 + card.className = "metric";
  81 + card.innerHTML = `<div class="label">${key}</div><div class="value">${fmtNumber(value)}</div>`;
  82 + grid.appendChild(card);
144 83 });
145 - document.addEventListener('keydown', (ev) => {  
146 - if (ev.key === 'Escape') closeReportModal();  
147 - });  
148 - document.getElementById('reportCopyPath').addEventListener('click', async () => {  
149 - if (!_lastReportPath) return;  
150 - try {  
151 - await navigator.clipboard.writeText(_lastReportPath);  
152 - } catch (_) {}  
153 - });  
154 - async function runSingle() {  
155 - const query = document.getElementById('queryInput').value.trim();  
156 - if (!query) return;  
157 - document.getElementById('status').textContent = `Evaluating "${query}"...`;  
158 - const data = await fetchJSON('/api/search-eval', {  
159 - method: 'POST',  
160 - headers: {'Content-Type': 'application/json'},  
161 - body: JSON.stringify({query, top_k: 100, auto_annotate: false})  
162 - });  
163 - document.getElementById('status').textContent = `Done. total=${data.total}`;  
164 - renderMetrics(data.metrics);  
165 - renderResults(data.results, 'results', true);  
166 - renderResults(data.missing_relevant, 'missingRelevant', false);  
167 - renderTips(data);  
168 - loadHistory();  
169 - }  
170 - async function runBatch() {  
171 - document.getElementById('status').textContent = 'Running batch evaluation...';  
172 - const data = await fetchJSON('/api/batch-eval', {  
173 - method: 'POST',  
174 - headers: {'Content-Type': 'application/json'},  
175 - body: JSON.stringify({top_k: 100, auto_annotate: false})  
176 - });  
177 - document.getElementById('status').textContent = `Batch done. report=${data.batch_id}`;  
178 - renderMetrics(data.aggregate_metrics);  
179 - renderResults([], 'results', true);  
180 - renderResults([], 'missingRelevant', false);  
181 - document.getElementById('tips').innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>';  
182 - loadHistory();  
183 - }  
184 - loadQueries();  
185 - loadHistory();  
186 -
84 +    root.appendChild(wrap);
  85 + });
  86 +}
  87 +
  88 +function labelBadgeClass(label) {
  89 + if (!label || label === "Unknown") return "badge-unknown";
  90 + return "label-" + String(label).toLowerCase().replace(/\s+/g, "-");
  91 +}
  92 +
  93 +function renderResults(results, rootId = "results", showRank = true) {
  94 + const mount = document.getElementById(rootId);
  95 + mount.innerHTML = "";
  96 + (results || []).forEach((item) => {
  97 + const label = item.label || "Unknown";
  98 + const box = document.createElement("div");
  99 + box.className = "result";
  100 + box.innerHTML = `
  101 + <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || "-"}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : "not recalled")}</div></div>
  102 + <img class="thumb" src="${item.image_url || ""}" alt="" />
  103 + <div>
  104 + <div class="title">${item.title || ""}</div>
  105 + ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ""}
  106 + <div class="options">
  107 + <div>${(item.option_values || [])[0] || ""}</div>
  108 + <div>${(item.option_values || [])[1] || ""}</div>
  109 + <div>${(item.option_values || [])[2] || ""}</div>
  110 + </div>
  111 + </div>`;
  112 + mount.appendChild(box);
  113 + });
  114 + if (!(results || []).length) {
  115 + mount.innerHTML = '<div class="muted">None.</div>';
  116 + }
  117 +}
  118 +
  119 +function renderTips(data) {
  120 + const root = document.getElementById("tips");
  121 + const tips = [...(data.tips || [])];
  122 + const stats = data.label_stats || {};
  123 + tips.unshift(
  124 + `Cached labels: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed judged useful results: ${stats.missing_relevant_count || 0} (Exact ${stats.missing_exact_count || 0}, High ${stats.missing_high_count || 0}, Low ${stats.missing_low_count || 0}).`
  125 + );
  126 + root.innerHTML = tips.map((text) => `<div class="tip">${text}</div>`).join("");
  127 +}
  128 +
  129 +async function loadQueries() {
  130 + const data = await fetchJSON("/api/queries");
  131 + const root = document.getElementById("queryList");
  132 + root.innerHTML = "";
  133 + data.queries.forEach((query) => {
  134 + const btn = document.createElement("button");
  135 + btn.className = "query-item";
  136 + btn.textContent = query;
  137 + btn.onclick = () => {
  138 + document.getElementById("queryInput").value = query;
  139 + runSingle();
  140 + };
  141 + root.appendChild(btn);
  142 + });
  143 +}
  144 +
  145 +function historySummaryHtml(meta) {
  146 + const m = meta && meta.aggregate_metrics;
  147 + const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
  148 + const parts = [];
  149 + if (nq != null) parts.push(`<span>Queries</span> ${nq}`);
  150 + if (m && m["NDCG@10"] != null) parts.push(`<span>NDCG@10</span> ${fmtNumber(m["NDCG@10"])}`);
  151 + if (m && m["Strong_Precision@10"] != null) parts.push(`<span>Strong@10</span> ${fmtNumber(m["Strong_Precision@10"])}`);
  152 + if (m && m["Gain_Recall@50"] != null) parts.push(`<span>Gain Recall@50</span> ${fmtNumber(m["Gain_Recall@50"])}`);
  153 + if (!parts.length) return "";
  154 + return `<div class="hstats">${parts.join(" · ")}</div>`;
  155 +}
  156 +
  157 +async function loadHistory() {
  158 + const data = await fetchJSON("/api/history");
  159 + const root = document.getElementById("history");
  160 + root.classList.remove("muted");
  161 + const items = data.history || [];
  162 + if (!items.length) {
  163 + root.innerHTML = '<span class="muted">No history yet.</span>';
  164 + return;
  165 + }
  166 + root.innerHTML = `<div class="history-list"></div>`;
  167 + const list = root.querySelector(".history-list");
  168 + items.forEach((item) => {
  169 + const btn = document.createElement("button");
  170 + btn.type = "button";
  171 + btn.className = "history-item";
  172 + btn.setAttribute("aria-label", `Open report ${item.batch_id}`);
  173 + const sum = historySummaryHtml(item.metadata);
  174 + btn.innerHTML = `<div class="hid">${item.batch_id}</div>
  175 + <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`;
  176 + btn.onclick = () => openBatchReport(item.batch_id);
  177 + list.appendChild(btn);
  178 + });
  179 +}
  180 +
  181 +let _lastReportPath = "";
  182 +
  183 +function closeReportModal() {
  184 + const el = document.getElementById("reportModal");
  185 + el.classList.remove("is-open");
  186 + el.setAttribute("aria-hidden", "true");
  187 + document.getElementById("reportModalBody").innerHTML = "";
  188 + document.getElementById("reportModalMeta").textContent = "";
  189 +}
  190 +
  191 +async function openBatchReport(batchId) {
  192 + const el = document.getElementById("reportModal");
  193 + const body = document.getElementById("reportModalBody");
  194 + const metaEl = document.getElementById("reportModalMeta");
  195 + const titleEl = document.getElementById("reportModalTitle");
  196 + el.classList.add("is-open");
  197 + el.setAttribute("aria-hidden", "false");
  198 + titleEl.textContent = batchId;
  199 + metaEl.textContent = "";
  200 + body.className = "report-modal-body batch-report-md report-modal-loading";
  201 + body.textContent = "Loading report…";
  202 + try {
  203 + const rep = await fetchJSON("/api/history/" + encodeURIComponent(batchId) + "/report");
  204 + _lastReportPath = rep.report_markdown_path || "";
  205 + metaEl.textContent = rep.report_markdown_path || "";
  206 + const raw = marked.parse(rep.markdown || "", { gfm: true });
  207 + const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } });
  208 + body.className = "report-modal-body batch-report-md";
  209 + body.innerHTML = safe;
  210 + } catch (e) {
  211 + body.className = "report-modal-body report-modal-error";
  212 + body.textContent = e && e.message ? e.message : String(e);
  213 + }
  214 +}
  215 +
  216 +document.getElementById("reportModal").addEventListener("click", (ev) => {
  217 + if (ev.target && ev.target.getAttribute("data-close-report") === "1") closeReportModal();
  218 +});
  219 +
  220 +document.addEventListener("keydown", (ev) => {
  221 + if (ev.key === "Escape") closeReportModal();
  222 +});
  223 +
  224 +document.getElementById("reportCopyPath").addEventListener("click", async () => {
  225 + if (!_lastReportPath) return;
  226 + try {
  227 + await navigator.clipboard.writeText(_lastReportPath);
  228 + } catch (_) {}
  229 +});
  230 +
  231 +async function runSingle() {
  232 + const query = document.getElementById("queryInput").value.trim();
  233 + if (!query) return;
  234 + document.getElementById("status").textContent = `Evaluating "${query}"...`;
  235 + const data = await fetchJSON("/api/search-eval", {
  236 + method: "POST",
  237 + headers: { "Content-Type": "application/json" },
  238 + body: JSON.stringify({ query, top_k: 100, auto_annotate: false }),
  239 + });
  240 + document.getElementById("status").textContent = `Done. total=${data.total}`;
  241 + renderMetrics(data.metrics, data.metric_context);
  242 + renderResults(data.results, "results", true);
  243 + renderResults(data.missing_relevant, "missingRelevant", false);
  244 + renderTips(data);
  245 + loadHistory();
  246 +}
  247 +
  248 +async function runBatch() {
  249 + document.getElementById("status").textContent = "Running batch evaluation...";
  250 + const data = await fetchJSON("/api/batch-eval", {
  251 + method: "POST",
  252 + headers: { "Content-Type": "application/json" },
  253 + body: JSON.stringify({ top_k: 100, auto_annotate: false }),
  254 + });
  255 + document.getElementById("status").textContent = `Batch done. report=${data.batch_id}`;
  256 + renderMetrics(data.aggregate_metrics, data.metric_context);
  257 + renderResults([], "results", true);
  258 + renderResults([], "missingRelevant", false);
  259 + document.getElementById("tips").innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>';
  260 + loadHistory();
  261 +}
  262 +
  263 +loadQueries();
  264 +loadHistory();
scripts/evaluation/eval_framework/static/index.html
@@ -30,6 +30,7 @@
30 30 <div id="status" class="muted section"></div>
31 31 <section class="section">
32 32 <h2>Metrics</h2>
33 + <p id="metricContext" class="muted metric-context"></p>
33 34 <div id="metrics" class="grid"></div>
34 35 </section>
35 36 <section class="section">
@@ -37,7 +38,7 @@
37 38 <div id="results" class="results"></div>
38 39 </section>
39 40 <section class="section">
40 - <h2>Missed non-irrelevant (cached)</h2>
41 + <h2>Missed judged useful results</h2>
41 42 <div id="missingRelevant" class="results"></div>
42 43 </section>
43 44 <section class="section">
@@ -67,4 +68,4 @@
67 68 <script src="https://cdn.jsdelivr.net/npm/dompurify@3.1.6/dist/purify.min.js"></script>
68 69 <script src="/static/eval_web.js"></script>
69 70 </body>
70 -</html>
71 \ No newline at end of file
  72 +</html>
scripts/evaluation/tune_fusion.py
@@ -150,7 +150,7 @@ def render_markdown(summary: Dict[str, Any]) -> str:
150 150 "",
151 151 "## Experiments",
152 152 "",
153 - "| Rank | Name | Score | MAP_3 | MAP_2_3 | P@5 | P@10 | Config |",
153 + "| Rank | Name | Score | NDCG@10 | NDCG@20 | Strong@10 | Gain Recall@50 | Config |",
154 154 "|---|---|---:|---:|---:|---:|---:|---|",
155 155 ]
156 156 for idx, item in enumerate(summary["experiments"], start=1):
@@ -162,10 +162,10 @@ def render_markdown(summary: Dict[str, Any]) -> str:
162 162 str(idx),
163 163 item["name"],
164 164 str(item["score"]),
165 - str(metrics.get("MAP_3", "")),
166 - str(metrics.get("MAP_2_3", "")),
167 - str(metrics.get("P@5", "")),
168 - str(metrics.get("P@10", "")),
165 + str(metrics.get("NDCG@10", "")),
166 + str(metrics.get("NDCG@20", "")),
167 + str(metrics.get("Strong_Precision@10", "")),
168 + str(metrics.get("Gain_Recall@50", "")),
169 169 item["config_snapshot_path"],
170 170 ]
171 171 )
@@ -206,7 +206,7 @@ def build_parser() -> argparse.ArgumentParser:
206 206 parser.add_argument("--language", default="en")
207 207 parser.add_argument("--experiments-file", required=True)
208 208 parser.add_argument("--search-base-url", default="http://127.0.0.1:6002")
209 - parser.add_argument("--score-metric", default="MAP_3")
209 + parser.add_argument("--score-metric", default="NDCG@10")
210 210 parser.add_argument("--apply-best", action="store_true")
211 211 parser.add_argument("--force-refresh-labels-first-pass", action="store_true")
212 212 return parser
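The tuning runner above now ranks experiments by `NDCG@10` computed over the four-tier labels. As a rough illustration only (not the repo's implementation), a graded NDCG@K can be sketched as below, assuming a linear gain scheme of Exact Match=3, High Relevant=2, Low Relevant=1, Irrelevant=0; the actual gains used by the service are reported in the `metric_context.gain_scheme` payload rendered by the UI, and some NDCG variants use exponential gains (`2**rel - 1`) instead:

```python
import math

# Hypothetical gain scheme for the four relevance tiers; the real values
# come from the service's metric_context ("gain_scheme") response field.
GAINS = {"Exact Match": 3, "High Relevant": 2, "Low Relevant": 1, "Irrelevant": 0}


def dcg(gains, k):
    """Discounted cumulative gain over the top-k gains (log2 position discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_labels, k=10):
    """NDCG@k for a ranked list of tier labels; unlabeled items count as Irrelevant."""
    gains = [GAINS.get(label, 0) for label in ranked_labels]
    ideal = sorted(gains, reverse=True)  # best possible ordering of the same items
    denom = dcg(ideal, k)
    return dcg(gains, k) / denom if denom > 0 else 0.0
```

For example, `ndcg_at_k(["Exact Match", "High Relevant", "Low Relevant"])` returns 1.0 (already ideally ordered), while placing a Low Relevant item above an Exact Match lowers the score.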