Commit 7ddd4cb3acf5e2e0b748467448c83348c87eff20 (1 parent: 9df421ed)

Evaluation system upgraded from three tiers to four: Exact Match / High Relevant / Low Relevant / Irrelevant

Showing 9 changed files with 502 additions and 241 deletions
scripts/evaluation/README.md

@@ -2,7 +2,7 @@
 
 This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
 
-**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.
+**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system and ranking metrics centered on `NDCG`.
 
 ## What it does
 
@@ -112,9 +112,33 @@ Default root: `artifacts/search_evaluation/`
 
 ## Labels
 
-- **Exact** — Matches intended product type and all explicit required attributes.
-- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
-- **Irrelevant** — Type mismatch or conflicting required attributes.
+- **Exact Match** — Matches intended product type and all explicit required attributes.
+- **High Relevant** — Main intent matches and is a strong substitute, but some attributes are missing, weaker, or slightly off.
+- **Low Relevant** — Only a weak substitute; may share scenario, style, or broad category but is no longer a strong match.
+- **Irrelevant** — Type mismatch or important conflicts make it a poor search result.
+
+## Metric design
+
+This framework now follows graded ranking evaluation closer to e-commerce best practice instead of collapsing everything into binary relevance.
+
+- **Primary metric: `NDCG@10`**
+  Uses the four labels as graded gains and rewards both relevance and early placement.
+- **Gain scheme**
+  `Exact Match=7`, `High Relevant=3`, `Low Relevant=1`, `Irrelevant=0`.
+  The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup.
+- **Why this is better**
+  `NDCG` differentiates "exact", "strong substitute", and "weak substitute", so swapping an `Exact Match` with a `Low Relevant` item is penalized more than swapping `High Relevant` with `Low Relevant`.
+
+The reported metrics are:
+
+- **`NDCG@5`, `NDCG@10`, `NDCG@20`, `NDCG@50`** — Primary graded ranking quality.
+- **`Exact_Precision@K`** — Strict top-slot quality when only `Exact Match` counts.
+- **`Strong_Precision@K`** — Business-facing top-slot quality where `Exact Match + High Relevant` count as strong positives.
+- **`Useful_Precision@K`** — Broader usefulness where any non-irrelevant result counts.
+- **`Gain_Recall@K`** — Gain captured in the returned list versus the judged label pool for the query.
+- **`Exact_Success@K` / `Strong_Success@K`** — Whether at least one exact or strong result appears in the first K.
+- **`MRR_Exact@10` / `MRR_Strong@10`** — How early the first exact or strong result appears.
+- **`Avg_Grade@10`** — Average relevance grade of the visible first page.
 
 **Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
 
@@ -139,11 +163,11 @@ Default root: `artifacts/search_evaluation/`
 
 ## Web UI
 
-Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.
+Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits.
 
 ## Batch reports
 
-Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
+Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
 
 ## Caveats
 
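The README's claim that demoting an `Exact Match` costs more than demoting a `High Relevant` follows directly from the gain scheme. A standalone sketch (the `GAIN` dict and `dcg` helper here are illustrative, not part of the commit):

```python
import math

# Gains from the README's scheme: gain = 2**rel - 1 for rel grades 3/2/1/0.
GAIN = {"Exact Match": 7, "High Relevant": 3, "Low Relevant": 1, "Irrelevant": 0}

def dcg(labels):
    """Discounted cumulative gain of a ranked label list (log2 position discount)."""
    return sum(GAIN[label] / math.log2(rank + 2) for rank, label in enumerate(labels))

base = ["Exact Match", "High Relevant", "Low Relevant"]
# Swap the Exact Match at rank 1 with the Low Relevant at rank 3 ...
swap_exact_low = ["Low Relevant", "High Relevant", "Exact Match"]
# ... versus swapping High Relevant (rank 2) with Low Relevant (rank 3).
swap_high_low = ["Exact Match", "Low Relevant", "High Relevant"]

loss_exact = dcg(base) - dcg(swap_exact_low)  # demoting Exact Match: 3.0 DCG lost
loss_high = dcg(base) - dcg(swap_high_low)    # demoting High Relevant: ~0.26 DCG lost
assert loss_exact > loss_high
```

So with gains 7/3/1 the ranking penalty for burying an exact result is roughly an order of magnitude larger than for the strong/weak-substitute swap, which is exactly the behavior the section describes.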
scripts/evaluation/eval_framework/constants.py

@@ -14,8 +14,22 @@ RELEVANCE_IRRELEVANT = "Irrelevant"
 
 VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT})
 
-# Precision / MAP "positive" set (all non-irrelevant tiers)
+# Useful label sets for binary diagnostic slices layered on top of graded ranking metrics.
 RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW})
+RELEVANCE_STRONG = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH})
+
+# Graded relevance for ranking evaluation.
+# We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics.
+RELEVANCE_GRADE_MAP = {
+    RELEVANCE_EXACT: 3,
+    RELEVANCE_HIGH: 2,
+    RELEVANCE_LOW: 1,
+    RELEVANCE_IRRELEVANT: 0,
+}
+RELEVANCE_GAIN_MAP = {
+    label: (2 ** grade) - 1
+    for label, grade in RELEVANCE_GRADE_MAP.items()
+}
 
 _LEGACY_LABEL_MAP = {
     "Exact": RELEVANCE_EXACT,
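The comprehension added here expands to exactly the gains quoted in the README. A minimal self-contained check (label strings are inlined since the `RELEVANCE_*` constants live in the module):

```python
# Mirrors RELEVANCE_GRADE_MAP / RELEVANCE_GAIN_MAP from the diff, with the
# label constants inlined as plain strings for a standalone sanity check.
RELEVANCE_GRADE_MAP = {
    "Exact Match": 3,
    "High Relevant": 2,
    "Low Relevant": 1,
    "Irrelevant": 0,
}
RELEVANCE_GAIN_MAP = {
    label: (2 ** grade) - 1
    for label, grade in RELEVANCE_GRADE_MAP.items()
}

# gain = 2**rel - 1 yields 7 / 3 / 1 / 0, matching the README's gain scheme.
assert RELEVANCE_GAIN_MAP == {
    "Exact Match": 7,
    "High Relevant": 3,
    "Low Relevant": 1,
    "Irrelevant": 0,
}
```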
scripts/evaluation/eval_framework/framework.py

@@ -26,6 +26,7 @@ from .constants import (
     DEFAULT_RERANK_HIGH_THRESHOLD,
     DEFAULT_SEARCH_RECALL_TOP_K,
     RELEVANCE_EXACT,
+    RELEVANCE_GAIN_MAP,
     RELEVANCE_HIGH,
     RELEVANCE_IRRELEVANT,
     RELEVANCE_LOW,
@@ -50,6 +51,18 @@ from .utils import (
 _log = logging.getLogger("search_eval.framework")
 
 
+def _metric_context_payload() -> Dict[str, Any]:
+    return {
+        "primary_metric": "NDCG@10",
+        "gain_scheme": dict(RELEVANCE_GAIN_MAP),
+        "notes": [
+            "NDCG uses graded gains derived from the four relevance labels.",
+            "Strong metrics treat Exact Match and High Relevant as strong business positives.",
+            "Useful metrics treat any non-irrelevant item as useful recall coverage.",
+        ],
+    }
+
+
 def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]:
     """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``."""
     out: Dict[str, str] = {}
@@ -607,7 +620,7 @@ class SearchEvaluationFramework:
             item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
             for item in search_labeled_results[:100]
         ]
-        metrics = compute_query_metrics(top100_labels)
+        metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
         output_dir = ensure_dir(self.artifact_root / "query_builds")
         run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}"
         output_json_path = output_dir / f"{run_id}.json"
@@ -629,6 +642,7 @@ class SearchEvaluationFramework:
                 "pool_size": len(pool_docs),
             },
             "metrics_top100": metrics,
+            "metric_context": _metric_context_payload(),
             "search_results": search_labeled_results,
             "full_rerank_top": rerank_top_results,
         }
@@ -816,7 +830,7 @@ class SearchEvaluationFramework:
             item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
             for item in search_labeled_results[:100]
         ]
-        metrics = compute_query_metrics(top100_labels)
+        metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
         output_dir = ensure_dir(self.artifact_root / "query_builds")
         run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}"
         output_json_path = output_dir / f"{run_id}.json"
@@ -838,6 +852,7 @@ class SearchEvaluationFramework:
                 "ordered_union_size": pool_docs_count,
             },
             "metrics_top100": metrics,
+            "metric_context": _metric_context_payload(),
             "search_results": search_labeled_results,
             "full_rerank_top": rerank_top_results,
         }
@@ -897,6 +912,10 @@ class SearchEvaluationFramework:
             item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
             for item in labeled
         ]
+        ideal_labels = [
+            label if label in VALID_LABELS else RELEVANCE_IRRELEVANT
+            for label in labels.values()
+        ]
         label_stats = self.store.get_query_label_stats(self.tenant_id, query)
         rerank_scores = self.store.get_rerank_scores(self.tenant_id, query)
         relevant_missing_ids = [
@@ -947,12 +966,13 @@ class SearchEvaluationFramework:
         if unlabeled_hits:
             tips.append(f"{unlabeled_hits} recalled results were not in the annotation set and were counted as Irrelevant.")
         if not missing_relevant:
-            tips.append("No cached non-irrelevant products were missed by this recall set.")
+            tips.append("No cached judged useful products were missed by this recall set.")
         return {
             "query": query,
             "tenant_id": self.tenant_id,
             "top_k": top_k,
-            "metrics": compute_query_metrics(metric_labels),
+            "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels),
+            "metric_context": _metric_context_payload(),
             "results": labeled,
             "missing_relevant": missing_relevant,
             "label_stats": {
@@ -1004,12 +1024,12 @@ class SearchEvaluationFramework:
             )
             m = live["metrics"]
             _log.info(
-                "[batch-eval] (%s/%s) query=%r P@10=%s MAP_3=%s total_hits=%s",
+                "[batch-eval] (%s/%s) query=%r NDCG@10=%s Strong_Precision@10=%s total_hits=%s",
                 q_index,
                 total_q,
                 query,
-                m.get("P@10"),
-                m.get("MAP_3"),
+                m.get("NDCG@10"),
+                m.get("Strong_Precision@10"),
                 live.get("total"),
             )
         aggregate = aggregate_metrics([item["metrics"] for item in per_query])
@@ -1033,6 +1053,7 @@ class SearchEvaluationFramework:
             "queries": list(queries),
             "top_k": top_k,
             "aggregate_metrics": aggregate,
+            "metric_context": _metric_context_payload(),
             "aggregate_distribution": aggregate_distribution,
             "per_query": per_query,
             "config_snapshot_path": str(config_snapshot_path),
scripts/evaluation/eval_framework/metrics.py

-"""IR metrics for labeled result lists."""
+"""Ranking metrics for graded e-commerce relevance labels."""
 
 from __future__ import annotations
 
-from typing import Dict, Sequence
+import math
+from typing import Dict, Iterable, Sequence
 
-from .constants import RELEVANCE_EXACT, RELEVANCE_IRRELEVANT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_NON_IRRELEVANT
+from .constants import (
+    RELEVANCE_EXACT,
+    RELEVANCE_GAIN_MAP,
+    RELEVANCE_GRADE_MAP,
+    RELEVANCE_HIGH,
+    RELEVANCE_IRRELEVANT,
+    RELEVANCE_LOW,
+    RELEVANCE_NON_IRRELEVANT,
+    RELEVANCE_STRONG,
+)
 
 
-def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> float:
+def _normalize_label(label: str) -> str:
+    if label in RELEVANCE_GRADE_MAP:
+        return label
+    return RELEVANCE_IRRELEVANT
+
+
+def _gains_for_labels(labels: Sequence[str]) -> list[float]:
+    return [float(RELEVANCE_GAIN_MAP.get(_normalize_label(label), 0.0)) for label in labels]
+
+
+def _binary_hits(labels: Sequence[str], relevant: Iterable[str]) -> list[int]:
+    relevant_set = set(relevant)
+    return [1 if _normalize_label(label) in relevant_set else 0 for label in labels]
+
+
+def _precision_at_k_from_hits(hits: Sequence[int], k: int) -> float:
     if k <= 0:
         return 0.0
-    sliced = list(labels[:k])
+    sliced = list(hits[:k])
     if not sliced:
         return 0.0
-    rel = set(relevant)
-    hits = sum(1 for label in sliced if label in rel)
-    return hits / float(min(k, len(sliced)))
+    return sum(sliced) / float(len(sliced))
+
+
+def _success_at_k_from_hits(hits: Sequence[int], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    return 1.0 if any(hits[:k]) else 0.0
+
+
+def _reciprocal_rank_from_hits(hits: Sequence[int], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    for idx, hit in enumerate(hits[:k], start=1):
+        if hit:
+            return 1.0 / float(idx)
+    return 0.0
 
 
-def average_precision(labels: Sequence[str], relevant: Sequence[str]) -> float:
-    rel = set(relevant)
-    hit_count = 0
-    precision_sum = 0.0
-    for idx, label in enumerate(labels, start=1):
-        if label not in rel:
+def _dcg_at_k(gains: Sequence[float], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    total = 0.0
+    for idx, gain in enumerate(gains[:k], start=1):
+        if gain <= 0.0:
             continue
-        hit_count += 1
-        precision_sum += hit_count / idx
-    if hit_count == 0:
+        total += gain / math.log2(idx + 1.0)
+    return total
+
+
+def _ndcg_at_k(labels: Sequence[str], ideal_labels: Sequence[str], k: int) -> float:
+    actual_gains = _gains_for_labels(labels)
+    ideal_gains = sorted(_gains_for_labels(ideal_labels), reverse=True)
+    dcg = _dcg_at_k(actual_gains, k)
+    idcg = _dcg_at_k(ideal_gains, k)
+    if idcg <= 0.0:
+        return 0.0
+    return dcg / idcg
+
+
+def _gain_recall_at_k(labels: Sequence[str], ideal_labels: Sequence[str], k: int) -> float:
+    ideal_total_gain = sum(_gains_for_labels(ideal_labels))
+    if ideal_total_gain <= 0.0:
         return 0.0
-    return precision_sum / hit_count
+    actual_gain = sum(_gains_for_labels(labels[:k]))
+    return actual_gain / ideal_total_gain
 
 
-def compute_query_metrics(labels: Sequence[str]) -> Dict[str, float]:
-    """P@k / MAP_3: Exact Match only. P@k_2_3 / MAP_2_3: any non-irrelevant tier (legacy metric names)."""
+def _grade_avg_at_k(labels: Sequence[str], k: int) -> float:
+    if k <= 0:
+        return 0.0
+    sliced = [_normalize_label(label) for label in labels[:k]]
+    if not sliced:
+        return 0.0
+    return sum(float(RELEVANCE_GRADE_MAP.get(label, 0)) for label in sliced) / float(len(sliced))
+
+
+def compute_query_metrics(
+    labels: Sequence[str],
+    *,
+    ideal_labels: Sequence[str] | None = None,
+) -> Dict[str, float]:
+    """Compute graded ranking metrics plus binary diagnostic slices.
+
+    `labels` are the ranked results returned by search.
+    `ideal_labels` is the judged label pool for the same query; when omitted we fall back
+    to the retrieved labels, which still keeps the metrics well-defined.
+    """
+
+    ideal = list(ideal_labels) if ideal_labels is not None else list(labels)
     metrics: Dict[str, float] = {}
-    non_irrel = list(RELEVANCE_NON_IRRELEVANT)
+
+    exact_hits = _binary_hits(labels, [RELEVANCE_EXACT])
+    strong_hits = _binary_hits(labels, RELEVANCE_STRONG)
+    useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT)
+
     for k in (5, 10, 20, 50):
-        metrics[f"P@{k}"] = round(precision_at_k(labels, k, [RELEVANCE_EXACT]), 6)
-        metrics[f"P@{k}_2_3"] = round(precision_at_k(labels, k, non_irrel), 6)
-    metrics["MAP_3"] = round(average_precision(labels, [RELEVANCE_EXACT]), 6)
-    metrics["MAP_2_3"] = round(average_precision(labels, non_irrel), 6)
+        metrics[f"NDCG@{k}"] = round(_ndcg_at_k(labels, ideal, k), 6)
+    for k in (5, 10, 20):
+        metrics[f"Exact_Precision@{k}"] = round(_precision_at_k_from_hits(exact_hits, k), 6)
+        metrics[f"Strong_Precision@{k}"] = round(_precision_at_k_from_hits(strong_hits, k), 6)
+    for k in (10, 20, 50):
+        metrics[f"Useful_Precision@{k}"] = round(_precision_at_k_from_hits(useful_hits, k), 6)
+        metrics[f"Gain_Recall@{k}"] = round(_gain_recall_at_k(labels, ideal, k), 6)
+    for k in (5, 10):
+        metrics[f"Exact_Success@{k}"] = round(_success_at_k_from_hits(exact_hits, k), 6)
+        metrics[f"Strong_Success@{k}"] = round(_success_at_k_from_hits(strong_hits, k), 6)
+    metrics["MRR_Exact@10"] = round(_reciprocal_rank_from_hits(exact_hits, 10), 6)
+    metrics["MRR_Strong@10"] = round(_reciprocal_rank_from_hits(strong_hits, 10), 6)
+    metrics["Avg_Grade@10"] = round(_grade_avg_at_k(labels, 10), 6)
     return metrics
 
 
 def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -> Dict[str, float]:
     if not metric_items:
         return {}
-    keys = sorted(metric_items[0].keys())
+    all_keys = sorted({key for item in metric_items for key in item.keys()})
     return {
         key: round(sum(float(item.get(key, 0.0)) for item in metric_items) / len(metric_items), 6)
-        for key in keys
+        for key in all_keys
     }
 
 
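The `_ndcg_at_k` semantics in this file (normalize against the best ordering of the judged pool, return `0.0` when the pool has no gain) can be exercised in isolation. This sketch reimplements just that path with a hard-coded gain dict rather than importing the module:

```python
import math

# Same gains the constants module derives from grades 3/2/1/0.
GAIN = {"Exact Match": 7, "High Relevant": 3, "Low Relevant": 1, "Irrelevant": 0}

def ndcg_at_k(labels, ideal_labels, k):
    """NDCG@k: DCG of the ranked list over DCG of the ideal ordering of the pool."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    actual = [GAIN.get(label, 0) for label in labels]  # unknown labels count as 0 gain
    ideal = sorted((GAIN.get(label, 0) for label in ideal_labels), reverse=True)
    idcg = dcg(ideal)
    return dcg(actual) / idcg if idcg > 0 else 0.0

pool = ["Exact Match", "High Relevant", "Low Relevant", "Irrelevant"]
assert ndcg_at_k(pool, pool, 10) == 1.0  # perfect ordering of its own pool
assert ndcg_at_k(["Irrelevant", "Exact Match"], ["Exact Match", "Irrelevant"], 10) < 1.0
assert ndcg_at_k(["Irrelevant"], ["Irrelevant"], 10) == 0.0  # no-gain pool guard
```

Passing the judged pool as `ideal_labels` is what makes the new `compute_query_metrics(..., ideal_labels=...)` call sites in `framework.py` meaningful: NDCG is then measured against the best achievable ranking, not just a reshuffle of what was retrieved.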
scripts/evaluation/eval_framework/reports.py

@@ -7,6 +7,19 @@ from typing import Any, Dict
 from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW
 
 
+def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -> None:
+    primary_keys = ("NDCG@5", "NDCG@10", "NDCG@20", "Exact_Precision@10", "Strong_Precision@10", "Gain_Recall@50")
+    included = set()
+    for key in primary_keys:
+        if key in metrics:
+            lines.append(f"- {key}: {metrics[key]}")
+            included.add(key)
+    for key, value in sorted(metrics.items()):
+        if key in included:
+            continue
+        lines.append(f"- {key}: {value}")
+
+
 def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
     lines = [
         "# Search Batch Evaluation",
@@ -20,8 +33,16 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
         "## Aggregate Metrics",
         "",
     ]
-    for key, value in sorted((payload.get("aggregate_metrics") or {}).items()):
-        lines.append(f"- {key}: {value}")
+    metric_context = payload.get("metric_context") or {}
+    if metric_context:
+        lines.extend(
+            [
+                f"- Primary metric: {metric_context.get('primary_metric', 'N/A')}",
+                f"- Gain scheme: {metric_context.get('gain_scheme', {})}",
+                "",
+            ]
+        )
+    _append_metric_block(lines, payload.get("aggregate_metrics") or {})
     distribution = payload.get("aggregate_distribution") or {}
     if distribution:
         lines.extend(
@@ -39,8 +60,7 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
     for item in payload.get("per_query") or []:
         lines.append(f"### {item['query']}")
         lines.append("")
-        for key, value in sorted((item.get("metrics") or {}).items()):
-            lines.append(f"- {key}: {value}")
+        _append_metric_block(lines, item.get("metrics") or {})
         distribution = item.get("distribution") or {}
         lines.append(f"- Exact Match: {distribution.get(RELEVANCE_EXACT, 0)}")
         lines.append(f"- High Relevant: {distribution.get(RELEVANCE_HIGH, 0)}")
scripts/evaluation/eval_framework/static/eval_web.css

@@ -6,7 +6,8 @@
   --line: #ddd4c6;
   --accent: #0f766e;
   --exact: #0f766e;
-  --partial: #b7791f;
+  --high: #b7791f;
+  --low: #3b82a0;
   --irrelevant: #b42318;
 }
 body { margin: 0; font-family: "IBM Plex Sans", "Segoe UI", sans-serif; color: var(--ink); background:
@@ -29,6 +30,12 @@
 button { border: 0; background: var(--accent); color: white; padding: 12px 16px; border-radius: 14px; cursor: pointer; font-weight: 600; }
 button.secondary { background: #d9e6e3; color: #12433d; }
 .grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(170px, 1fr)); gap: 12px; margin-bottom: 16px; }
+.metric-context { margin: 0 0 12px; line-height: 1.5; }
+.metric-section { margin-bottom: 18px; }
+.metric-section-head { display: flex; align-items: baseline; justify-content: space-between; gap: 12px; margin-bottom: 10px; }
+.metric-section-head h3 { margin: 0; font-size: 14px; color: #12433d; }
+.metric-section-head p { margin: 0; color: var(--muted); font-size: 12px; }
+.metric-grid { margin-bottom: 0; }
 .metric { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 14px; }
 .metric .label { font-size: 12px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.04em; }
 .metric .value { font-size: 24px; font-weight: 700; margin-top: 4px; }
@@ -36,8 +43,8 @@
 .result { display: grid; grid-template-columns: 110px 100px 1fr; gap: 14px; align-items: center; background: var(--panel); border: 1px solid var(--line); border-radius: 18px; padding: 12px; }
 .badge { display: inline-block; padding: 8px 10px; border-radius: 999px; color: white; font-weight: 700; text-align: center; }
 .label-exact-match { background: var(--exact); }
-.label-high-relevant { background: var(--partial); }
-.label-low-relevant { background: #6b5b95; }
+.label-high-relevant { background: var(--high); }
+.label-low-relevant { background: var(--low); }
 .label-irrelevant { background: var(--irrelevant); }
 .badge-unknown { background: #637381; }
 .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; }
@@ -91,3 +98,13 @@
 .report-modal-body.report-modal-loading, .report-modal-body.report-modal-error { color: var(--muted); font-style: italic; }
 .tips { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 14px; line-height: 1.6; }
 .tip { margin-bottom: 6px; color: var(--muted); }
+@media (max-width: 960px) {
+  .app { grid-template-columns: 1fr; }
+  .sidebar { border-right: 0; border-bottom: 1px solid var(--line); }
+  .metric-section-head { flex-direction: column; align-items: flex-start; }
+}
+@media (max-width: 640px) {
+  .main, .sidebar { padding: 16px; }
+  .result { grid-template-columns: 1fr; }
+  .thumb { width: 100%; max-width: 180px; height: auto; aspect-ratio: 1 / 1; }
+}
scripts/evaluation/eval_framework/static/eval_web.js
| 1 | - async function fetchJSON(url, options) { | ||
| 2 | - const res = await fetch(url, options); | ||
| 3 | - if (!res.ok) throw new Error(await res.text()); | ||
| 4 | - return await res.json(); | ||
| 5 | - } | ||
| 6 | - function renderMetrics(metrics) { | ||
| 7 | - const root = document.getElementById('metrics'); | ||
| 8 | - root.innerHTML = ''; | ||
| 9 | - Object.entries(metrics || {}).forEach(([key, value]) => { | ||
| 10 | - const card = document.createElement('div'); | ||
| 11 | - card.className = 'metric'; | ||
| 12 | - card.innerHTML = `<div class="label">${key}</div><div class="value">${value}</div>`; | ||
| 13 | - root.appendChild(card); | ||
| 14 | - }); | ||
| 15 | - } | ||
| 16 | - function labelBadgeClass(label) { | ||
| 17 | - if (!label || label === 'Unknown') return 'badge-unknown'; | ||
| 18 | - return 'label-' + String(label).toLowerCase().replace(/\s+/g, '-'); | ||
| 19 | - } | ||
| 20 | - function renderResults(results, rootId='results', showRank=true) { | ||
| 21 | - const mount = document.getElementById(rootId); | ||
| 22 | - mount.innerHTML = ''; | ||
| 23 | - (results || []).forEach(item => { | ||
| 24 | - const label = item.label || 'Unknown'; | ||
| 25 | - const box = document.createElement('div'); | ||
| 26 | - box.className = 'result'; | ||
| 27 | - box.innerHTML = ` | ||
| 28 | - <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || '-'}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : 'not recalled')}</div></div> | ||
| 29 | - <img class="thumb" src="${item.image_url || ''}" alt="" /> | ||
| 30 | - <div> | ||
| 31 | - <div class="title">${item.title || ''}</div> | ||
| 32 | - ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''} | ||
| 33 | - <div class="options"> | ||
| 34 | - <div>${(item.option_values || [])[0] || ''}</div> | ||
| 35 | - <div>${(item.option_values || [])[1] || ''}</div> | ||
| 36 | - <div>${(item.option_values || [])[2] || ''}</div> | ||
| 37 | - </div> | ||
| 38 | - </div>`; | ||
| 39 | - mount.appendChild(box); | ||
| 40 | - }); | ||
| 41 | - if (!(results || []).length) { | ||
| 42 | - mount.innerHTML = '<div class="muted">None.</div>'; | ||
| 43 | - } | ||
| 44 | - } | ||
| 45 | - function renderTips(data) { | ||
| 46 | - const root = document.getElementById('tips'); | ||
| 47 | - const tips = [...(data.tips || [])]; | ||
| 48 | - const stats = data.label_stats || {}; | ||
| 49 | - tips.unshift(`Cached labels for query: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed (non-irrelevant): ${stats.missing_relevant_count || 0} — Exact: ${stats.missing_exact_count || 0}, High: ${stats.missing_high_count || 0}, Low: ${stats.missing_low_count || 0}.`); | ||
| 50 | - root.innerHTML = tips.map(text => `<div class="tip">${text}</div>`).join(''); | ||
| 51 | - } | ||
| 52 | - async function loadQueries() { | ||
| 53 | - const data = await fetchJSON('/api/queries'); | ||
| 54 | - const root = document.getElementById('queryList'); | ||
| 55 | - root.innerHTML = ''; | ||
| 56 | - data.queries.forEach(query => { | ||
| 57 | - const btn = document.createElement('button'); | ||
| 58 | - btn.className = 'query-item'; | ||
| 59 | - btn.textContent = query; | ||
| 60 | - btn.onclick = () => { | ||
| 61 | - document.getElementById('queryInput').value = query; | ||
| 62 | - runSingle(); | ||
| 63 | - }; | ||
| 64 | - root.appendChild(btn); | ||
| 65 | - }); | ||
| 66 | - } | ||
| 67 | - function fmtMetric(m, key, digits) { | ||
| 68 | - const v = m && m[key]; | ||
| 69 | - if (v == null || Number.isNaN(Number(v))) return null; | ||
| 70 | - const n = Number(v); | ||
| 71 | - return n.toFixed(digits); | ||
| 72 | - } | ||
| 73 | - function historySummaryHtml(meta) { | ||
| 74 | - const m = meta && meta.aggregate_metrics; | ||
| 75 | - const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; | ||
| 76 | - const parts = []; | ||
| 77 | - if (nq != null) parts.push(`<span>Queries</span> ${nq}`); | ||
| 78 | - const p10 = fmtMetric(m, 'P@10', 3); | ||
| 79 | - const p52 = fmtMetric(m, 'P@5_2_3', 3); | ||
| 80 | - const map3 = fmtMetric(m, 'MAP_3', 3); | ||
| 81 | - if (p10) parts.push(`<span>P@10</span> ${p10}`); | ||
| 82 | - if (p52) parts.push(`<span>P@5_2_3</span> ${p52}`); | ||
| 83 | - if (map3) parts.push(`<span>MAP_3</span> ${map3}`); | ||
| 84 | - if (!parts.length) return ''; | ||
| 85 | - return `<div class="hstats">${parts.join(' · ')}</div>`; | ||
| 86 | - } | ||
| 87 | - async function loadHistory() { | ||
| 88 | - const data = await fetchJSON('/api/history'); | ||
| 89 | - const root = document.getElementById('history'); | ||
| 90 | - root.classList.remove('muted'); | ||
| 91 | - const items = data.history || []; | ||
| 92 | - if (!items.length) { | ||
| 93 | - root.innerHTML = '<span class="muted">No history yet.</span>'; | ||
| 94 | - return; | ||
| 95 | - } | ||
| 96 | - root.innerHTML = `<div class="history-list"></div>`; | ||
| 97 | - const list = root.querySelector('.history-list'); | ||
| 98 | - items.forEach(item => { | ||
| 99 | - const btn = document.createElement('button'); | ||
| 100 | - btn.type = 'button'; | ||
| 101 | - btn.className = 'history-item'; | ||
| 102 | - btn.setAttribute('aria-label', `Open report ${item.batch_id}`); | ||
| 103 | - const sum = historySummaryHtml(item.metadata); | ||
| 104 | - btn.innerHTML = `<div class="hid">${item.batch_id}</div> | ||
| 105 | - <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`; | ||
| 106 | - btn.onclick = () => openBatchReport(item.batch_id); | ||
| 107 | - list.appendChild(btn); | ||
| 108 | - }); | ||
| 109 | - } | ||
| 110 | - let _lastReportPath = ''; | ||
| 111 | - function closeReportModal() { | ||
| 112 | - const el = document.getElementById('reportModal'); | ||
| 113 | - el.classList.remove('is-open'); | ||
| 114 | - el.setAttribute('aria-hidden', 'true'); | ||
| 115 | - document.getElementById('reportModalBody').innerHTML = ''; | ||
| 116 | - document.getElementById('reportModalMeta').textContent = ''; | ||
| 117 | - } | ||
| 118 | - async function openBatchReport(batchId) { | ||
| 119 | - const el = document.getElementById('reportModal'); | ||
| 120 | - const body = document.getElementById('reportModalBody'); | ||
| 121 | - const metaEl = document.getElementById('reportModalMeta'); | ||
| 122 | - const titleEl = document.getElementById('reportModalTitle'); | ||
| 123 | - el.classList.add('is-open'); | ||
| 124 | - el.setAttribute('aria-hidden', 'false'); | ||
| 125 | - titleEl.textContent = batchId; | ||
| 126 | - metaEl.textContent = ''; | ||
| 127 | - body.className = 'report-modal-body batch-report-md report-modal-loading'; | ||
| 128 | - body.textContent = 'Loading report…'; | ||
| 129 | - try { | ||
| 130 | - const rep = await fetchJSON('/api/history/' + encodeURIComponent(batchId) + '/report'); | ||
| 131 | - _lastReportPath = rep.report_markdown_path || ''; | ||
| 132 | - metaEl.textContent = rep.report_markdown_path || ''; | ||
| 133 | - const raw = marked.parse(rep.markdown || '', { gfm: true }); | ||
| 134 | - const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } }); | ||
| 135 | - body.className = 'report-modal-body batch-report-md'; | ||
| 136 | - body.innerHTML = safe; | ||
| 137 | - } catch (e) { | ||
| 138 | - body.className = 'report-modal-body report-modal-error'; | ||
| 139 | - body.textContent = (e && e.message) ? e.message : String(e); | ||
| 140 | - } | ||
| 141 | - } | ||
| 142 | - document.getElementById('reportModal').addEventListener('click', (ev) => { | ||
| 143 | - if (ev.target && ev.target.getAttribute('data-close-report') === '1') closeReportModal(); | 1 | +async function fetchJSON(url, options) { |
| 2 | + const res = await fetch(url, options); | ||
| 3 | + if (!res.ok) throw new Error(await res.text()); | ||
| 4 | + return await res.json(); | ||
| 5 | +} | ||
| 6 | + | ||
| 7 | +function fmtNumber(value, digits = 3) { | ||
| 8 | + if (value == null || Number.isNaN(Number(value))) return "-"; | ||
| 9 | + return Number(value).toFixed(digits); | ||
| 10 | +} | ||
| 11 | + | ||
| 12 | +function metricSections(metrics) { | ||
| 13 | + const groups = [ | ||
| 14 | + { | ||
| 15 | + title: "Primary Ranking", | ||
| 16 | + keys: ["NDCG@5", "NDCG@10", "NDCG@20", "NDCG@50"], | ||
| 17 | + description: "Graded ranking quality across the four relevance tiers.", | ||
| 18 | + }, | ||
| 19 | + { | ||
| 20 | + title: "Top Slot Quality", | ||
| 21 | + keys: ["Exact_Precision@5", "Exact_Precision@10", "Strong_Precision@5", "Strong_Precision@10", "Strong_Precision@20"], | ||
| 22 | + description: "Share of the visible top positions filled by exact or strongly relevant results.", | ||
| 23 | + }, | ||
| 24 | + { | ||
| 25 | + title: "Recall Coverage", | ||
| 26 | + keys: ["Useful_Precision@10", "Useful_Precision@20", "Useful_Precision@50", "Gain_Recall@10", "Gain_Recall@20", "Gain_Recall@50"], | ||
| 27 | + description: "How much judged relevance is captured in the returned list.", | ||
| 28 | + }, | ||
| 29 | + { | ||
| 30 | + title: "First Good Result", | ||
| 31 | + keys: ["Exact_Success@5", "Exact_Success@10", "Strong_Success@5", "Strong_Success@10", "MRR_Exact@10", "MRR_Strong@10", "Avg_Grade@10"], | ||
| 32 | + description: "Whether users see a good result early and how good the top page feels overall.", | ||
| 33 | + }, | ||
| 34 | + ]; | ||
| 35 | + const seen = new Set(); | ||
| 36 | + return groups | ||
| 37 | + .map((group) => { | ||
| 38 | + const items = group.keys | ||
| 39 | + .filter((key) => metrics && Object.prototype.hasOwnProperty.call(metrics, key)) | ||
| 40 | + .map((key) => { | ||
| 41 | + seen.add(key); | ||
| 42 | + return [key, metrics[key]]; | ||
| 43 | + }); | ||
| 44 | + return { ...group, items }; | ||
| 45 | + }) | ||
| 46 | + .filter((group) => group.items.length) | ||
| 47 | + .concat( | ||
| 48 | + (() => { | ||
| 49 | + const rest = Object.entries(metrics || {}).filter(([key]) => !seen.has(key)); | ||
| 50 | + return rest.length | ||
| 51 | + ? [{ title: "Other Metrics", description: "", items: rest }] | ||
| 52 | + : []; | ||
| 53 | + })() | ||
| 54 | + ); | ||
| 55 | +} | ||
| 56 | + | ||
| 57 | +function renderMetrics(metrics, metricContext) { | ||
| 58 | + const root = document.getElementById("metrics"); | ||
| 59 | + root.innerHTML = ""; | ||
| 60 | + const ctx = document.getElementById("metricContext"); | ||
| 61 | + const gainScheme = metricContext && metricContext.gain_scheme; | ||
| 62 | + const primary = metricContext && metricContext.primary_metric; | ||
| 63 | + ctx.textContent = primary | ||
| 64 | + ? `Primary metric: ${primary}. Gain scheme: ${Object.entries(gainScheme || {}).map(([label, gain]) => `${label}=${gain}`).join(", ")}.` | ||
| 65 | + : ""; | ||
| 66 | + | ||
| 67 | + metricSections(metrics || {}).forEach((section) => { | ||
| 68 | + const wrap = document.createElement("section"); | ||
| 69 | + wrap.className = "metric-section"; | ||
| 70 | + wrap.innerHTML = ` | ||
| 71 | + <div class="metric-section-head"> | ||
| 72 | + <h3>${section.title}</h3> | ||
| 73 | + ${section.description ? `<p>${section.description}</p>` : ""} | ||
| 74 | + </div> | ||
| 75 | + <div class="grid metric-grid"></div> | ||
| 76 | + `; | ||
| 77 | + const grid = wrap.querySelector(".metric-grid"); | ||
| 78 | + section.items.forEach(([key, value]) => { | ||
| 79 | + const card = document.createElement("div"); | ||
| 80 | + card.className = "metric"; | ||
| 81 | + card.innerHTML = `<div class="label">${key}</div><div class="value">${fmtNumber(value)}</div>`; | ||
| 82 | + grid.appendChild(card); | ||
| 144 | }); | 83 | }); |
| 145 | - document.addEventListener('keydown', (ev) => { | ||
| 146 | - if (ev.key === 'Escape') closeReportModal(); | ||
| 147 | - }); | ||
| 148 | - document.getElementById('reportCopyPath').addEventListener('click', async () => { | ||
| 149 | - if (!_lastReportPath) return; | ||
| 150 | - try { | ||
| 151 | - await navigator.clipboard.writeText(_lastReportPath); | ||
| 152 | - } catch (_) {} | ||
| 153 | - }); | ||
| 154 | - async function runSingle() { | ||
| 155 | - const query = document.getElementById('queryInput').value.trim(); | ||
| 156 | - if (!query) return; | ||
| 157 | - document.getElementById('status').textContent = `Evaluating "${query}"...`; | ||
| 158 | - const data = await fetchJSON('/api/search-eval', { | ||
| 159 | - method: 'POST', | ||
| 160 | - headers: {'Content-Type': 'application/json'}, | ||
| 161 | - body: JSON.stringify({query, top_k: 100, auto_annotate: false}) | ||
| 162 | - }); | ||
| 163 | - document.getElementById('status').textContent = `Done. total=${data.total}`; | ||
| 164 | - renderMetrics(data.metrics); | ||
| 165 | - renderResults(data.results, 'results', true); | ||
| 166 | - renderResults(data.missing_relevant, 'missingRelevant', false); | ||
| 167 | - renderTips(data); | ||
| 168 | - loadHistory(); | ||
| 169 | - } | ||
| 170 | - async function runBatch() { | ||
| 171 | - document.getElementById('status').textContent = 'Running batch evaluation...'; | ||
| 172 | - const data = await fetchJSON('/api/batch-eval', { | ||
| 173 | - method: 'POST', | ||
| 174 | - headers: {'Content-Type': 'application/json'}, | ||
| 175 | - body: JSON.stringify({top_k: 100, auto_annotate: false}) | ||
| 176 | - }); | ||
| 177 | - document.getElementById('status').textContent = `Batch done. report=${data.batch_id}`; | ||
| 178 | - renderMetrics(data.aggregate_metrics); | ||
| 179 | - renderResults([], 'results', true); | ||
| 180 | - renderResults([], 'missingRelevant', false); | ||
| 181 | - document.getElementById('tips').innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>'; | ||
| 182 | - loadHistory(); | ||
| 183 | - } | ||
| 184 | - loadQueries(); | ||
| 185 | - loadHistory(); | ||
| 186 | - | 84 | + root.appendChild(wrap); |
| 85 | + }); | ||
| 86 | +} | ||
| 87 | + | ||
| 88 | +function labelBadgeClass(label) { | ||
| 89 | + if (!label || label === "Unknown") return "badge-unknown"; | ||
| 90 | + return "label-" + String(label).toLowerCase().replace(/\s+/g, "-"); | ||
| 91 | +} | ||
| 92 | + | ||
| 93 | +function renderResults(results, rootId = "results", showRank = true) { | ||
| 94 | + const mount = document.getElementById(rootId); | ||
| 95 | + mount.innerHTML = ""; | ||
| 96 | + (results || []).forEach((item) => { | ||
| 97 | + const label = item.label || "Unknown"; | ||
| 98 | + const box = document.createElement("div"); | ||
| 99 | + box.className = "result"; | ||
| 100 | + box.innerHTML = ` | ||
| 101 | + <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || "-"}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : "not recalled")}</div></div> | ||
| 102 | + <img class="thumb" src="${item.image_url || ""}" alt="" /> | ||
| 103 | + <div> | ||
| 104 | + <div class="title">${item.title || ""}</div> | ||
| 105 | + ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ""} | ||
| 106 | + <div class="options"> | ||
| 107 | + <div>${(item.option_values || [])[0] || ""}</div> | ||
| 108 | + <div>${(item.option_values || [])[1] || ""}</div> | ||
| 109 | + <div>${(item.option_values || [])[2] || ""}</div> | ||
| 110 | + </div> | ||
| 111 | + </div>`; | ||
| 112 | + mount.appendChild(box); | ||
| 113 | + }); | ||
| 114 | + if (!(results || []).length) { | ||
| 115 | + mount.innerHTML = '<div class="muted">None.</div>'; | ||
| 116 | + } | ||
| 117 | +} | ||
| 118 | + | ||
| 119 | +function renderTips(data) { | ||
| 120 | + const root = document.getElementById("tips"); | ||
| 121 | + const tips = [...(data.tips || [])]; | ||
| 122 | + const stats = data.label_stats || {}; | ||
| 123 | + tips.unshift( | ||
| 124 | + `Cached labels: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Judged useful but not recalled: ${stats.missing_relevant_count || 0} (Exact ${stats.missing_exact_count || 0}, High ${stats.missing_high_count || 0}, Low ${stats.missing_low_count || 0}).` | ||
| 125 | + ); | ||
| 126 | + root.innerHTML = tips.map((text) => `<div class="tip">${text}</div>`).join(""); | ||
| 127 | +} | ||
| 128 | + | ||
| 129 | +async function loadQueries() { | ||
| 130 | + const data = await fetchJSON("/api/queries"); | ||
| 131 | + const root = document.getElementById("queryList"); | ||
| 132 | + root.innerHTML = ""; | ||
| 133 | + data.queries.forEach((query) => { | ||
| 134 | + const btn = document.createElement("button"); | ||
| 135 | + btn.className = "query-item"; | ||
| 136 | + btn.textContent = query; | ||
| 137 | + btn.onclick = () => { | ||
| 138 | + document.getElementById("queryInput").value = query; | ||
| 139 | + runSingle(); | ||
| 140 | + }; | ||
| 141 | + root.appendChild(btn); | ||
| 142 | + }); | ||
| 143 | +} | ||
| 144 | + | ||
| 145 | +function historySummaryHtml(meta) { | ||
| 146 | + const m = meta && meta.aggregate_metrics; | ||
| 147 | + const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; | ||
| 148 | + const parts = []; | ||
| 149 | + if (nq != null) parts.push(`<span>Queries</span> ${nq}`); | ||
| 150 | + if (m && m["NDCG@10"] != null) parts.push(`<span>NDCG@10</span> ${fmtNumber(m["NDCG@10"])}`); | ||
| 151 | + if (m && m["Strong_Precision@10"] != null) parts.push(`<span>Strong@10</span> ${fmtNumber(m["Strong_Precision@10"])}`); | ||
| 152 | + if (m && m["Gain_Recall@50"] != null) parts.push(`<span>Gain Recall@50</span> ${fmtNumber(m["Gain_Recall@50"])}`); | ||
| 153 | + if (!parts.length) return ""; | ||
| 154 | + return `<div class="hstats">${parts.join(" · ")}</div>`; | ||
| 155 | +} | ||
| 156 | + | ||
| 157 | +async function loadHistory() { | ||
| 158 | + const data = await fetchJSON("/api/history"); | ||
| 159 | + const root = document.getElementById("history"); | ||
| 160 | + root.classList.remove("muted"); | ||
| 161 | + const items = data.history || []; | ||
| 162 | + if (!items.length) { | ||
| 163 | + root.innerHTML = '<span class="muted">No history yet.</span>'; | ||
| 164 | + return; | ||
| 165 | + } | ||
| 166 | + root.innerHTML = `<div class="history-list"></div>`; | ||
| 167 | + const list = root.querySelector(".history-list"); | ||
| 168 | + items.forEach((item) => { | ||
| 169 | + const btn = document.createElement("button"); | ||
| 170 | + btn.type = "button"; | ||
| 171 | + btn.className = "history-item"; | ||
| 172 | + btn.setAttribute("aria-label", `Open report ${item.batch_id}`); | ||
| 173 | + const sum = historySummaryHtml(item.metadata); | ||
| 174 | + btn.innerHTML = `<div class="hid">${item.batch_id}</div> | ||
| 175 | + <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`; | ||
| 176 | + btn.onclick = () => openBatchReport(item.batch_id); | ||
| 177 | + list.appendChild(btn); | ||
| 178 | + }); | ||
| 179 | +} | ||
| 180 | + | ||
| 181 | +let _lastReportPath = ""; | ||
| 182 | + | ||
| 183 | +function closeReportModal() { | ||
| 184 | + const el = document.getElementById("reportModal"); | ||
| 185 | + el.classList.remove("is-open"); | ||
| 186 | + el.setAttribute("aria-hidden", "true"); | ||
| 187 | + document.getElementById("reportModalBody").innerHTML = ""; | ||
| 188 | + document.getElementById("reportModalMeta").textContent = ""; | ||
| 189 | +} | ||
| 190 | + | ||
| 191 | +async function openBatchReport(batchId) { | ||
| 192 | + const el = document.getElementById("reportModal"); | ||
| 193 | + const body = document.getElementById("reportModalBody"); | ||
| 194 | + const metaEl = document.getElementById("reportModalMeta"); | ||
| 195 | + const titleEl = document.getElementById("reportModalTitle"); | ||
| 196 | + el.classList.add("is-open"); | ||
| 197 | + el.setAttribute("aria-hidden", "false"); | ||
| 198 | + titleEl.textContent = batchId; | ||
| 199 | + metaEl.textContent = ""; | ||
| 200 | + body.className = "report-modal-body batch-report-md report-modal-loading"; | ||
| 201 | + body.textContent = "Loading report…"; | ||
| 202 | + try { | ||
| 203 | + const rep = await fetchJSON("/api/history/" + encodeURIComponent(batchId) + "/report"); | ||
| 204 | + _lastReportPath = rep.report_markdown_path || ""; | ||
| 205 | + metaEl.textContent = rep.report_markdown_path || ""; | ||
| 206 | + const raw = marked.parse(rep.markdown || "", { gfm: true }); | ||
| 207 | + const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } }); | ||
| 208 | + body.className = "report-modal-body batch-report-md"; | ||
| 209 | + body.innerHTML = safe; | ||
| 210 | + } catch (e) { | ||
| 211 | + body.className = "report-modal-body report-modal-error"; | ||
| 212 | + body.textContent = e && e.message ? e.message : String(e); | ||
| 213 | + } | ||
| 214 | +} | ||
| 215 | + | ||
| 216 | +document.getElementById("reportModal").addEventListener("click", (ev) => { | ||
| 217 | + if (ev.target && ev.target.getAttribute("data-close-report") === "1") closeReportModal(); | ||
| 218 | +}); | ||
| 219 | + | ||
| 220 | +document.addEventListener("keydown", (ev) => { | ||
| 221 | + if (ev.key === "Escape") closeReportModal(); | ||
| 222 | +}); | ||
| 223 | + | ||
| 224 | +document.getElementById("reportCopyPath").addEventListener("click", async () => { | ||
| 225 | + if (!_lastReportPath) return; | ||
| 226 | + try { | ||
| 227 | + await navigator.clipboard.writeText(_lastReportPath); | ||
| 228 | + } catch (_) {} | ||
| 229 | +}); | ||
| 230 | + | ||
| 231 | +async function runSingle() { | ||
| 232 | + const query = document.getElementById("queryInput").value.trim(); | ||
| 233 | + if (!query) return; | ||
| 234 | + document.getElementById("status").textContent = `Evaluating "${query}"...`; | ||
| 235 | + const data = await fetchJSON("/api/search-eval", { | ||
| 236 | + method: "POST", | ||
| 237 | + headers: { "Content-Type": "application/json" }, | ||
| 238 | + body: JSON.stringify({ query, top_k: 100, auto_annotate: false }), | ||
| 239 | + }); | ||
| 240 | + document.getElementById("status").textContent = `Done. total=${data.total}`; | ||
| 241 | + renderMetrics(data.metrics, data.metric_context); | ||
| 242 | + renderResults(data.results, "results", true); | ||
| 243 | + renderResults(data.missing_relevant, "missingRelevant", false); | ||
| 244 | + renderTips(data); | ||
| 245 | + loadHistory(); | ||
| 246 | +} | ||
| 247 | + | ||
| 248 | +async function runBatch() { | ||
| 249 | + document.getElementById("status").textContent = "Running batch evaluation..."; | ||
| 250 | + const data = await fetchJSON("/api/batch-eval", { | ||
| 251 | + method: "POST", | ||
| 252 | + headers: { "Content-Type": "application/json" }, | ||
| 253 | + body: JSON.stringify({ top_k: 100, auto_annotate: false }), | ||
| 254 | + }); | ||
| 255 | + document.getElementById("status").textContent = `Batch done. report=${data.batch_id}`; | ||
| 256 | + renderMetrics(data.aggregate_metrics, data.metric_context); | ||
| 257 | + renderResults([], "results", true); | ||
| 258 | + renderResults([], "missingRelevant", false); | ||
| 259 | + document.getElementById("tips").innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>'; | ||
| 260 | + loadHistory(); | ||
| 261 | +} | ||
| 262 | + | ||
| 263 | +loadQueries(); | ||
| 264 | +loadHistory(); |
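The rewritten `renderMetrics` above reads `primary_metric` and `gain_scheme` from the response's `metric_context`. A minimal sketch of the payload shape it expects — field names mirror the accessors in `eval_web.js`, while the concrete gain values are illustrative assumptions, not the service's actual configuration:

```javascript
// Hypothetical metric_context payload consumed by renderMetrics().
// Field names match the accessors in eval_web.js; the gain values are
// illustrative assumptions for the four relevance tiers.
const metricContext = {
  primary_metric: "NDCG@10",
  gain_scheme: {
    "Exact Match": 3,
    "High Relevant": 2,
    "Low Relevant": 1,
    "Irrelevant": 0,
  },
};

// Headline string renderMetrics() builds from this context.
const headline = `Primary metric: ${metricContext.primary_metric}. Gain scheme: ${Object.entries(metricContext.gain_scheme)
  .map(([label, gain]) => `${label}=${gain}`)
  .join(", ")}.`;
```

If the API omits `primary_metric`, the code above leaves the context line empty, so the payload is optional for backward compatibility.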
scripts/evaluation/eval_framework/static/index.html
| @@ -30,6 +30,7 @@ | @@ -30,6 +30,7 @@ | ||
| 30 | <div id="status" class="muted section"></div> | 30 | <div id="status" class="muted section"></div> |
| 31 | <section class="section"> | 31 | <section class="section"> |
| 32 | <h2>Metrics</h2> | 32 | <h2>Metrics</h2> |
| 33 | + <p id="metricContext" class="muted metric-context"></p> | ||
| 33 | <div id="metrics" class="grid"></div> | 34 | <div id="metrics" class="grid"></div> |
| 34 | </section> | 35 | </section> |
| 35 | <section class="section"> | 36 | <section class="section"> |
| @@ -37,7 +38,7 @@ | @@ -37,7 +38,7 @@ | ||
| 37 | <div id="results" class="results"></div> | 38 | <div id="results" class="results"></div> |
| 38 | </section> | 39 | </section> |
| 39 | <section class="section"> | 40 | <section class="section"> |
| 40 | - <h2>Missed non-irrelevant (cached)</h2> | 41 | + <h2>Missed results judged useful (cached)</h2> |
| 41 | <div id="missingRelevant" class="results"></div> | 42 | <div id="missingRelevant" class="results"></div> |
| 42 | </section> | 43 | </section> |
| 43 | <section class="section"> | 44 | <section class="section"> |
| @@ -67,4 +68,4 @@ | @@ -67,4 +68,4 @@ | ||
| 67 | <script src="https://cdn.jsdelivr.net/npm/dompurify@3.1.6/dist/purify.min.js"></script> | 68 | <script src="https://cdn.jsdelivr.net/npm/dompurify@3.1.6/dist/purify.min.js"></script> |
| 68 | <script src="/static/eval_web.js"></script> | 69 | <script src="/static/eval_web.js"></script> |
| 69 | </body> | 70 | </body> |
| 70 | -</html> | ||
| 71 | \ No newline at end of file | 71 | \ No newline at end of file |
| 72 | +</html> |
scripts/evaluation/tune_fusion.py
| @@ -150,7 +150,7 @@ def render_markdown(summary: Dict[str, Any]) -> str: | @@ -150,7 +150,7 @@ def render_markdown(summary: Dict[str, Any]) -> str: | ||
| 150 | "", | 150 | "", |
| 151 | "## Experiments", | 151 | "## Experiments", |
| 152 | "", | 152 | "", |
| 153 | - "| Rank | Name | Score | MAP_3 | MAP_2_3 | P@5 | P@10 | Config |", | 153 | + "| Rank | Name | Score | NDCG@10 | NDCG@20 | Strong@10 | Gain Recall@50 | Config |", |
| 154 | "|---|---|---:|---:|---:|---:|---:|---|", | 154 | "|---|---|---:|---:|---:|---:|---:|---|", |
| 155 | ] | 155 | ] |
| 156 | for idx, item in enumerate(summary["experiments"], start=1): | 156 | for idx, item in enumerate(summary["experiments"], start=1): |
| @@ -162,10 +162,10 @@ def render_markdown(summary: Dict[str, Any]) -> str: | @@ -162,10 +162,10 @@ def render_markdown(summary: Dict[str, Any]) -> str: | ||
| 162 | str(idx), | 162 | str(idx), |
| 163 | item["name"], | 163 | item["name"], |
| 164 | str(item["score"]), | 164 | str(item["score"]), |
| 165 | - str(metrics.get("MAP_3", "")), | ||
| 166 | - str(metrics.get("MAP_2_3", "")), | ||
| 167 | - str(metrics.get("P@5", "")), | ||
| 168 | - str(metrics.get("P@10", "")), | 165 | + str(metrics.get("NDCG@10", "")), |
| 166 | + str(metrics.get("NDCG@20", "")), | ||
| 167 | + str(metrics.get("Strong_Precision@10", "")), | ||
| 168 | + str(metrics.get("Gain_Recall@50", "")), | ||
| 169 | item["config_snapshot_path"], | 169 | item["config_snapshot_path"], |
| 170 | ] | 170 | ] |
| 171 | ) | 171 | ) |
| @@ -206,7 +206,7 @@ def build_parser() -> argparse.ArgumentParser: | @@ -206,7 +206,7 @@ def build_parser() -> argparse.ArgumentParser: | ||
| 206 | parser.add_argument("--language", default="en") | 206 | parser.add_argument("--language", default="en") |
| 207 | parser.add_argument("--experiments-file", required=True) | 207 | parser.add_argument("--experiments-file", required=True) |
| 208 | parser.add_argument("--search-base-url", default="http://127.0.0.1:6002") | 208 | parser.add_argument("--search-base-url", default="http://127.0.0.1:6002") |
| 209 | - parser.add_argument("--score-metric", default="MAP_3") | 209 | + parser.add_argument("--score-metric", default="NDCG@10") |
| 210 | parser.add_argument("--apply-best", action="store_true") | 210 | parser.add_argument("--apply-best", action="store_true") |
| 211 | parser.add_argument("--force-refresh-labels-first-pass", action="store_true") | 211 | parser.add_argument("--force-refresh-labels-first-pass", action="store_true") |
| 212 | return parser | 212 | return parser |
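The default `--score-metric` is now `NDCG@10`, so tuning runs are ranked by graded NDCG instead of binary MAP. As a reference for what that optimizes, here is a minimal NDCG sketch, written in JavaScript to match the UI code above; the tier-to-gain mapping is an illustrative assumption, not necessarily the evaluator's configured gain scheme:

```javascript
// Illustrative gains for the four relevance tiers (assumed values).
const GAINS = { "Exact Match": 3, "High Relevant": 2, "Low Relevant": 1, "Irrelevant": 0 };

// Discounted cumulative gain over the top-k positions (log2 discount).
function dcg(gains, k) {
  return gains.slice(0, k).reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);
}

// NDCG@k for a ranked list of tier labels; 0 when nothing judged useful exists.
function ndcg(labels, k = 10) {
  const gains = labels.map((label) => GAINS[label] ?? 0);
  const ideal = [...gains].sort((a, b) => b - a);
  const denom = dcg(ideal, k);
  return denom > 0 ? dcg(gains, k) / denom : 0;
}

// Promoting the Exact Match to the top slot raises NDCG@10.
const before = ndcg(["High Relevant", "Exact Match", "Irrelevant", "Low Relevant"]);
const after = ndcg(["Exact Match", "High Relevant", "Irrelevant", "Low Relevant"]);
```

Because the gain mapping is graded, a swap between an Exact Match and a High Relevant result changes the score even though both count as "hits" under a binary metric — which is the point of moving the tuner off MAP.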