Commit 7ddd4cb3acf5e2e0b748467448c83348c87eff20
1 parent 9df421ed
Evaluation system upgraded from three tiers to four: Exact Match / High Relevant / Low Relevant / Irrelevant
Showing 9 changed files with 502 additions and 241 deletions
scripts/evaluation/README.md
| ... | ... | @@ -2,7 +2,7 @@ |
| 2 | 2 | |
| 3 | 3 | This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality. |
| 4 | 4 | |
| 5 | -**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete. | |
| 5 | +**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system and ranking metrics centered on `NDCG`. | |
| 6 | 6 | |
| 7 | 7 | ## What it does |
| 8 | 8 | |
| ... | ... | @@ -112,9 +112,33 @@ Default root: `artifacts/search_evaluation/` |
| 112 | 112 | |
| 113 | 113 | ## Labels |
| 114 | 114 | |
| 115 | -- **Exact** — Matches intended product type and all explicit required attributes. | |
| 116 | -- **Partial** — Main intent matches; attributes missing, approximate, or weaker. | |
| 117 | -- **Irrelevant** — Type mismatch or conflicting required attributes. | |
| 115 | +- **Exact Match** — Matches intended product type and all explicit required attributes. | |
| 116 | +- **High Relevant** — Main intent matches and is a strong substitute, but some attributes are missing, weaker, or slightly off. | |
| 117 | +- **Low Relevant** — Only a weak substitute; may share scenario, style, or broad category but is no longer a strong match. | |
| 118 | +- **Irrelevant** — Type mismatch or important conflicts make it a poor search result. | |
| 119 | + | |
| 120 | +## Metric design | |
| 121 | + | |
| 122 | +This framework now uses graded ranking evaluation, closer to e-commerce best practice, rather than collapsing everything into binary relevance. | |
| 123 | + | |
| 124 | +- **Primary metric: `NDCG@10`** | |
| 125 | + Uses the four labels as graded gains and rewards both relevance and early placement. | |
| 126 | +- **Gain scheme** | |
| 127 | + `Exact Match=7`, `High Relevant=3`, `Low Relevant=1`, `Irrelevant=0` | |
| 128 | + The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup. | |
| 129 | +- **Why this is better** | |
| 130 | + `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping an `Exact Match` with a `Low Relevant` item is penalized more than swapping `High Relevant` with `Low Relevant`. | |
| 131 | + | |
| 132 | +The reported metrics are: | |
| 133 | + | |
| 134 | +- **`NDCG@5`, `NDCG@10`, `NDCG@20`, `NDCG@50`** — Primary graded ranking quality. | |
| 135 | +- **`Exact_Precision@K`** — Strict top-slot quality when only `Exact Match` counts. | |
| 136 | +- **`Strong_Precision@K`** — Business-facing top-slot quality where `Exact Match + High Relevant` count as strong positives. | |
| 137 | +- **`Useful_Precision@K`** — Broader usefulness where any non-irrelevant result counts. | |
| 138 | +- **`Gain_Recall@K`** — Gain captured in the returned list versus the judged label pool for the query. | |
| 139 | +- **`Exact_Success@K` / `Strong_Success@K`** — Whether at least one exact or strong result appears in the first K. | |
| 140 | +- **`MRR_Exact@10` / `MRR_Strong@10`** — How early the first exact or strong result appears. | |
| 141 | +- **`Avg_Grade@10`** — Average relevance grade of the visible first page. | |
| 118 | 142 | |
| 119 | 143 | **Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments). |
| 120 | 144 | |
| ... | ... | @@ -139,11 +163,11 @@ Default root: `artifacts/search_evaluation/` |
| 139 | 163 | |
| 140 | 164 | ## Web UI |
| 141 | 165 | |
| 142 | -Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits. | |
| 142 | -Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits. | |
| 166 | +Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, grouped graded-metric cards, top recalls, missed judged-useful results, and coverage tips for unlabeled hits. | |
| 143 | 167 | |
| 144 | 168 | ## Batch reports |
| 145 | 169 | |
| 146 | -Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`. | |
| 170 | +Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`. | |
| 147 | 171 | |
| 148 | 172 | ## Caveats |
| 149 | 173 | ... | ... |
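The binary diagnostic slices the README lists (`Strong_Success@K`, `MRR_Strong@10`) can be sketched in a few lines. This is an illustrative standalone version, not the module itself; the label strings come from the diff, the helper names are made up here.

```python
# "Strong" positives per the README: Exact Match and High Relevant.
STRONG = {"Exact Match", "High Relevant"}

def strong_success_at_k(labels, k):
    # 1.0 if at least one strong result appears in the first k slots.
    return 1.0 if any(label in STRONG for label in labels[:k]) else 0.0

def strong_mrr_at_k(labels, k):
    # Reciprocal rank of the first strong result within the first k slots.
    for rank, label in enumerate(labels[:k], start=1):
        if label in STRONG:
            return 1.0 / rank
    return 0.0

ranked = ["Low Relevant", "High Relevant", "Exact Match", "Irrelevant"]
```

With this ranking the first strong hit sits at rank 2, so success at K=1 is 0 while MRR over the first page is 0.5.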
scripts/evaluation/eval_framework/constants.py
| ... | ... | @@ -14,8 +14,22 @@ RELEVANCE_IRRELEVANT = "Irrelevant" |
| 14 | 14 | |
| 15 | 15 | VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT}) |
| 16 | 16 | |
| 17 | -# Precision / MAP "positive" set (all non-irrelevant tiers) | |
| 17 | +# Useful label sets for binary diagnostic slices layered on top of graded ranking metrics. | |
| 18 | 18 | RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW}) |
| 19 | +RELEVANCE_STRONG = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH}) | |
| 20 | + | |
| 21 | +# Graded relevance for ranking evaluation. | |
| 22 | +# We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics. | |
| 23 | +RELEVANCE_GRADE_MAP = { | |
| 24 | + RELEVANCE_EXACT: 3, | |
| 25 | + RELEVANCE_HIGH: 2, | |
| 26 | + RELEVANCE_LOW: 1, | |
| 27 | + RELEVANCE_IRRELEVANT: 0, | |
| 28 | +} | |
| 29 | +RELEVANCE_GAIN_MAP = { | |
| 30 | + label: (2 ** grade) - 1 | |
| 31 | + for label, grade in RELEVANCE_GRADE_MAP.items() | |
| 32 | +} | |
| 19 | 33 | |
| 20 | 34 | _LEGACY_LABEL_MAP = { |
| 21 | 35 | "Exact": RELEVANCE_EXACT, | ... | ... |
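As a quick sanity check, the `gain = 2^rel - 1` derivation in the diff above yields exactly the gains quoted in the README (`7/3/1/0`). This sketch redefines the two maps with literal label strings instead of importing the module:

```python
# Grade map as introduced in constants.py (labels spelled out literally here).
RELEVANCE_GRADE_MAP = {
    "Exact Match": 3,
    "High Relevant": 2,
    "Low Relevant": 1,
    "Irrelevant": 0,
}

# gain = 2^rel - 1: the standard exponential gain used for NDCG-style metrics.
RELEVANCE_GAIN_MAP = {
    label: (2 ** grade) - 1 for label, grade in RELEVANCE_GRADE_MAP.items()
}
```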
scripts/evaluation/eval_framework/framework.py
| ... | ... | @@ -26,6 +26,7 @@ from .constants import ( |
| 26 | 26 | DEFAULT_RERANK_HIGH_THRESHOLD, |
| 27 | 27 | DEFAULT_SEARCH_RECALL_TOP_K, |
| 28 | 28 | RELEVANCE_EXACT, |
| 29 | + RELEVANCE_GAIN_MAP, | |
| 29 | 30 | RELEVANCE_HIGH, |
| 30 | 31 | RELEVANCE_IRRELEVANT, |
| 31 | 32 | RELEVANCE_LOW, |
| ... | ... | @@ -50,6 +51,18 @@ from .utils import ( |
| 50 | 51 | _log = logging.getLogger("search_eval.framework") |
| 51 | 52 | |
| 52 | 53 | |
| 54 | +def _metric_context_payload() -> Dict[str, Any]: | |
| 55 | + return { | |
| 56 | + "primary_metric": "NDCG@10", | |
| 57 | + "gain_scheme": dict(RELEVANCE_GAIN_MAP), | |
| 58 | + "notes": [ | |
| 59 | + "NDCG uses graded gains derived from the four relevance labels.", | |
| 60 | + "Strong metrics treat Exact Match and High Relevant as strong business positives.", | |
| 61 | + "Useful metrics treat any non-irrelevant item as useful recall coverage.", | |
| 62 | + ], | |
| 63 | + } | |
| 64 | + | |
| 65 | + | |
| 53 | 66 | def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]: |
| 54 | 67 | """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``.""" |
| 55 | 68 | out: Dict[str, str] = {} |
| ... | ... | @@ -607,7 +620,7 @@ class SearchEvaluationFramework: |
| 607 | 620 | item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT |
| 608 | 621 | for item in search_labeled_results[:100] |
| 609 | 622 | ] |
| 610 | - metrics = compute_query_metrics(top100_labels) | |
| 623 | + metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) | |
| 611 | 624 | output_dir = ensure_dir(self.artifact_root / "query_builds") |
| 612 | 625 | run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}" |
| 613 | 626 | output_json_path = output_dir / f"{run_id}.json" |
| ... | ... | @@ -629,6 +642,7 @@ class SearchEvaluationFramework: |
| 629 | 642 | "pool_size": len(pool_docs), |
| 630 | 643 | }, |
| 631 | 644 | "metrics_top100": metrics, |
| 645 | + "metric_context": _metric_context_payload(), | |
| 632 | 646 | "search_results": search_labeled_results, |
| 633 | 647 | "full_rerank_top": rerank_top_results, |
| 634 | 648 | } |
| ... | ... | @@ -816,7 +830,7 @@ class SearchEvaluationFramework: |
| 816 | 830 | item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT |
| 817 | 831 | for item in search_labeled_results[:100] |
| 818 | 832 | ] |
| 819 | - metrics = compute_query_metrics(top100_labels) | |
| 833 | + metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) | |
| 820 | 834 | output_dir = ensure_dir(self.artifact_root / "query_builds") |
| 821 | 835 | run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}" |
| 822 | 836 | output_json_path = output_dir / f"{run_id}.json" |
| ... | ... | @@ -838,6 +852,7 @@ class SearchEvaluationFramework: |
| 838 | 852 | "ordered_union_size": pool_docs_count, |
| 839 | 853 | }, |
| 840 | 854 | "metrics_top100": metrics, |
| 855 | + "metric_context": _metric_context_payload(), | |
| 841 | 856 | "search_results": search_labeled_results, |
| 842 | 857 | "full_rerank_top": rerank_top_results, |
| 843 | 858 | } |
| ... | ... | @@ -897,6 +912,10 @@ class SearchEvaluationFramework: |
| 897 | 912 | item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT |
| 898 | 913 | for item in labeled |
| 899 | 914 | ] |
| 915 | + ideal_labels = [ | |
| 916 | + label if label in VALID_LABELS else RELEVANCE_IRRELEVANT | |
| 917 | + for label in labels.values() | |
| 918 | + ] | |
| 900 | 919 | label_stats = self.store.get_query_label_stats(self.tenant_id, query) |
| 901 | 920 | rerank_scores = self.store.get_rerank_scores(self.tenant_id, query) |
| 902 | 921 | relevant_missing_ids = [ |
| ... | ... | @@ -947,12 +966,13 @@ class SearchEvaluationFramework: |
| 947 | 966 | if unlabeled_hits: |
| 948 | 967 | tips.append(f"{unlabeled_hits} recalled results were not in the annotation set and were counted as Irrelevant.") |
| 949 | 968 | if not missing_relevant: |
| 950 | - tips.append("No cached non-irrelevant products were missed by this recall set.") | |
| 969 | + tips.append("No cached judged useful products were missed by this recall set.") | |
| 951 | 970 | return { |
| 952 | 971 | "query": query, |
| 953 | 972 | "tenant_id": self.tenant_id, |
| 954 | 973 | "top_k": top_k, |
| 955 | - "metrics": compute_query_metrics(metric_labels), | |
| 974 | + "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels), | |
| 975 | + "metric_context": _metric_context_payload(), | |
| 956 | 976 | "results": labeled, |
| 957 | 977 | "missing_relevant": missing_relevant, |
| 958 | 978 | "label_stats": { |
| ... | ... | @@ -1004,12 +1024,12 @@ class SearchEvaluationFramework: |
| 1004 | 1024 | ) |
| 1005 | 1025 | m = live["metrics"] |
| 1006 | 1026 | _log.info( |
| 1007 | - "[batch-eval] (%s/%s) query=%r P@10=%s MAP_3=%s total_hits=%s", | |
| 1027 | + "[batch-eval] (%s/%s) query=%r NDCG@10=%s Strong_Precision@10=%s total_hits=%s", | |
| 1008 | 1028 | q_index, |
| 1009 | 1029 | total_q, |
| 1010 | 1030 | query, |
| 1011 | - m.get("P@10"), | |
| 1012 | - m.get("MAP_3"), | |
| 1031 | + m.get("NDCG@10"), | |
| 1032 | + m.get("Strong_Precision@10"), | |
| 1013 | 1033 | live.get("total"), |
| 1014 | 1034 | ) |
| 1015 | 1035 | aggregate = aggregate_metrics([item["metrics"] for item in per_query]) |
| ... | ... | @@ -1033,6 +1053,7 @@ class SearchEvaluationFramework: |
| 1033 | 1053 | "queries": list(queries), |
| 1034 | 1054 | "top_k": top_k, |
| 1035 | 1055 | "aggregate_metrics": aggregate, |
| 1056 | + "metric_context": _metric_context_payload(), | |
| 1036 | 1057 | "aggregate_distribution": aggregate_distribution, |
| 1037 | 1058 | "per_query": per_query, |
| 1038 | 1059 | "config_snapshot_path": str(config_snapshot_path), | ... | ... |
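The `ideal_labels` plumbing above is what makes `Gain_Recall@K` meaningful: without the judged pool, the metric would normalize against the retrieved list itself and trivially read high. A minimal self-contained sketch, with the gain values assumed from the scheme above:

```python
# Assumed gain scheme (Exact=7, High=3, Low=1, Irrelevant=0).
GAINS = {"Exact Match": 7, "High Relevant": 3, "Low Relevant": 1, "Irrelevant": 0}

def gain_recall_at_k(labels, ideal_labels, k):
    # Gain captured in the top-k of the returned list, normalized by the
    # total gain of every judged label for the query.
    ideal_total = sum(GAINS.get(label, 0) for label in ideal_labels)
    if ideal_total <= 0:
        return 0.0
    return sum(GAINS.get(label, 0) for label in labels[:k]) / ideal_total

retrieved = ["Exact Match", "Irrelevant", "Low Relevant"]
judged_pool = ["Exact Match", "Exact Match", "High Relevant", "Low Relevant"]
# Captured gain 7 + 0 + 1 = 8 against a judged total of 7 + 7 + 3 + 1 = 18.
recall = gain_recall_at_k(retrieved, judged_pool, 3)
```

Passing the retrieved labels as their own ideal pool would return 1.0 here, which is exactly the degenerate case the framework change avoids.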
scripts/evaluation/eval_framework/metrics.py
| 1 | -"""IR metrics for labeled result lists.""" | |
| 1 | +"""Ranking metrics for graded e-commerce relevance labels.""" | |
| 2 | 2 | |
| 3 | 3 | from __future__ import annotations |
| 4 | 4 | |
| 5 | -from typing import Dict, Sequence | |
| 5 | +import math | |
| 6 | +from typing import Dict, Iterable, Sequence | |
| 6 | 7 | |
| 7 | -from .constants import RELEVANCE_EXACT, RELEVANCE_IRRELEVANT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_NON_IRRELEVANT | |
| 8 | +from .constants import ( | |
| 9 | + RELEVANCE_EXACT, | |
| 10 | + RELEVANCE_GAIN_MAP, | |
| 11 | + RELEVANCE_GRADE_MAP, | |
| 12 | + RELEVANCE_HIGH, | |
| 13 | + RELEVANCE_IRRELEVANT, | |
| 14 | + RELEVANCE_LOW, | |
| 15 | + RELEVANCE_NON_IRRELEVANT, | |
| 16 | + RELEVANCE_STRONG, | |
| 17 | +) | |
| 8 | 18 | |
| 9 | 19 | |
| 10 | -def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> float: | |
| 20 | +def _normalize_label(label: str) -> str: | |
| 21 | + if label in RELEVANCE_GRADE_MAP: | |
| 22 | + return label | |
| 23 | + return RELEVANCE_IRRELEVANT | |
| 24 | + | |
| 25 | + | |
| 26 | +def _gains_for_labels(labels: Sequence[str]) -> list[float]: | |
| 27 | + return [float(RELEVANCE_GAIN_MAP.get(_normalize_label(label), 0.0)) for label in labels] | |
| 28 | + | |
| 29 | + | |
| 30 | +def _binary_hits(labels: Sequence[str], relevant: Iterable[str]) -> list[int]: | |
| 31 | + relevant_set = set(relevant) | |
| 32 | + return [1 if _normalize_label(label) in relevant_set else 0 for label in labels] | |
| 33 | + | |
| 34 | + | |
| 35 | +def _precision_at_k_from_hits(hits: Sequence[int], k: int) -> float: | |
| 11 | 36 | if k <= 0: |
| 12 | 37 | return 0.0 |
| 13 | - sliced = list(labels[:k]) | |
| 38 | + sliced = list(hits[:k]) | |
| 14 | 39 | if not sliced: |
| 15 | 40 | return 0.0 |
| 16 | - rel = set(relevant) | |
| 17 | - hits = sum(1 for label in sliced if label in rel) | |
| 18 | - return hits / float(min(k, len(sliced))) | |
| 41 | + return sum(sliced) / float(len(sliced)) | |
| 42 | + | |
| 43 | + | |
| 44 | +def _success_at_k_from_hits(hits: Sequence[int], k: int) -> float: | |
| 45 | + if k <= 0: | |
| 46 | + return 0.0 | |
| 47 | + return 1.0 if any(hits[:k]) else 0.0 | |
| 48 | + | |
| 49 | + | |
| 50 | +def _reciprocal_rank_from_hits(hits: Sequence[int], k: int) -> float: | |
| 51 | + if k <= 0: | |
| 52 | + return 0.0 | |
| 53 | + for idx, hit in enumerate(hits[:k], start=1): | |
| 54 | + if hit: | |
| 55 | + return 1.0 / float(idx) | |
| 56 | + return 0.0 | |
| 19 | 57 | |
| 20 | 58 | |
| 21 | -def average_precision(labels: Sequence[str], relevant: Sequence[str]) -> float: | |
| 22 | - rel = set(relevant) | |
| 23 | - hit_count = 0 | |
| 24 | - precision_sum = 0.0 | |
| 25 | - for idx, label in enumerate(labels, start=1): | |
| 26 | - if label not in rel: | |
| 59 | +def _dcg_at_k(gains: Sequence[float], k: int) -> float: | |
| 60 | + if k <= 0: | |
| 61 | + return 0.0 | |
| 62 | + total = 0.0 | |
| 63 | + for idx, gain in enumerate(gains[:k], start=1): | |
| 64 | + if gain <= 0.0: | |
| 27 | 65 | continue |
| 28 | - hit_count += 1 | |
| 29 | - precision_sum += hit_count / idx | |
| 30 | - if hit_count == 0: | |
| 66 | + total += gain / math.log2(idx + 1.0) | |
| 67 | + return total | |
| 68 | + | |
| 69 | + | |
| 70 | +def _ndcg_at_k(labels: Sequence[str], ideal_labels: Sequence[str], k: int) -> float: | |
| 71 | + actual_gains = _gains_for_labels(labels) | |
| 72 | + ideal_gains = sorted(_gains_for_labels(ideal_labels), reverse=True) | |
| 73 | + dcg = _dcg_at_k(actual_gains, k) | |
| 74 | + idcg = _dcg_at_k(ideal_gains, k) | |
| 75 | + if idcg <= 0.0: | |
| 76 | + return 0.0 | |
| 77 | + return dcg / idcg | |
| 78 | + | |
| 79 | + | |
| 80 | +def _gain_recall_at_k(labels: Sequence[str], ideal_labels: Sequence[str], k: int) -> float: | |
| 81 | + ideal_total_gain = sum(_gains_for_labels(ideal_labels)) | |
| 82 | + if ideal_total_gain <= 0.0: | |
| 31 | 83 | return 0.0 |
| 32 | - return precision_sum / hit_count | |
| 84 | + actual_gain = sum(_gains_for_labels(labels[:k])) | |
| 85 | + return actual_gain / ideal_total_gain | |
| 33 | 86 | |
| 34 | 87 | |
| 35 | -def compute_query_metrics(labels: Sequence[str]) -> Dict[str, float]: | |
| 36 | - """P@k / MAP_3: Exact Match only. P@k_2_3 / MAP_2_3: any non-irrelevant tier (legacy metric names).""" | |
| 88 | +def _grade_avg_at_k(labels: Sequence[str], k: int) -> float: | |
| 89 | + if k <= 0: | |
| 90 | + return 0.0 | |
| 91 | + sliced = [_normalize_label(label) for label in labels[:k]] | |
| 92 | + if not sliced: | |
| 93 | + return 0.0 | |
| 94 | + return sum(float(RELEVANCE_GRADE_MAP.get(label, 0)) for label in sliced) / float(len(sliced)) | |
| 95 | + | |
| 96 | + | |
| 97 | +def compute_query_metrics( | |
| 98 | + labels: Sequence[str], | |
| 99 | + *, | |
| 100 | + ideal_labels: Sequence[str] | None = None, | |
| 101 | +) -> Dict[str, float]: | |
| 102 | + """Compute graded ranking metrics plus binary diagnostic slices. | |
| 103 | + | |
| 104 | + `labels` are the ranked results returned by search. | |
| 105 | + `ideal_labels` is the judged label pool for the same query; when omitted we fall back | |
| 106 | + to the retrieved labels, which still keeps the metrics well-defined. | |
| 107 | + """ | |
| 108 | + | |
| 109 | + ideal = list(ideal_labels) if ideal_labels is not None else list(labels) | |
| 37 | 110 | metrics: Dict[str, float] = {} |
| 38 | - non_irrel = list(RELEVANCE_NON_IRRELEVANT) | |
| 111 | + | |
| 112 | + exact_hits = _binary_hits(labels, [RELEVANCE_EXACT]) | |
| 113 | + strong_hits = _binary_hits(labels, RELEVANCE_STRONG) | |
| 114 | + useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT) | |
| 115 | + | |
| 39 | 116 | for k in (5, 10, 20, 50): |
| 40 | - metrics[f"P@{k}"] = round(precision_at_k(labels, k, [RELEVANCE_EXACT]), 6) | |
| 41 | - metrics[f"P@{k}_2_3"] = round(precision_at_k(labels, k, non_irrel), 6) | |
| 42 | - metrics["MAP_3"] = round(average_precision(labels, [RELEVANCE_EXACT]), 6) | |
| 43 | - metrics["MAP_2_3"] = round(average_precision(labels, non_irrel), 6) | |
| 117 | + metrics[f"NDCG@{k}"] = round(_ndcg_at_k(labels, ideal, k), 6) | |
| 118 | + for k in (5, 10, 20): | |
| 119 | + metrics[f"Exact_Precision@{k}"] = round(_precision_at_k_from_hits(exact_hits, k), 6) | |
| 120 | + metrics[f"Strong_Precision@{k}"] = round(_precision_at_k_from_hits(strong_hits, k), 6) | |
| 121 | + for k in (10, 20, 50): | |
| 122 | + metrics[f"Useful_Precision@{k}"] = round(_precision_at_k_from_hits(useful_hits, k), 6) | |
| 123 | + metrics[f"Gain_Recall@{k}"] = round(_gain_recall_at_k(labels, ideal, k), 6) | |
| 124 | + for k in (5, 10): | |
| 125 | + metrics[f"Exact_Success@{k}"] = round(_success_at_k_from_hits(exact_hits, k), 6) | |
| 126 | + metrics[f"Strong_Success@{k}"] = round(_success_at_k_from_hits(strong_hits, k), 6) | |
| 127 | + metrics["MRR_Exact@10"] = round(_reciprocal_rank_from_hits(exact_hits, 10), 6) | |
| 128 | + metrics["MRR_Strong@10"] = round(_reciprocal_rank_from_hits(strong_hits, 10), 6) | |
| 129 | + metrics["Avg_Grade@10"] = round(_grade_avg_at_k(labels, 10), 6) | |
| 44 | 130 | return metrics |
| 45 | 131 | |
| 46 | 132 | |
| 47 | 133 | def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -> Dict[str, float]: |
| 48 | 134 | if not metric_items: |
| 49 | 135 | return {} |
| 50 | - keys = sorted(metric_items[0].keys()) | |
| 136 | + all_keys = sorted({key for item in metric_items for key in item.keys()}) | |
| 51 | 137 | return { |
| 52 | 138 | key: round(sum(float(item.get(key, 0.0)) for item in metric_items) / len(metric_items), 6) |
| 53 | - for key in keys | |
| 139 | + for key in all_keys | |
| 54 | 140 | } |
| 55 | 141 | |
| 56 | 142 | ... | ... |
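A compact, self-contained version of the `NDCG@k` path above, using the same gain scheme and `log2(rank + 1)` discount; this is a sketch for intuition, not the module itself:

```python
import math

# Gains as produced by 2^rel - 1 over grades 3/2/1/0.
GAINS = {"Exact Match": 7.0, "High Relevant": 3.0, "Low Relevant": 1.0, "Irrelevant": 0.0}

def dcg_at_k(gains, k):
    # Discounted cumulative gain with the standard log2(rank + 1) discount.
    return sum(g / math.log2(i + 1.0) for i, g in enumerate(gains[:k], start=1) if g > 0)

def ndcg_at_k(labels, ideal_labels, k):
    dcg = dcg_at_k([GAINS.get(label, 0.0) for label in labels], k)
    idcg = dcg_at_k(sorted((GAINS.get(label, 0.0) for label in ideal_labels), reverse=True), k)
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["High Relevant", "Exact Match", "Irrelevant"]
pool = ["Exact Match", "High Relevant", "Irrelevant"]
score = ndcg_at_k(ranked, pool, 3)
```

Swapping the top two items here recovers the ideal ordering and an NDCG of 1.0, which is the early-placement sensitivity the README's metric-design section describes.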
scripts/evaluation/eval_framework/reports.py
| ... | ... | @@ -7,6 +7,19 @@ from typing import Any, Dict |
| 7 | 7 | from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW |
| 8 | 8 | |
| 9 | 9 | |
| 10 | +def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -> None: | |
| 11 | + primary_keys = ("NDCG@5", "NDCG@10", "NDCG@20", "Exact_Precision@10", "Strong_Precision@10", "Gain_Recall@50") | |
| 12 | + included = set() | |
| 13 | + for key in primary_keys: | |
| 14 | + if key in metrics: | |
| 15 | + lines.append(f"- {key}: {metrics[key]}") | |
| 16 | + included.add(key) | |
| 17 | + for key, value in sorted(metrics.items()): | |
| 18 | + if key in included: | |
| 19 | + continue | |
| 20 | + lines.append(f"- {key}: {value}") | |
| 21 | + | |
| 22 | + | |
| 10 | 23 | def render_batch_report_markdown(payload: Dict[str, Any]) -> str: |
| 11 | 24 | lines = [ |
| 12 | 25 | "# Search Batch Evaluation", |
| ... | ... | @@ -20,8 +33,16 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: |
| 20 | 33 | "## Aggregate Metrics", |
| 21 | 34 | "", |
| 22 | 35 | ] |
| 23 | - for key, value in sorted((payload.get("aggregate_metrics") or {}).items()): | |
| 24 | - lines.append(f"- {key}: {value}") | |
| 36 | + metric_context = payload.get("metric_context") or {} | |
| 37 | + if metric_context: | |
| 38 | + lines.extend( | |
| 39 | + [ | |
| 40 | + f"- Primary metric: {metric_context.get('primary_metric', 'N/A')}", | |
| 41 | + f"- Gain scheme: {metric_context.get('gain_scheme', {})}", | |
| 42 | + "", | |
| 43 | + ] | |
| 44 | + ) | |
| 45 | + _append_metric_block(lines, payload.get("aggregate_metrics") or {}) | |
| 25 | 46 | distribution = payload.get("aggregate_distribution") or {} |
| 26 | 47 | if distribution: |
| 27 | 48 | lines.extend( |
| ... | ... | @@ -39,8 +60,7 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: |
| 39 | 60 | for item in payload.get("per_query") or []: |
| 40 | 61 | lines.append(f"### {item['query']}") |
| 41 | 62 | lines.append("") |
| 42 | - for key, value in sorted((item.get("metrics") or {}).items()): | |
| 43 | - lines.append(f"- {key}: {value}") | |
| 63 | + _append_metric_block(lines, item.get("metrics") or {}) | |
| 44 | 64 | distribution = item.get("distribution") or {} |
| 45 | 65 | lines.append(f"- Exact Match: {distribution.get(RELEVANCE_EXACT, 0)}") |
| 46 | 66 | lines.append(f"- High Relevant: {distribution.get(RELEVANCE_HIGH, 0)}") | ... | ... |
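The ordering `_append_metric_block` enforces amounts to "headline graded metrics first, everything else alphabetically". A hypothetical standalone version (the `primary_keys` tuple mirrors the diff; the function name is made up):

```python
def metric_lines(metrics):
    # Headline graded metrics first, then the remaining keys in sorted order.
    primary_keys = ("NDCG@5", "NDCG@10", "NDCG@20", "Exact_Precision@10",
                    "Strong_Precision@10", "Gain_Recall@50")
    head = [key for key in primary_keys if key in metrics]
    tail = sorted(key for key in metrics if key not in head)
    return [f"- {key}: {metrics[key]}" for key in head + tail]

sample = {"Avg_Grade@10": 2.1, "NDCG@10": 0.81, "Strong_Precision@10": 0.5}
```

On `sample` this emits the two headline keys in their fixed order, then `Avg_Grade@10`, matching the Markdown layout of the batch reports.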
scripts/evaluation/eval_framework/static/eval_web.css
| ... | ... | @@ -6,7 +6,8 @@ |
| 6 | 6 | --line: #ddd4c6; |
| 7 | 7 | --accent: #0f766e; |
| 8 | 8 | --exact: #0f766e; |
| 9 | - --partial: #b7791f; | |
| 9 | + --high: #b7791f; | |
| 10 | + --low: #3b82a0; | |
| 10 | 11 | --irrelevant: #b42318; |
| 11 | 12 | } |
| 12 | 13 | body { margin: 0; font-family: "IBM Plex Sans", "Segoe UI", sans-serif; color: var(--ink); background: |
| ... | ... | @@ -29,6 +30,12 @@ |
| 29 | 30 | button { border: 0; background: var(--accent); color: white; padding: 12px 16px; border-radius: 14px; cursor: pointer; font-weight: 600; } |
| 30 | 31 | button.secondary { background: #d9e6e3; color: #12433d; } |
| 31 | 32 | .grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(170px, 1fr)); gap: 12px; margin-bottom: 16px; } |
| 33 | + .metric-context { margin: 0 0 12px; line-height: 1.5; } | |
| 34 | + .metric-section { margin-bottom: 18px; } | |
| 35 | + .metric-section-head { display: flex; align-items: baseline; justify-content: space-between; gap: 12px; margin-bottom: 10px; } | |
| 36 | + .metric-section-head h3 { margin: 0; font-size: 14px; color: #12433d; } | |
| 37 | + .metric-section-head p { margin: 0; color: var(--muted); font-size: 12px; } | |
| 38 | + .metric-grid { margin-bottom: 0; } | |
| 32 | 39 | .metric { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 14px; } |
| 33 | 40 | .metric .label { font-size: 12px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.04em; } |
| 34 | 41 | .metric .value { font-size: 24px; font-weight: 700; margin-top: 4px; } |
| ... | ... | @@ -36,8 +43,8 @@ |
| 36 | 43 | .result { display: grid; grid-template-columns: 110px 100px 1fr; gap: 14px; align-items: center; background: var(--panel); border: 1px solid var(--line); border-radius: 18px; padding: 12px; } |
| 37 | 44 | .badge { display: inline-block; padding: 8px 10px; border-radius: 999px; color: white; font-weight: 700; text-align: center; } |
| 38 | 45 | .label-exact-match { background: var(--exact); } |
| 39 | - .label-high-relevant { background: var(--partial); } | |
| 40 | - .label-low-relevant { background: #6b5b95; } | |
| 46 | + .label-high-relevant { background: var(--high); } | |
| 47 | + .label-low-relevant { background: var(--low); } | |
| 41 | 48 | .label-irrelevant { background: var(--irrelevant); } |
| 42 | 49 | .badge-unknown { background: #637381; } |
| 43 | 50 | .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; } |
| ... | ... | @@ -91,3 +98,13 @@ |
| 91 | 98 | .report-modal-body.report-modal-loading, .report-modal-body.report-modal-error { color: var(--muted); font-style: italic; } |
| 92 | 99 | .tips { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 14px; line-height: 1.6; } |
| 93 | 100 | .tip { margin-bottom: 6px; color: var(--muted); } |
| 101 | + @media (max-width: 960px) { | |
| 102 | + .app { grid-template-columns: 1fr; } | |
| 103 | + .sidebar { border-right: 0; border-bottom: 1px solid var(--line); } | |
| 104 | + .metric-section-head { flex-direction: column; align-items: flex-start; } | |
| 105 | + } | |
| 106 | + @media (max-width: 640px) { | |
| 107 | + .main, .sidebar { padding: 16px; } | |
| 108 | + .result { grid-template-columns: 1fr; } | |
| 109 | + .thumb { width: 100%; max-width: 180px; height: auto; aspect-ratio: 1 / 1; } | |
| 110 | + } | ... | ... |
scripts/evaluation/eval_framework/static/eval_web.js
| 1 | - async function fetchJSON(url, options) { | |
| 2 | - const res = await fetch(url, options); | |
| 3 | - if (!res.ok) throw new Error(await res.text()); | |
| 4 | - return await res.json(); | |
| 5 | - } | |
| 6 | - function renderMetrics(metrics) { | |
| 7 | - const root = document.getElementById('metrics'); | |
| 8 | - root.innerHTML = ''; | |
| 9 | - Object.entries(metrics || {}).forEach(([key, value]) => { | |
| 10 | - const card = document.createElement('div'); | |
| 11 | - card.className = 'metric'; | |
| 12 | - card.innerHTML = `<div class="label">${key}</div><div class="value">${value}</div>`; | |
| 13 | - root.appendChild(card); | |
| 14 | - }); | |
| 15 | - } | |
| 16 | - function labelBadgeClass(label) { | |
| 17 | - if (!label || label === 'Unknown') return 'badge-unknown'; | |
| 18 | - return 'label-' + String(label).toLowerCase().replace(/\s+/g, '-'); | |
| 19 | - } | |
| 20 | - function renderResults(results, rootId='results', showRank=true) { | |
| 21 | - const mount = document.getElementById(rootId); | |
| 22 | - mount.innerHTML = ''; | |
| 23 | - (results || []).forEach(item => { | |
| 24 | - const label = item.label || 'Unknown'; | |
| 25 | - const box = document.createElement('div'); | |
| 26 | - box.className = 'result'; | |
| 27 | - box.innerHTML = ` | |
| 28 | - <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || '-'}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : 'not recalled')}</div></div> | |
| 29 | - <img class="thumb" src="${item.image_url || ''}" alt="" /> | |
| 30 | - <div> | |
| 31 | - <div class="title">${item.title || ''}</div> | |
| 32 | - ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''} | |
| 33 | - <div class="options"> | |
| 34 | - <div>${(item.option_values || [])[0] || ''}</div> | |
| 35 | - <div>${(item.option_values || [])[1] || ''}</div> | |
| 36 | - <div>${(item.option_values || [])[2] || ''}</div> | |
| 37 | - </div> | |
| 38 | - </div>`; | |
| 39 | - mount.appendChild(box); | |
| 40 | - }); | |
| 41 | - if (!(results || []).length) { | |
| 42 | - mount.innerHTML = '<div class="muted">None.</div>'; | |
| 43 | - } | |
| 44 | - } | |
| 45 | - function renderTips(data) { | |
| 46 | - const root = document.getElementById('tips'); | |
| 47 | - const tips = [...(data.tips || [])]; | |
| 48 | - const stats = data.label_stats || {}; | |
| 49 | - tips.unshift(`Cached labels for query: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed (non-irrelevant): ${stats.missing_relevant_count || 0} — Exact: ${stats.missing_exact_count || 0}, High: ${stats.missing_high_count || 0}, Low: ${stats.missing_low_count || 0}.`); | |
| 50 | - root.innerHTML = tips.map(text => `<div class="tip">${text}</div>`).join(''); | |
| 51 | - } | |
| 52 | - async function loadQueries() { | |
| 53 | - const data = await fetchJSON('/api/queries'); | |
| 54 | - const root = document.getElementById('queryList'); | |
| 55 | - root.innerHTML = ''; | |
| 56 | - data.queries.forEach(query => { | |
| 57 | - const btn = document.createElement('button'); | |
| 58 | - btn.className = 'query-item'; | |
| 59 | - btn.textContent = query; | |
| 60 | - btn.onclick = () => { | |
| 61 | - document.getElementById('queryInput').value = query; | |
| 62 | - runSingle(); | |
| 63 | - }; | |
| 64 | - root.appendChild(btn); | |
| 65 | - }); | |
| 66 | - } | |
| 67 | - function fmtMetric(m, key, digits) { | |
| 68 | - const v = m && m[key]; | |
| 69 | - if (v == null || Number.isNaN(Number(v))) return null; | |
| 70 | - const n = Number(v); | |
| 71 | - return n.toFixed(digits); | |
| 72 | - } | |
| 73 | - function historySummaryHtml(meta) { | |
| 74 | - const m = meta && meta.aggregate_metrics; | |
| 75 | - const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; | |
| 76 | - const parts = []; | |
| 77 | - if (nq != null) parts.push(`<span>Queries</span> ${nq}`); | |
| 78 | - const p10 = fmtMetric(m, 'P@10', 3); | |
| 79 | - const p52 = fmtMetric(m, 'P@5_2_3', 3); | |
| 80 | - const map3 = fmtMetric(m, 'MAP_3', 3); | |
| 81 | - if (p10) parts.push(`<span>P@10</span> ${p10}`); | |
| 82 | - if (p52) parts.push(`<span>P@5_2_3</span> ${p52}`); | |
| 83 | - if (map3) parts.push(`<span>MAP_3</span> ${map3}`); | |
| 84 | - if (!parts.length) return ''; | |
| 85 | - return `<div class="hstats">${parts.join(' · ')}</div>`; | |
| 86 | - } | |
| 87 | - async function loadHistory() { | |
| 88 | - const data = await fetchJSON('/api/history'); | |
| 89 | - const root = document.getElementById('history'); | |
| 90 | - root.classList.remove('muted'); | |
| 91 | - const items = data.history || []; | |
| 92 | - if (!items.length) { | |
| 93 | - root.innerHTML = '<span class="muted">No history yet.</span>'; | |
| 94 | - return; | |
| 95 | - } | |
| 96 | - root.innerHTML = `<div class="history-list"></div>`; | |
| 97 | - const list = root.querySelector('.history-list'); | |
| 98 | - items.forEach(item => { | |
| 99 | - const btn = document.createElement('button'); | |
| 100 | - btn.type = 'button'; | |
| 101 | - btn.className = 'history-item'; | |
| 102 | - btn.setAttribute('aria-label', `Open report ${item.batch_id}`); | |
| 103 | - const sum = historySummaryHtml(item.metadata); | |
| 104 | - btn.innerHTML = `<div class="hid">${item.batch_id}</div> | |
| 105 | - <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`; | |
| 106 | - btn.onclick = () => openBatchReport(item.batch_id); | |
| 107 | - list.appendChild(btn); | |
| 108 | - }); | |
| 109 | - } | |
| 110 | - let _lastReportPath = ''; | |
| 111 | - function closeReportModal() { | |
| 112 | - const el = document.getElementById('reportModal'); | |
| 113 | - el.classList.remove('is-open'); | |
| 114 | - el.setAttribute('aria-hidden', 'true'); | |
| 115 | - document.getElementById('reportModalBody').innerHTML = ''; | |
| 116 | - document.getElementById('reportModalMeta').textContent = ''; | |
| 117 | - } | |
| 118 | - async function openBatchReport(batchId) { | |
| 119 | - const el = document.getElementById('reportModal'); | |
| 120 | - const body = document.getElementById('reportModalBody'); | |
| 121 | - const metaEl = document.getElementById('reportModalMeta'); | |
| 122 | - const titleEl = document.getElementById('reportModalTitle'); | |
| 123 | - el.classList.add('is-open'); | |
| 124 | - el.setAttribute('aria-hidden', 'false'); | |
| 125 | - titleEl.textContent = batchId; | |
| 126 | - metaEl.textContent = ''; | |
| 127 | - body.className = 'report-modal-body batch-report-md report-modal-loading'; | |
| 128 | - body.textContent = 'Loading report…'; | |
| 129 | - try { | |
| 130 | - const rep = await fetchJSON('/api/history/' + encodeURIComponent(batchId) + '/report'); | |
| 131 | - _lastReportPath = rep.report_markdown_path || ''; | |
| 132 | - metaEl.textContent = rep.report_markdown_path || ''; | |
| 133 | - const raw = marked.parse(rep.markdown || '', { gfm: true }); | |
| 134 | - const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } }); | |
| 135 | - body.className = 'report-modal-body batch-report-md'; | |
| 136 | - body.innerHTML = safe; | |
| 137 | - } catch (e) { | |
| 138 | - body.className = 'report-modal-body report-modal-error'; | |
| 139 | - body.textContent = (e && e.message) ? e.message : String(e); | |
| 140 | - } | |
| 141 | - } | |
| 142 | - document.getElementById('reportModal').addEventListener('click', (ev) => { | |
| 143 | - if (ev.target && ev.target.getAttribute('data-close-report') === '1') closeReportModal(); | |
| 1 | +async function fetchJSON(url, options) { | |
| 2 | + const res = await fetch(url, options); | |
| 3 | + if (!res.ok) throw new Error(await res.text()); | |
| 4 | + return await res.json(); | |
| 5 | +} | |
| 6 | + | |
| 7 | +function fmtNumber(value, digits = 3) { | |
| 8 | + if (value == null || Number.isNaN(Number(value))) return "-"; | |
| 9 | + return Number(value).toFixed(digits); | |
| 10 | +} | |
| 11 | + | |
| 12 | +function metricSections(metrics) { | |
| 13 | + const groups = [ | |
| 14 | + { | |
| 15 | + title: "Primary Ranking", | |
| 16 | + keys: ["NDCG@5", "NDCG@10", "NDCG@20", "NDCG@50"], | |
| 17 | + description: "Graded ranking quality across the four relevance tiers.", | |
| 18 | + }, | |
| 19 | + { | |
| 20 | + title: "Top Slot Quality", | |
| 21 | + keys: ["Exact_Precision@5", "Exact_Precision@10", "Strong_Precision@5", "Strong_Precision@10", "Strong_Precision@20"], | |
| 22 | +      description: "How many of the visible top slots are exact matches or strongly relevant results.", | |
| 23 | + }, | |
| 24 | + { | |
| 25 | + title: "Recall Coverage", | |
| 26 | + keys: ["Useful_Precision@10", "Useful_Precision@20", "Useful_Precision@50", "Gain_Recall@10", "Gain_Recall@20", "Gain_Recall@50"], | |
| 27 | + description: "How much judged relevance is captured in the returned list.", | |
| 28 | + }, | |
| 29 | + { | |
| 30 | + title: "First Good Result", | |
| 31 | + keys: ["Exact_Success@5", "Exact_Success@10", "Strong_Success@5", "Strong_Success@10", "MRR_Exact@10", "MRR_Strong@10", "Avg_Grade@10"], | |
| 32 | + description: "Whether users see a good result early and how good the top page feels overall.", | |
| 33 | + }, | |
| 34 | + ]; | |
| 35 | + const seen = new Set(); | |
| 36 | + return groups | |
| 37 | + .map((group) => { | |
| 38 | + const items = group.keys | |
| 39 | + .filter((key) => metrics && Object.prototype.hasOwnProperty.call(metrics, key)) | |
| 40 | + .map((key) => { | |
| 41 | + seen.add(key); | |
| 42 | + return [key, metrics[key]]; | |
| 43 | + }); | |
| 44 | + return { ...group, items }; | |
| 45 | + }) | |
| 46 | + .filter((group) => group.items.length) | |
| 47 | + .concat( | |
| 48 | + (() => { | |
| 49 | + const rest = Object.entries(metrics || {}).filter(([key]) => !seen.has(key)); | |
| 50 | + return rest.length | |
| 51 | + ? [{ title: "Other Metrics", description: "", items: rest }] | |
| 52 | + : []; | |
| 53 | + })() | |
| 54 | + ); | |
| 55 | +} | |
| 56 | + | |
| 57 | +function renderMetrics(metrics, metricContext) { | |
| 58 | + const root = document.getElementById("metrics"); | |
| 59 | + root.innerHTML = ""; | |
| 60 | + const ctx = document.getElementById("metricContext"); | |
| 61 | + const gainScheme = metricContext && metricContext.gain_scheme; | |
| 62 | + const primary = metricContext && metricContext.primary_metric; | |
| 63 | + ctx.textContent = primary | |
| 64 | + ? `Primary metric: ${primary}. Gain scheme: ${Object.entries(gainScheme || {}).map(([label, gain]) => `${label}=${gain}`).join(", ")}.` | |
| 65 | + : ""; | |
| 66 | + | |
| 67 | + metricSections(metrics || {}).forEach((section) => { | |
| 68 | + const wrap = document.createElement("section"); | |
| 69 | + wrap.className = "metric-section"; | |
| 70 | + wrap.innerHTML = ` | |
| 71 | + <div class="metric-section-head"> | |
| 72 | + <h3>${section.title}</h3> | |
| 73 | + ${section.description ? `<p>${section.description}</p>` : ""} | |
| 74 | + </div> | |
| 75 | + <div class="grid metric-grid"></div> | |
| 76 | + `; | |
| 77 | + const grid = wrap.querySelector(".metric-grid"); | |
| 78 | + section.items.forEach(([key, value]) => { | |
| 79 | + const card = document.createElement("div"); | |
| 80 | + card.className = "metric"; | |
| 81 | + card.innerHTML = `<div class="label">${key}</div><div class="value">${fmtNumber(value)}</div>`; | |
| 82 | + grid.appendChild(card); | |
| 144 | 83 | }); |
| 145 | - document.addEventListener('keydown', (ev) => { | |
| 146 | - if (ev.key === 'Escape') closeReportModal(); | |
| 147 | - }); | |
| 148 | - document.getElementById('reportCopyPath').addEventListener('click', async () => { | |
| 149 | - if (!_lastReportPath) return; | |
| 150 | - try { | |
| 151 | - await navigator.clipboard.writeText(_lastReportPath); | |
| 152 | - } catch (_) {} | |
| 153 | - }); | |
| 154 | - async function runSingle() { | |
| 155 | - const query = document.getElementById('queryInput').value.trim(); | |
| 156 | - if (!query) return; | |
| 157 | - document.getElementById('status').textContent = `Evaluating "${query}"...`; | |
| 158 | - const data = await fetchJSON('/api/search-eval', { | |
| 159 | - method: 'POST', | |
| 160 | - headers: {'Content-Type': 'application/json'}, | |
| 161 | - body: JSON.stringify({query, top_k: 100, auto_annotate: false}) | |
| 162 | - }); | |
| 163 | - document.getElementById('status').textContent = `Done. total=${data.total}`; | |
| 164 | - renderMetrics(data.metrics); | |
| 165 | - renderResults(data.results, 'results', true); | |
| 166 | - renderResults(data.missing_relevant, 'missingRelevant', false); | |
| 167 | - renderTips(data); | |
| 168 | - loadHistory(); | |
| 169 | - } | |
| 170 | - async function runBatch() { | |
| 171 | - document.getElementById('status').textContent = 'Running batch evaluation...'; | |
| 172 | - const data = await fetchJSON('/api/batch-eval', { | |
| 173 | - method: 'POST', | |
| 174 | - headers: {'Content-Type': 'application/json'}, | |
| 175 | - body: JSON.stringify({top_k: 100, auto_annotate: false}) | |
| 176 | - }); | |
| 177 | - document.getElementById('status').textContent = `Batch done. report=${data.batch_id}`; | |
| 178 | - renderMetrics(data.aggregate_metrics); | |
| 179 | - renderResults([], 'results', true); | |
| 180 | - renderResults([], 'missingRelevant', false); | |
| 181 | - document.getElementById('tips').innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>'; | |
| 182 | - loadHistory(); | |
| 183 | - } | |
| 184 | - loadQueries(); | |
| 185 | - loadHistory(); | |
| 186 | - | |
| 84 | + root.appendChild(wrap); | |
| 85 | + }); | |
| 86 | +} | |
| 87 | + | |
| 88 | +function labelBadgeClass(label) { | |
| 89 | + if (!label || label === "Unknown") return "badge-unknown"; | |
| 90 | + return "label-" + String(label).toLowerCase().replace(/\s+/g, "-"); | |
| 91 | +} | |
| 92 | + | |
| 93 | +function renderResults(results, rootId = "results", showRank = true) { | |
| 94 | + const mount = document.getElementById(rootId); | |
| 95 | + mount.innerHTML = ""; | |
| 96 | + (results || []).forEach((item) => { | |
| 97 | + const label = item.label || "Unknown"; | |
| 98 | + const box = document.createElement("div"); | |
| 99 | + box.className = "result"; | |
| 100 | + box.innerHTML = ` | |
| 101 | + <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || "-"}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : "not recalled")}</div></div> | |
| 102 | + <img class="thumb" src="${item.image_url || ""}" alt="" /> | |
| 103 | + <div> | |
| 104 | + <div class="title">${item.title || ""}</div> | |
| 105 | + ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ""} | |
| 106 | + <div class="options"> | |
| 107 | + <div>${(item.option_values || [])[0] || ""}</div> | |
| 108 | + <div>${(item.option_values || [])[1] || ""}</div> | |
| 109 | + <div>${(item.option_values || [])[2] || ""}</div> | |
| 110 | + </div> | |
| 111 | + </div>`; | |
| 112 | + mount.appendChild(box); | |
| 113 | + }); | |
| 114 | + if (!(results || []).length) { | |
| 115 | + mount.innerHTML = '<div class="muted">None.</div>'; | |
| 116 | + } | |
| 117 | +} | |
| 118 | + | |
| 119 | +function renderTips(data) { | |
| 120 | + const root = document.getElementById("tips"); | |
| 121 | + const tips = [...(data.tips || [])]; | |
| 122 | + const stats = data.label_stats || {}; | |
| 123 | + tips.unshift( | |
| 124 | + `Cached labels: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed judged useful results: ${stats.missing_relevant_count || 0} (Exact ${stats.missing_exact_count || 0}, High ${stats.missing_high_count || 0}, Low ${stats.missing_low_count || 0}).` | |
| 125 | + ); | |
| 126 | + root.innerHTML = tips.map((text) => `<div class="tip">${text}</div>`).join(""); | |
| 127 | +} | |
| 128 | + | |
| 129 | +async function loadQueries() { | |
| 130 | + const data = await fetchJSON("/api/queries"); | |
| 131 | + const root = document.getElementById("queryList"); | |
| 132 | + root.innerHTML = ""; | |
| 133 | + data.queries.forEach((query) => { | |
| 134 | + const btn = document.createElement("button"); | |
| 135 | + btn.className = "query-item"; | |
| 136 | + btn.textContent = query; | |
| 137 | + btn.onclick = () => { | |
| 138 | + document.getElementById("queryInput").value = query; | |
| 139 | + runSingle(); | |
| 140 | + }; | |
| 141 | + root.appendChild(btn); | |
| 142 | + }); | |
| 143 | +} | |
| 144 | + | |
| 145 | +function historySummaryHtml(meta) { | |
| 146 | + const m = meta && meta.aggregate_metrics; | |
| 147 | + const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; | |
| 148 | + const parts = []; | |
| 149 | + if (nq != null) parts.push(`<span>Queries</span> ${nq}`); | |
| 150 | + if (m && m["NDCG@10"] != null) parts.push(`<span>NDCG@10</span> ${fmtNumber(m["NDCG@10"])}`); | |
| 151 | + if (m && m["Strong_Precision@10"] != null) parts.push(`<span>Strong@10</span> ${fmtNumber(m["Strong_Precision@10"])}`); | |
| 152 | + if (m && m["Gain_Recall@50"] != null) parts.push(`<span>Gain Recall@50</span> ${fmtNumber(m["Gain_Recall@50"])}`); | |
| 153 | + if (!parts.length) return ""; | |
| 154 | + return `<div class="hstats">${parts.join(" · ")}</div>`; | |
| 155 | +} | |
| 156 | + | |
| 157 | +async function loadHistory() { | |
| 158 | + const data = await fetchJSON("/api/history"); | |
| 159 | + const root = document.getElementById("history"); | |
| 160 | + root.classList.remove("muted"); | |
| 161 | + const items = data.history || []; | |
| 162 | + if (!items.length) { | |
| 163 | + root.innerHTML = '<span class="muted">No history yet.</span>'; | |
| 164 | + return; | |
| 165 | + } | |
| 166 | + root.innerHTML = `<div class="history-list"></div>`; | |
| 167 | + const list = root.querySelector(".history-list"); | |
| 168 | + items.forEach((item) => { | |
| 169 | + const btn = document.createElement("button"); | |
| 170 | + btn.type = "button"; | |
| 171 | + btn.className = "history-item"; | |
| 172 | + btn.setAttribute("aria-label", `Open report ${item.batch_id}`); | |
| 173 | + const sum = historySummaryHtml(item.metadata); | |
| 174 | + btn.innerHTML = `<div class="hid">${item.batch_id}</div> | |
| 175 | + <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`; | |
| 176 | + btn.onclick = () => openBatchReport(item.batch_id); | |
| 177 | + list.appendChild(btn); | |
| 178 | + }); | |
| 179 | +} | |
| 180 | + | |
| 181 | +let _lastReportPath = ""; | |
| 182 | + | |
| 183 | +function closeReportModal() { | |
| 184 | + const el = document.getElementById("reportModal"); | |
| 185 | + el.classList.remove("is-open"); | |
| 186 | + el.setAttribute("aria-hidden", "true"); | |
| 187 | + document.getElementById("reportModalBody").innerHTML = ""; | |
| 188 | + document.getElementById("reportModalMeta").textContent = ""; | |
| 189 | +} | |
| 190 | + | |
| 191 | +async function openBatchReport(batchId) { | |
| 192 | + const el = document.getElementById("reportModal"); | |
| 193 | + const body = document.getElementById("reportModalBody"); | |
| 194 | + const metaEl = document.getElementById("reportModalMeta"); | |
| 195 | + const titleEl = document.getElementById("reportModalTitle"); | |
| 196 | + el.classList.add("is-open"); | |
| 197 | + el.setAttribute("aria-hidden", "false"); | |
| 198 | + titleEl.textContent = batchId; | |
| 199 | + metaEl.textContent = ""; | |
| 200 | + body.className = "report-modal-body batch-report-md report-modal-loading"; | |
| 201 | + body.textContent = "Loading report…"; | |
| 202 | + try { | |
| 203 | + const rep = await fetchJSON("/api/history/" + encodeURIComponent(batchId) + "/report"); | |
| 204 | + _lastReportPath = rep.report_markdown_path || ""; | |
| 205 | + metaEl.textContent = rep.report_markdown_path || ""; | |
| 206 | + const raw = marked.parse(rep.markdown || "", { gfm: true }); | |
| 207 | + const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } }); | |
| 208 | + body.className = "report-modal-body batch-report-md"; | |
| 209 | + body.innerHTML = safe; | |
| 210 | + } catch (e) { | |
| 211 | + body.className = "report-modal-body report-modal-error"; | |
| 212 | + body.textContent = e && e.message ? e.message : String(e); | |
| 213 | + } | |
| 214 | +} | |
| 215 | + | |
| 216 | +document.getElementById("reportModal").addEventListener("click", (ev) => { | |
| 217 | + if (ev.target && ev.target.getAttribute("data-close-report") === "1") closeReportModal(); | |
| 218 | +}); | |
| 219 | + | |
| 220 | +document.addEventListener("keydown", (ev) => { | |
| 221 | + if (ev.key === "Escape") closeReportModal(); | |
| 222 | +}); | |
| 223 | + | |
| 224 | +document.getElementById("reportCopyPath").addEventListener("click", async () => { | |
| 225 | + if (!_lastReportPath) return; | |
| 226 | + try { | |
| 227 | + await navigator.clipboard.writeText(_lastReportPath); | |
| 228 | + } catch (_) {} | |
| 229 | +}); | |
| 230 | + | |
| 231 | +async function runSingle() { | |
| 232 | + const query = document.getElementById("queryInput").value.trim(); | |
| 233 | + if (!query) return; | |
| 234 | + document.getElementById("status").textContent = `Evaluating "${query}"...`; | |
| 235 | + const data = await fetchJSON("/api/search-eval", { | |
| 236 | + method: "POST", | |
| 237 | + headers: { "Content-Type": "application/json" }, | |
| 238 | + body: JSON.stringify({ query, top_k: 100, auto_annotate: false }), | |
| 239 | + }); | |
| 240 | + document.getElementById("status").textContent = `Done. total=${data.total}`; | |
| 241 | + renderMetrics(data.metrics, data.metric_context); | |
| 242 | + renderResults(data.results, "results", true); | |
| 243 | + renderResults(data.missing_relevant, "missingRelevant", false); | |
| 244 | + renderTips(data); | |
| 245 | + loadHistory(); | |
| 246 | +} | |
| 247 | + | |
| 248 | +async function runBatch() { | |
| 249 | + document.getElementById("status").textContent = "Running batch evaluation..."; | |
| 250 | + const data = await fetchJSON("/api/batch-eval", { | |
| 251 | + method: "POST", | |
| 252 | + headers: { "Content-Type": "application/json" }, | |
| 253 | + body: JSON.stringify({ top_k: 100, auto_annotate: false }), | |
| 254 | + }); | |
| 255 | + document.getElementById("status").textContent = `Batch done. report=${data.batch_id}`; | |
| 256 | + renderMetrics(data.aggregate_metrics, data.metric_context); | |
| 257 | + renderResults([], "results", true); | |
| 258 | + renderResults([], "missingRelevant", false); | |
| 259 | + document.getElementById("tips").innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>'; | |
| 260 | + loadHistory(); | |
| 261 | +} | |
| 262 | + | |
| 263 | +loadQueries(); | |
| 264 | +loadHistory(); | ... | ... |
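For reviewers unfamiliar with the new metric names rendered above, here is a minimal sketch of the two custom families, `Strong_Precision@K` and `Gain_Recall@K`. The tier groupings and gain values are assumptions inferred from the four-tier labels in this commit; the authoritative computation lives server-side and is surfaced through `metric_context`.

```python
# Hedged sketch of two metric families shown in the UI. The STRONG set and
# GAINS mapping are assumptions; the server computes the real values.
STRONG = {"Exact Match", "High Relevant"}  # assumed "strong" tiers
GAINS = {"Exact Match": 3, "High Relevant": 2, "Low Relevant": 1, "Irrelevant": 0}


def strong_precision_at_k(labels, k):
    """Fraction of the top-k results judged Exact Match or High Relevant."""
    if not k:
        return 0.0
    return sum(1 for lbl in labels[:k] if lbl in STRONG) / k


def gain_recall_at_k(labels, judged_labels, k):
    """Share of the total judged gain that the top-k returned results capture."""
    total = sum(GAINS.get(lbl, 0) for lbl in judged_labels)
    if not total:
        return 0.0
    return sum(GAINS.get(lbl, 0) for lbl in labels[:k]) / total
```

Under these assumptions, a top-10 list that is half strong matches scores `Strong_Precision@10 = 0.5`, and `Gain_Recall@50` approaches 1.0 only when nearly all judged useful gain is recalled.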
scripts/evaluation/eval_framework/static/index.html
| ... | ... | @@ -30,6 +30,7 @@ |
| 30 | 30 | <div id="status" class="muted section"></div> |
| 31 | 31 | <section class="section"> |
| 32 | 32 | <h2>Metrics</h2> |
| 33 | + <p id="metricContext" class="muted metric-context"></p> | |
| 33 | 34 | <div id="metrics" class="grid"></div> |
| 34 | 35 | </section> |
| 35 | 36 | <section class="section"> |
| ... | ... | @@ -37,7 +38,7 @@ |
| 37 | 38 | <div id="results" class="results"></div> |
| 38 | 39 | </section> |
| 39 | 40 | <section class="section"> |
| 40 | - <h2>Missed non-irrelevant (cached)</h2> | |
| 41 | + <h2>Missed judged useful results</h2> | |
| 41 | 42 | <div id="missingRelevant" class="results"></div> |
| 42 | 43 | </section> |
| 43 | 44 | <section class="section"> |
| ... | ... | @@ -67,4 +68,4 @@ |
| 67 | 68 | <script src="https://cdn.jsdelivr.net/npm/dompurify@3.1.6/dist/purify.min.js"></script> |
| 68 | 69 | <script src="/static/eval_web.js"></script> |
| 69 | 70 | </body> |
| 70 | -</html> | |
| 71 | 71 | \ No newline at end of file |
| 72 | +</html> | ... | ... |
scripts/evaluation/tune_fusion.py
| ... | ... | @@ -150,7 +150,7 @@ def render_markdown(summary: Dict[str, Any]) -> str: |
| 150 | 150 | "", |
| 151 | 151 | "## Experiments", |
| 152 | 152 | "", |
| 153 | - "| Rank | Name | Score | MAP_3 | MAP_2_3 | P@5 | P@10 | Config |", | |
| 153 | + "| Rank | Name | Score | NDCG@10 | NDCG@20 | Strong@10 | Gain Recall@50 | Config |", | |
| 154 | 154 | "|---|---|---:|---:|---:|---:|---:|---|", |
| 155 | 155 | ] |
| 156 | 156 | for idx, item in enumerate(summary["experiments"], start=1): |
| ... | ... | @@ -162,10 +162,10 @@ def render_markdown(summary: Dict[str, Any]) -> str: |
| 162 | 162 | str(idx), |
| 163 | 163 | item["name"], |
| 164 | 164 | str(item["score"]), |
| 165 | - str(metrics.get("MAP_3", "")), | |
| 166 | - str(metrics.get("MAP_2_3", "")), | |
| 167 | - str(metrics.get("P@5", "")), | |
| 168 | - str(metrics.get("P@10", "")), | |
| 165 | + str(metrics.get("NDCG@10", "")), | |
| 166 | + str(metrics.get("NDCG@20", "")), | |
| 167 | + str(metrics.get("Strong_Precision@10", "")), | |
| 168 | + str(metrics.get("Gain_Recall@50", "")), | |
| 169 | 169 | item["config_snapshot_path"], |
| 170 | 170 | ] |
| 171 | 171 | ) |
| ... | ... | @@ -206,7 +206,7 @@ def build_parser() -> argparse.ArgumentParser: |
| 206 | 206 | parser.add_argument("--language", default="en") |
| 207 | 207 | parser.add_argument("--experiments-file", required=True) |
| 208 | 208 | parser.add_argument("--search-base-url", default="http://127.0.0.1:6002") |
| 209 | - parser.add_argument("--score-metric", default="MAP_3") | |
| 209 | + parser.add_argument("--score-metric", default="NDCG@10") | |
| 210 | 210 | parser.add_argument("--apply-best", action="store_true") |
| 211 | 211 | parser.add_argument("--force-refresh-labels-first-pass", action="store_true") |
| 212 | 212 | return parser | ... | ... |
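For context, the graded NDCG that the tuner now optimizes by default (`--score-metric NDCG@10`) can be sketched as below. The per-label gain values are assumptions; the real mapping is the one served in `metric_context.gain_scheme`.

```python
# Hedged sketch of graded NDCG@K over the four relevance tiers.
# ASSUMED_GAINS is an assumption, not the committed gain scheme.
import math

ASSUMED_GAINS = {"Exact Match": 3, "High Relevant": 2, "Low Relevant": 1, "Irrelevant": 0}


def ndcg_at_k(ranked_labels, judged_labels, k=10):
    """DCG of the returned ranking divided by the DCG of an ideal reordering."""
    gains = [ASSUMED_GAINS.get(lbl, 0) for lbl in ranked_labels[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted((ASSUMED_GAINS.get(lbl, 0) for lbl in judged_labels), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfectly ordered list scores 1.0; burying an Exact Match below Irrelevant results lowers the score, which is why NDCG replaces the flat P@K / MAP columns in the report table above.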