Commit 7ddd4cb3acf5e2e0b748467448c83348c87eff20

Authored by tangwang
1 parent 9df421ed

Evaluation system: three tiers -> four tiers (Exact Match / High Relevant / Low Relevant / Irrelevant)
scripts/evaluation/README.md
... ... @@ -2,7 +2,7 @@
2 2  
3 3 This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality.
4 4  
5   -**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when coverage is incomplete.
  5 +**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system and ranking metrics centered on `NDCG`.
6 6  
7 7 ## What it does
8 8  
... ... @@ -112,9 +112,33 @@ Default root: `artifacts/search_evaluation/`
112 112  
113 113 ## Labels
114 114  
115   -- **Exact** — Matches intended product type and all explicit required attributes.
116   -- **Partial** — Main intent matches; attributes missing, approximate, or weaker.
117   -- **Irrelevant** — Type mismatch or conflicting required attributes.
  115 +- **Exact Match** — Matches intended product type and all explicit required attributes.
  116 +- **High Relevant** — Matches the main intent and is a strong substitute, but some attributes are missing, weaker, or slightly off.
  117 +- **Low Relevant** — Only a weak substitute; may share scenario, style, or broad category but is no longer a strong match.
  118 +- **Irrelevant** — Type mismatch or important conflicts make it a poor search result.
  119 +
  120 +## Metric design
  121 +
  122 +This framework now uses graded ranking evaluation, closer to e-commerce best practice, instead of collapsing everything into binary relevance.
  123 +
  124 +- **Primary metric: `NDCG@10`**
  125 + Uses the four labels as graded gains and rewards both relevance and early placement.
  126 +- **Gain scheme**
  127 + `Exact Match=7`, `High Relevant=3`, `Low Relevant=1`, `Irrelevant=0`
  128 + The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup.
  129 +- **Why this is better**
  130 + `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping an `Exact Match` with a `Low Relevant` item is penalized more than swapping `High Relevant` with `Low Relevant`.
  131 +
  132 +The reported metrics are:
  133 +
  134 +- **`NDCG@5`, `NDCG@10`, `NDCG@20`, `NDCG@50`** — Primary graded ranking quality.
  135 +- **`Exact_Precision@K`** — Strict top-slot quality when only `Exact Match` counts.
  136 +- **`Strong_Precision@K`** — Business-facing top-slot quality where `Exact Match + High Relevant` count as strong positives.
  137 +- **`Useful_Precision@K`** — Broader usefulness where any non-irrelevant result counts.
  138 +- **`Gain_Recall@K`** — Gain captured in the returned list versus the judged label pool for the query.
  139 +- **`Exact_Success@K` / `Strong_Success@K`** — Whether at least one exact or strong result appears in the first K.
  140 +- **`MRR_Exact@10` / `MRR_Strong@10`** — How early the first exact or strong result appears.
  141 +- **`Avg_Grade@10`** — Average relevance grade of the visible first page.
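The gain scheme and `NDCG` math documented above can be sketched as a standalone illustration (plain Python, not the framework module; the label strings are the four tiers listed in this README):

```python
import math

# Gain scheme for the four tiers: gain = 2^rel - 1 with rel grades 3/2/1/0.
GAINS = {"Exact Match": 7.0, "High Relevant": 3.0, "Low Relevant": 1.0, "Irrelevant": 0.0}

def dcg_at_k(labels, k):
    # Discounted cumulative gain over the first k ranked labels.
    return sum(
        GAINS.get(label, 0.0) / math.log2(rank + 1.0)
        for rank, label in enumerate(labels[:k], start=1)
    )

def ndcg_at_k(labels, judged_pool, k):
    # Normalize against the ideal ordering of the judged label pool.
    ideal = sorted(judged_pool, key=lambda lb: GAINS.get(lb, 0.0), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(labels, k) / idcg if idcg > 0.0 else 0.0

# Swapping Exact Match out of the top slot costs more than any lower-tier swap.
ranked = ["High Relevant", "Exact Match", "Low Relevant", "Irrelevant"]
pool = ["Exact Match", "High Relevant", "Low Relevant", "Irrelevant"]
```

Here, returning the judged pool in its ideal order gives `NDCG = 1.0`, while the `ranked` list is penalized (to roughly 0.84) for placing `High Relevant` ahead of `Exact Match`.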
118 142  
119 143 **Labeler modes:** `simple` (default): one judging pass per batch with the standard relevance prompt. `complex`: query-profile extraction plus extra guardrails (for structured experiments).
120 144  
... ... @@ -139,11 +163,11 @@ Default root: `artifacts/search_evaluation/`
139 163  
140 164 ## Web UI
141 165  
142   -Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, top recalls, missed Exact/Partial, and coverage tips for unlabeled hits.
  166 +Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits.
143 167  
144 168 ## Batch reports
145 169  
146   -Each run stores aggregate and per-query metrics, label distribution, timestamp, and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
  170 +Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
147 171  
148 172 ## Caveats
149 173  
... ...
scripts/evaluation/eval_framework/constants.py
... ... @@ -14,8 +14,22 @@ RELEVANCE_IRRELEVANT = "Irrelevant"
14 14  
15 15 VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT})
16 16  
17   -# Precision / MAP "positive" set (all non-irrelevant tiers)
  17 +# Useful label sets for binary diagnostic slices layered on top of graded ranking metrics.
18 18 RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW})
  19 +RELEVANCE_STRONG = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH})
  20 +
  21 +# Graded relevance for ranking evaluation.
  22 +# We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics.
  23 +RELEVANCE_GRADE_MAP = {
  24 + RELEVANCE_EXACT: 3,
  25 + RELEVANCE_HIGH: 2,
  26 + RELEVANCE_LOW: 1,
  27 + RELEVANCE_IRRELEVANT: 0,
  28 +}
  29 +RELEVANCE_GAIN_MAP = {
  30 + label: (2 ** grade) - 1
  31 + for label, grade in RELEVANCE_GRADE_MAP.items()
  32 +}
19 33  
20 34 _LEGACY_LABEL_MAP = {
21 35 "Exact": RELEVANCE_EXACT,
... ...
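As a quick sanity check on the grade-to-gain relationship added above, the derived gains can be reproduced standalone (this mirrors `RELEVANCE_GRADE_MAP` and `RELEVANCE_GAIN_MAP`, it does not import the module):

```python
# Mirror of the grade map above; gains follow gain = 2**grade - 1.
grade_map = {"Exact Match": 3, "High Relevant": 2, "Low Relevant": 1, "Irrelevant": 0}
gain_map = {label: (2 ** grade) - 1 for label, grade in grade_map.items()}
# Yields the scheme documented in the README: 7 / 3 / 1 / 0.
```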
scripts/evaluation/eval_framework/framework.py
... ... @@ -26,6 +26,7 @@ from .constants import (
26 26 DEFAULT_RERANK_HIGH_THRESHOLD,
27 27 DEFAULT_SEARCH_RECALL_TOP_K,
28 28 RELEVANCE_EXACT,
  29 + RELEVANCE_GAIN_MAP,
29 30 RELEVANCE_HIGH,
30 31 RELEVANCE_IRRELEVANT,
31 32 RELEVANCE_LOW,
... ... @@ -50,6 +51,18 @@ from .utils import (
50 51 _log = logging.getLogger("search_eval.framework")
51 52  
52 53  
  54 +def _metric_context_payload() -> Dict[str, Any]:
  55 + return {
  56 + "primary_metric": "NDCG@10",
  57 + "gain_scheme": dict(RELEVANCE_GAIN_MAP),
  58 + "notes": [
  59 + "NDCG uses graded gains derived from the four relevance labels.",
  60 + "Strong metrics treat Exact Match and High Relevant as strong business positives.",
  61 + "Useful metrics treat any non-irrelevant item as useful recall coverage.",
  62 + ],
  63 + }
  64 +
  65 +
53 66 def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]:
54 67 """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``."""
55 68 out: Dict[str, str] = {}
... ... @@ -607,7 +620,7 @@ class SearchEvaluationFramework:
607 620 item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
608 621 for item in search_labeled_results[:100]
609 622 ]
610   - metrics = compute_query_metrics(top100_labels)
  623 + metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
611 624 output_dir = ensure_dir(self.artifact_root / "query_builds")
612 625 run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}"
613 626 output_json_path = output_dir / f"{run_id}.json"
... ... @@ -629,6 +642,7 @@ class SearchEvaluationFramework:
629 642 "pool_size": len(pool_docs),
630 643 },
631 644 "metrics_top100": metrics,
  645 + "metric_context": _metric_context_payload(),
632 646 "search_results": search_labeled_results,
633 647 "full_rerank_top": rerank_top_results,
634 648 }
... ... @@ -816,7 +830,7 @@ class SearchEvaluationFramework:
816 830 item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
817 831 for item in search_labeled_results[:100]
818 832 ]
819   - metrics = compute_query_metrics(top100_labels)
  833 + metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
820 834 output_dir = ensure_dir(self.artifact_root / "query_builds")
821 835 run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}"
822 836 output_json_path = output_dir / f"{run_id}.json"
... ... @@ -838,6 +852,7 @@ class SearchEvaluationFramework:
838 852 "ordered_union_size": pool_docs_count,
839 853 },
840 854 "metrics_top100": metrics,
  855 + "metric_context": _metric_context_payload(),
841 856 "search_results": search_labeled_results,
842 857 "full_rerank_top": rerank_top_results,
843 858 }
... ... @@ -897,6 +912,10 @@ class SearchEvaluationFramework:
897 912 item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
898 913 for item in labeled
899 914 ]
  915 + ideal_labels = [
  916 + label if label in VALID_LABELS else RELEVANCE_IRRELEVANT
  917 + for label in labels.values()
  918 + ]
900 919 label_stats = self.store.get_query_label_stats(self.tenant_id, query)
901 920 rerank_scores = self.store.get_rerank_scores(self.tenant_id, query)
902 921 relevant_missing_ids = [
... ... @@ -947,12 +966,13 @@ class SearchEvaluationFramework:
947 966 if unlabeled_hits:
948 967 tips.append(f"{unlabeled_hits} recalled results were not in the annotation set and were counted as Irrelevant.")
949 968 if not missing_relevant:
950   - tips.append("No cached non-irrelevant products were missed by this recall set.")
  969 + tips.append("No cached judged useful products were missed by this recall set.")
951 970 return {
952 971 "query": query,
953 972 "tenant_id": self.tenant_id,
954 973 "top_k": top_k,
955   - "metrics": compute_query_metrics(metric_labels),
  974 + "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels),
  975 + "metric_context": _metric_context_payload(),
956 976 "results": labeled,
957 977 "missing_relevant": missing_relevant,
958 978 "label_stats": {
... ... @@ -1004,12 +1024,12 @@ class SearchEvaluationFramework:
1004 1024 )
1005 1025 m = live["metrics"]
1006 1026 _log.info(
1007   - "[batch-eval] (%s/%s) query=%r P@10=%s MAP_3=%s total_hits=%s",
  1027 + "[batch-eval] (%s/%s) query=%r NDCG@10=%s Strong_Precision@10=%s total_hits=%s",
1008 1028 q_index,
1009 1029 total_q,
1010 1030 query,
1011   - m.get("P@10"),
1012   - m.get("MAP_3"),
  1031 + m.get("NDCG@10"),
  1032 + m.get("Strong_Precision@10"),
1013 1033 live.get("total"),
1014 1034 )
1015 1035 aggregate = aggregate_metrics([item["metrics"] for item in per_query])
... ... @@ -1033,6 +1053,7 @@ class SearchEvaluationFramework:
1033 1053 "queries": list(queries),
1034 1054 "top_k": top_k,
1035 1055 "aggregate_metrics": aggregate,
  1056 + "metric_context": _metric_context_payload(),
1036 1057 "aggregate_distribution": aggregate_distribution,
1037 1058 "per_query": per_query,
1038 1059 "config_snapshot_path": str(config_snapshot_path),
... ...
scripts/evaluation/eval_framework/metrics.py
1   -"""IR metrics for labeled result lists."""
  1 +"""Ranking metrics for graded e-commerce relevance labels."""
2 2  
3 3 from __future__ import annotations
4 4  
5   -from typing import Dict, Sequence
  5 +import math
  6 +from typing import Dict, Iterable, Sequence
6 7  
7   -from .constants import RELEVANCE_EXACT, RELEVANCE_IRRELEVANT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_NON_IRRELEVANT
  8 +from .constants import (
  9 + RELEVANCE_EXACT,
  10 + RELEVANCE_GAIN_MAP,
  11 + RELEVANCE_GRADE_MAP,
  12 + RELEVANCE_HIGH,
  13 + RELEVANCE_IRRELEVANT,
  14 + RELEVANCE_LOW,
  15 + RELEVANCE_NON_IRRELEVANT,
  16 + RELEVANCE_STRONG,
  17 +)
8 18  
9 19  
10   -def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> float:
  20 +def _normalize_label(label: str) -> str:
  21 + if label in RELEVANCE_GRADE_MAP:
  22 + return label
  23 + return RELEVANCE_IRRELEVANT
  24 +
  25 +
  26 +def _gains_for_labels(labels: Sequence[str]) -> list[float]:
  27 + return [float(RELEVANCE_GAIN_MAP.get(_normalize_label(label), 0.0)) for label in labels]
  28 +
  29 +
  30 +def _binary_hits(labels: Sequence[str], relevant: Iterable[str]) -> list[int]:
  31 + relevant_set = set(relevant)
  32 + return [1 if _normalize_label(label) in relevant_set else 0 for label in labels]
  33 +
  34 +
  35 +def _precision_at_k_from_hits(hits: Sequence[int], k: int) -> float:
11 36 if k <= 0:
12 37 return 0.0
13   - sliced = list(labels[:k])
  38 + sliced = list(hits[:k])
14 39 if not sliced:
15 40 return 0.0
16   - rel = set(relevant)
17   - hits = sum(1 for label in sliced if label in rel)
18   - return hits / float(min(k, len(sliced)))
  41 + return sum(sliced) / float(len(sliced))
  42 +
  43 +
  44 +def _success_at_k_from_hits(hits: Sequence[int], k: int) -> float:
  45 + if k <= 0:
  46 + return 0.0
  47 + return 1.0 if any(hits[:k]) else 0.0
  48 +
  49 +
  50 +def _reciprocal_rank_from_hits(hits: Sequence[int], k: int) -> float:
  51 + if k <= 0:
  52 + return 0.0
  53 + for idx, hit in enumerate(hits[:k], start=1):
  54 + if hit:
  55 + return 1.0 / float(idx)
  56 + return 0.0
19 57  
20 58  
21   -def average_precision(labels: Sequence[str], relevant: Sequence[str]) -> float:
22   - rel = set(relevant)
23   - hit_count = 0
24   - precision_sum = 0.0
25   - for idx, label in enumerate(labels, start=1):
26   - if label not in rel:
  59 +def _dcg_at_k(gains: Sequence[float], k: int) -> float:
  60 + if k <= 0:
  61 + return 0.0
  62 + total = 0.0
  63 + for idx, gain in enumerate(gains[:k], start=1):
  64 + if gain <= 0.0:
27 65 continue
28   - hit_count += 1
29   - precision_sum += hit_count / idx
30   - if hit_count == 0:
  66 + total += gain / math.log2(idx + 1.0)
  67 + return total
  68 +
  69 +
  70 +def _ndcg_at_k(labels: Sequence[str], ideal_labels: Sequence[str], k: int) -> float:
  71 + actual_gains = _gains_for_labels(labels)
  72 + ideal_gains = sorted(_gains_for_labels(ideal_labels), reverse=True)
  73 + dcg = _dcg_at_k(actual_gains, k)
  74 + idcg = _dcg_at_k(ideal_gains, k)
  75 + if idcg <= 0.0:
  76 + return 0.0
  77 + return dcg / idcg
  78 +
  79 +
  80 +def _gain_recall_at_k(labels: Sequence[str], ideal_labels: Sequence[str], k: int) -> float:
  81 + ideal_total_gain = sum(_gains_for_labels(ideal_labels))
  82 + if ideal_total_gain <= 0.0:
31 83 return 0.0
32   - return precision_sum / hit_count
  84 + actual_gain = sum(_gains_for_labels(labels[:k]))
  85 + return actual_gain / ideal_total_gain
33 86  
34 87  
35   -def compute_query_metrics(labels: Sequence[str]) -> Dict[str, float]:
36   - """P@k / MAP_3: Exact Match only. P@k_2_3 / MAP_2_3: any non-irrelevant tier (legacy metric names)."""
  88 +def _grade_avg_at_k(labels: Sequence[str], k: int) -> float:
  89 + if k <= 0:
  90 + return 0.0
  91 + sliced = [_normalize_label(label) for label in labels[:k]]
  92 + if not sliced:
  93 + return 0.0
  94 + return sum(float(RELEVANCE_GRADE_MAP.get(label, 0)) for label in sliced) / float(len(sliced))
  95 +
  96 +
  97 +def compute_query_metrics(
  98 + labels: Sequence[str],
  99 + *,
  100 + ideal_labels: Sequence[str] | None = None,
  101 +) -> Dict[str, float]:
  102 + """Compute graded ranking metrics plus binary diagnostic slices.
  103 +
  104 + `labels` are the ranked results returned by search.
  105 + `ideal_labels` is the judged label pool for the same query; when omitted we fall back
  106 + to the retrieved labels, which still keeps the metrics well-defined.
  107 + """
  108 +
  109 + ideal = list(ideal_labels) if ideal_labels is not None else list(labels)
37 110 metrics: Dict[str, float] = {}
38   - non_irrel = list(RELEVANCE_NON_IRRELEVANT)
  111 +
  112 + exact_hits = _binary_hits(labels, [RELEVANCE_EXACT])
  113 + strong_hits = _binary_hits(labels, RELEVANCE_STRONG)
  114 + useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT)
  115 +
39 116 for k in (5, 10, 20, 50):
40   - metrics[f"P@{k}"] = round(precision_at_k(labels, k, [RELEVANCE_EXACT]), 6)
41   - metrics[f"P@{k}_2_3"] = round(precision_at_k(labels, k, non_irrel), 6)
42   - metrics["MAP_3"] = round(average_precision(labels, [RELEVANCE_EXACT]), 6)
43   - metrics["MAP_2_3"] = round(average_precision(labels, non_irrel), 6)
  117 + metrics[f"NDCG@{k}"] = round(_ndcg_at_k(labels, ideal, k), 6)
  118 + for k in (5, 10, 20):
  119 + metrics[f"Exact_Precision@{k}"] = round(_precision_at_k_from_hits(exact_hits, k), 6)
  120 + metrics[f"Strong_Precision@{k}"] = round(_precision_at_k_from_hits(strong_hits, k), 6)
  121 + for k in (10, 20, 50):
  122 + metrics[f"Useful_Precision@{k}"] = round(_precision_at_k_from_hits(useful_hits, k), 6)
  123 + metrics[f"Gain_Recall@{k}"] = round(_gain_recall_at_k(labels, ideal, k), 6)
  124 + for k in (5, 10):
  125 + metrics[f"Exact_Success@{k}"] = round(_success_at_k_from_hits(exact_hits, k), 6)
  126 + metrics[f"Strong_Success@{k}"] = round(_success_at_k_from_hits(strong_hits, k), 6)
  127 + metrics["MRR_Exact@10"] = round(_reciprocal_rank_from_hits(exact_hits, 10), 6)
  128 + metrics["MRR_Strong@10"] = round(_reciprocal_rank_from_hits(strong_hits, 10), 6)
  129 + metrics["Avg_Grade@10"] = round(_grade_avg_at_k(labels, 10), 6)
44 130 return metrics
45 131  
46 132  
47 133 def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -> Dict[str, float]:
48 134 if not metric_items:
49 135 return {}
50   - keys = sorted(metric_items[0].keys())
  136 + all_keys = sorted({key for item in metric_items for key in item.keys()})
51 137 return {
52 138 key: round(sum(float(item.get(key, 0.0)) for item in metric_items) / len(metric_items), 6)
53   - for key in keys
  139 + for key in all_keys
54 140 }
55 141  
56 142  
... ...
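The `Gain_Recall@K` idea in this file can be illustrated in isolation (a sketch under the same 7/3/1/0 gain scheme, mirroring the `_gain_recall_at_k` logic above rather than importing it):

```python
GAINS = {"Exact Match": 7.0, "High Relevant": 3.0, "Low Relevant": 1.0, "Irrelevant": 0.0}

def gain_recall_at_k(returned, judged_pool, k):
    # Fraction of the judged pool's total gain captured in the top-k returned results.
    total = sum(GAINS.get(lb, 0.0) for lb in judged_pool)
    if total <= 0.0:
        return 0.0
    return sum(GAINS.get(lb, 0.0) for lb in returned[:k]) / total

pool = ["Exact Match", "Exact Match", "High Relevant", "Low Relevant"]  # total gain 18
returned = ["Exact Match", "Irrelevant", "High Relevant"]               # captured gain 10
```

With these lists, `gain_recall_at_k(returned, pool, 10)` is 10/18: the list recovers one of two `Exact Match` items plus the `High Relevant` item from the judged pool.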
scripts/evaluation/eval_framework/reports.py
... ... @@ -7,6 +7,19 @@ from typing import Any, Dict
7 7 from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW
8 8  
9 9  
  10 +def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -> None:
  11 + primary_keys = ("NDCG@5", "NDCG@10", "NDCG@20", "Exact_Precision@10", "Strong_Precision@10", "Gain_Recall@50")
  12 + included = set()
  13 + for key in primary_keys:
  14 + if key in metrics:
  15 + lines.append(f"- {key}: {metrics[key]}")
  16 + included.add(key)
  17 + for key, value in sorted(metrics.items()):
  18 + if key in included:
  19 + continue
  20 + lines.append(f"- {key}: {value}")
  21 +
  22 +
10 23 def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
11 24 lines = [
12 25 "# Search Batch Evaluation",
... ... @@ -20,8 +33,16 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -&gt; str:
20 33 "## Aggregate Metrics",
21 34 "",
22 35 ]
23   - for key, value in sorted((payload.get("aggregate_metrics") or {}).items()):
24   - lines.append(f"- {key}: {value}")
  36 + metric_context = payload.get("metric_context") or {}
  37 + if metric_context:
  38 + lines.extend(
  39 + [
  40 + f"- Primary metric: {metric_context.get('primary_metric', 'N/A')}",
  41 + f"- Gain scheme: {metric_context.get('gain_scheme', {})}",
  42 + "",
  43 + ]
  44 + )
  45 + _append_metric_block(lines, payload.get("aggregate_metrics") or {})
25 46 distribution = payload.get("aggregate_distribution") or {}
26 47 if distribution:
27 48 lines.extend(
... ... @@ -39,8 +60,7 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -&gt; str:
39 60 for item in payload.get("per_query") or []:
40 61 lines.append(f"### {item['query']}")
41 62 lines.append("")
42   - for key, value in sorted((item.get("metrics") or {}).items()):
43   - lines.append(f"- {key}: {value}")
  63 + _append_metric_block(lines, item.get("metrics") or {})
44 64 distribution = item.get("distribution") or {}
45 65 lines.append(f"- Exact Match: {distribution.get(RELEVANCE_EXACT, 0)}")
46 66 lines.append(f"- High Relevant: {distribution.get(RELEVANCE_HIGH, 0)}")
... ...
scripts/evaluation/eval_framework/static/eval_web.css
... ... @@ -6,7 +6,8 @@
6 6 --line: #ddd4c6;
7 7 --accent: #0f766e;
8 8 --exact: #0f766e;
9   - --partial: #b7791f;
  9 + --high: #b7791f;
  10 + --low: #3b82a0;
10 11 --irrelevant: #b42318;
11 12 }
12 13 body { margin: 0; font-family: "IBM Plex Sans", "Segoe UI", sans-serif; color: var(--ink); background:
... ... @@ -29,6 +30,12 @@
29 30 button { border: 0; background: var(--accent); color: white; padding: 12px 16px; border-radius: 14px; cursor: pointer; font-weight: 600; }
30 31 button.secondary { background: #d9e6e3; color: #12433d; }
31 32 .grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(170px, 1fr)); gap: 12px; margin-bottom: 16px; }
  33 + .metric-context { margin: 0 0 12px; line-height: 1.5; }
  34 + .metric-section { margin-bottom: 18px; }
  35 + .metric-section-head { display: flex; align-items: baseline; justify-content: space-between; gap: 12px; margin-bottom: 10px; }
  36 + .metric-section-head h3 { margin: 0; font-size: 14px; color: #12433d; }
  37 + .metric-section-head p { margin: 0; color: var(--muted); font-size: 12px; }
  38 + .metric-grid { margin-bottom: 0; }
32 39 .metric { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 14px; }
33 40 .metric .label { font-size: 12px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.04em; }
34 41 .metric .value { font-size: 24px; font-weight: 700; margin-top: 4px; }
... ... @@ -36,8 +43,8 @@
36 43 .result { display: grid; grid-template-columns: 110px 100px 1fr; gap: 14px; align-items: center; background: var(--panel); border: 1px solid var(--line); border-radius: 18px; padding: 12px; }
37 44 .badge { display: inline-block; padding: 8px 10px; border-radius: 999px; color: white; font-weight: 700; text-align: center; }
38 45 .label-exact-match { background: var(--exact); }
39   - .label-high-relevant { background: var(--partial); }
40   - .label-low-relevant { background: #6b5b95; }
  46 + .label-high-relevant { background: var(--high); }
  47 + .label-low-relevant { background: var(--low); }
41 48 .label-irrelevant { background: var(--irrelevant); }
42 49 .badge-unknown { background: #637381; }
43 50 .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; }
... ... @@ -91,3 +98,13 @@
91 98 .report-modal-body.report-modal-loading, .report-modal-body.report-modal-error { color: var(--muted); font-style: italic; }
92 99 .tips { background: var(--panel); border: 1px solid var(--line); border-radius: 16px; padding: 14px; line-height: 1.6; }
93 100 .tip { margin-bottom: 6px; color: var(--muted); }
  101 + @media (max-width: 960px) {
  102 + .app { grid-template-columns: 1fr; }
  103 + .sidebar { border-right: 0; border-bottom: 1px solid var(--line); }
  104 + .metric-section-head { flex-direction: column; align-items: flex-start; }
  105 + }
  106 + @media (max-width: 640px) {
  107 + .main, .sidebar { padding: 16px; }
  108 + .result { grid-template-columns: 1fr; }
  109 + .thumb { width: 100%; max-width: 180px; height: auto; aspect-ratio: 1 / 1; }
  110 + }
... ...
scripts/evaluation/eval_framework/static/eval_web.js
1   - async function fetchJSON(url, options) {
2   - const res = await fetch(url, options);
3   - if (!res.ok) throw new Error(await res.text());
4   - return await res.json();
5   - }
6   - function renderMetrics(metrics) {
7   - const root = document.getElementById('metrics');
8   - root.innerHTML = '';
9   - Object.entries(metrics || {}).forEach(([key, value]) => {
10   - const card = document.createElement('div');
11   - card.className = 'metric';
12   - card.innerHTML = `<div class="label">${key}</div><div class="value">${value}</div>`;
13   - root.appendChild(card);
14   - });
15   - }
16   - function labelBadgeClass(label) {
17   - if (!label || label === 'Unknown') return 'badge-unknown';
18   - return 'label-' + String(label).toLowerCase().replace(/\s+/g, '-');
19   - }
20   - function renderResults(results, rootId='results', showRank=true) {
21   - const mount = document.getElementById(rootId);
22   - mount.innerHTML = '';
23   - (results || []).forEach(item => {
24   - const label = item.label || 'Unknown';
25   - const box = document.createElement('div');
26   - box.className = 'result';
27   - box.innerHTML = `
28   - <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || '-'}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : 'not recalled')}</div></div>
29   - <img class="thumb" src="${item.image_url || ''}" alt="" />
30   - <div>
31   - <div class="title">${item.title || ''}</div>
32   - ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''}
33   - <div class="options">
34   - <div>${(item.option_values || [])[0] || ''}</div>
35   - <div>${(item.option_values || [])[1] || ''}</div>
36   - <div>${(item.option_values || [])[2] || ''}</div>
37   - </div>
38   - </div>`;
39   - mount.appendChild(box);
40   - });
41   - if (!(results || []).length) {
42   - mount.innerHTML = '<div class="muted">None.</div>';
43   - }
44   - }
45   - function renderTips(data) {
46   - const root = document.getElementById('tips');
47   - const tips = [...(data.tips || [])];
48   - const stats = data.label_stats || {};
49   - tips.unshift(`Cached labels for query: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed (non-irrelevant): ${stats.missing_relevant_count || 0} — Exact: ${stats.missing_exact_count || 0}, High: ${stats.missing_high_count || 0}, Low: ${stats.missing_low_count || 0}.`);
50   - root.innerHTML = tips.map(text => `<div class="tip">${text}</div>`).join('');
51   - }
52   - async function loadQueries() {
53   - const data = await fetchJSON('/api/queries');
54   - const root = document.getElementById('queryList');
55   - root.innerHTML = '';
56   - data.queries.forEach(query => {
57   - const btn = document.createElement('button');
58   - btn.className = 'query-item';
59   - btn.textContent = query;
60   - btn.onclick = () => {
61   - document.getElementById('queryInput').value = query;
62   - runSingle();
63   - };
64   - root.appendChild(btn);
65   - });
66   - }
67   - function fmtMetric(m, key, digits) {
68   - const v = m && m[key];
69   - if (v == null || Number.isNaN(Number(v))) return null;
70   - const n = Number(v);
71   - return n.toFixed(digits);
72   - }
73   - function historySummaryHtml(meta) {
74   - const m = meta && meta.aggregate_metrics;
75   - const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
76   - const parts = [];
77   - if (nq != null) parts.push(`<span>Queries</span> ${nq}`);
78   - const p10 = fmtMetric(m, 'P@10', 3);
79   - const p52 = fmtMetric(m, 'P@5_2_3', 3);
80   - const map3 = fmtMetric(m, 'MAP_3', 3);
81   - if (p10) parts.push(`<span>P@10</span> ${p10}`);
82   - if (p52) parts.push(`<span>P@5_2_3</span> ${p52}`);
83   - if (map3) parts.push(`<span>MAP_3</span> ${map3}`);
84   - if (!parts.length) return '';
85   - return `<div class="hstats">${parts.join(' · ')}</div>`;
86   - }
87   - async function loadHistory() {
88   - const data = await fetchJSON('/api/history');
89   - const root = document.getElementById('history');
90   - root.classList.remove('muted');
91   - const items = data.history || [];
92   - if (!items.length) {
93   - root.innerHTML = '<span class="muted">No history yet.</span>';
94   - return;
95   - }
96   - root.innerHTML = `<div class="history-list"></div>`;
97   - const list = root.querySelector('.history-list');
98   - items.forEach(item => {
99   - const btn = document.createElement('button');
100   - btn.type = 'button';
101   - btn.className = 'history-item';
102   - btn.setAttribute('aria-label', `Open report ${item.batch_id}`);
103   - const sum = historySummaryHtml(item.metadata);
104   - btn.innerHTML = `<div class="hid">${item.batch_id}</div>
105   - <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`;
106   - btn.onclick = () => openBatchReport(item.batch_id);
107   - list.appendChild(btn);
108   - });
109   - }
110   - let _lastReportPath = '';
111   - function closeReportModal() {
112   - const el = document.getElementById('reportModal');
113   - el.classList.remove('is-open');
114   - el.setAttribute('aria-hidden', 'true');
115   - document.getElementById('reportModalBody').innerHTML = '';
116   - document.getElementById('reportModalMeta').textContent = '';
117   - }
118   - async function openBatchReport(batchId) {
119   - const el = document.getElementById('reportModal');
120   - const body = document.getElementById('reportModalBody');
121   - const metaEl = document.getElementById('reportModalMeta');
122   - const titleEl = document.getElementById('reportModalTitle');
123   - el.classList.add('is-open');
124   - el.setAttribute('aria-hidden', 'false');
125   - titleEl.textContent = batchId;
126   - metaEl.textContent = '';
127   - body.className = 'report-modal-body batch-report-md report-modal-loading';
128   - body.textContent = 'Loading report…';
129   - try {
130   - const rep = await fetchJSON('/api/history/' + encodeURIComponent(batchId) + '/report');
131   - _lastReportPath = rep.report_markdown_path || '';
132   - metaEl.textContent = rep.report_markdown_path || '';
133   - const raw = marked.parse(rep.markdown || '', { gfm: true });
134   - const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } });
135   - body.className = 'report-modal-body batch-report-md';
136   - body.innerHTML = safe;
137   - } catch (e) {
138   - body.className = 'report-modal-body report-modal-error';
139   - body.textContent = (e && e.message) ? e.message : String(e);
140   - }
141   - }
142   - document.getElementById('reportModal').addEventListener('click', (ev) => {
143   - if (ev.target && ev.target.getAttribute('data-close-report') === '1') closeReportModal();
  1 +async function fetchJSON(url, options) {
  2 + const res = await fetch(url, options);
  3 + if (!res.ok) throw new Error(await res.text());
  4 + return await res.json();
  5 +}
  6 +
  7 +function fmtNumber(value, digits = 3) {
  8 + if (value == null || Number.isNaN(Number(value))) return "-";
  9 + return Number(value).toFixed(digits);
  10 +}
  11 +
  12 +function metricSections(metrics) {
  13 + const groups = [
  14 + {
  15 + title: "Primary Ranking",
  16 + keys: ["NDCG@5", "NDCG@10", "NDCG@20", "NDCG@50"],
  17 + description: "Graded ranking quality across the four relevance tiers.",
  18 + },
  19 + {
  20 + title: "Top Slot Quality",
  21 + keys: ["Exact_Precision@5", "Exact_Precision@10", "Strong_Precision@5", "Strong_Precision@10", "Strong_Precision@20"],
  22 + description: "How much of the visible top rank is exact or strong business relevance.",
  23 + },
  24 + {
  25 + title: "Recall Coverage",
  26 + keys: ["Useful_Precision@10", "Useful_Precision@20", "Useful_Precision@50", "Gain_Recall@10", "Gain_Recall@20", "Gain_Recall@50"],
  27 + description: "How much judged relevance is captured in the returned list.",
  28 + },
  29 + {
  30 + title: "First Good Result",
  31 + keys: ["Exact_Success@5", "Exact_Success@10", "Strong_Success@5", "Strong_Success@10", "MRR_Exact@10", "MRR_Strong@10", "Avg_Grade@10"],
  32 + description: "Whether users see a good result early and how good the top page feels overall.",
  33 + },
  34 + ];
  35 + const seen = new Set();
  36 + return groups
  37 + .map((group) => {
  38 + const items = group.keys
  39 + .filter((key) => metrics && Object.prototype.hasOwnProperty.call(metrics, key))
  40 + .map((key) => {
  41 + seen.add(key);
  42 + return [key, metrics[key]];
  43 + });
  44 + return { ...group, items };
  45 + })
  46 + .filter((group) => group.items.length)
  47 + .concat(
  48 + (() => {
  49 + const rest = Object.entries(metrics || {}).filter(([key]) => !seen.has(key));
  50 + return rest.length
  51 + ? [{ title: "Other Metrics", description: "", items: rest }]
  52 + : [];
  53 + })()
  54 + );
  55 +}
  56 +
  57 +function renderMetrics(metrics, metricContext) {
  58 + const root = document.getElementById("metrics");
  59 + root.innerHTML = "";
  60 + const ctx = document.getElementById("metricContext");
  61 + const gainScheme = metricContext && metricContext.gain_scheme;
  62 + const primary = metricContext && metricContext.primary_metric;
  63 + ctx.textContent = primary
  64 + ? `Primary metric: ${primary}. Gain scheme: ${Object.entries(gainScheme || {}).map(([label, gain]) => `${label}=${gain}`).join(", ")}.`
  65 + : "";
  66 +
  67 + metricSections(metrics || {}).forEach((section) => {
  68 + const wrap = document.createElement("section");
  69 + wrap.className = "metric-section";
  70 + wrap.innerHTML = `
  71 + <div class="metric-section-head">
  72 + <h3>${section.title}</h3>
  73 + ${section.description ? `<p>${section.description}</p>` : ""}
  74 + </div>
  75 + <div class="grid metric-grid"></div>
  76 + `;
  77 + const grid = wrap.querySelector(".metric-grid");
  78 + section.items.forEach(([key, value]) => {
  79 + const card = document.createElement("div");
  80 + card.className = "metric";
  81 + card.innerHTML = `<div class="label">${key}</div><div class="value">${fmtNumber(value)}</div>`;
  82 + grid.appendChild(card);
144 83 });
145   - document.addEventListener('keydown', (ev) => {
146   - if (ev.key === 'Escape') closeReportModal();
147   - });
148   - document.getElementById('reportCopyPath').addEventListener('click', async () => {
149   - if (!_lastReportPath) return;
150   - try {
151   - await navigator.clipboard.writeText(_lastReportPath);
152   - } catch (_) {}
153   - });
154   - async function runSingle() {
155   - const query = document.getElementById('queryInput').value.trim();
156   - if (!query) return;
157   - document.getElementById('status').textContent = `Evaluating "${query}"...`;
158   - const data = await fetchJSON('/api/search-eval', {
159   - method: 'POST',
160   - headers: {'Content-Type': 'application/json'},
161   - body: JSON.stringify({query, top_k: 100, auto_annotate: false})
162   - });
163   - document.getElementById('status').textContent = `Done. total=${data.total}`;
164   - renderMetrics(data.metrics);
165   - renderResults(data.results, 'results', true);
166   - renderResults(data.missing_relevant, 'missingRelevant', false);
167   - renderTips(data);
168   - loadHistory();
169   - }
170   - async function runBatch() {
171   - document.getElementById('status').textContent = 'Running batch evaluation...';
172   - const data = await fetchJSON('/api/batch-eval', {
173   - method: 'POST',
174   - headers: {'Content-Type': 'application/json'},
175   - body: JSON.stringify({top_k: 100, auto_annotate: false})
176   - });
177   - document.getElementById('status').textContent = `Batch done. report=${data.batch_id}`;
178   - renderMetrics(data.aggregate_metrics);
179   - renderResults([], 'results', true);
180   - renderResults([], 'missingRelevant', false);
181   - document.getElementById('tips').innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>';
182   - loadHistory();
183   - }
184   - loadQueries();
185   - loadHistory();
186   -
  84 + root.appendChild(wrap);
  85 + });
  86 +}
  87 +
  88 +function labelBadgeClass(label) {
  89 + if (!label || label === "Unknown") return "badge-unknown";
  90 + return "label-" + String(label).toLowerCase().replace(/\s+/g, "-");
  91 +}
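For reference, the badge class is derived purely mechanically, so the stylesheet must define one class per tier. A standalone copy of the helper shows the exact class names the four labels produce:

```javascript
// Same logic as labelBadgeClass above: lower-case the label and
// replace runs of whitespace with hyphens; unknown labels get a
// dedicated fallback class.
function labelBadgeClass(label) {
  if (!label || label === "Unknown") return "badge-unknown";
  return "label-" + String(label).toLowerCase().replace(/\s+/g, "-");
}

console.log(labelBadgeClass("Exact Match"));   // → "label-exact-match"
console.log(labelBadgeClass("High Relevant")); // → "label-high-relevant"
console.log(labelBadgeClass(null));            // → "badge-unknown"
```

So the CSS needs `label-exact-match`, `label-high-relevant`, `label-low-relevant`, `label-irrelevant`, and `badge-unknown` to cover every case.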
  92 +
  93 +function renderResults(results, rootId = "results", showRank = true) {
  94 + const mount = document.getElementById(rootId);
  95 + mount.innerHTML = "";
  96 + (results || []).forEach((item) => {
  97 + const label = item.label || "Unknown";
  98 + const box = document.createElement("div");
  99 + box.className = "result";
  100 + box.innerHTML = `
  101 + <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || "-"}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : "not recalled")}</div></div>
  102 + <img class="thumb" src="${item.image_url || ""}" alt="" />
  103 + <div>
  104 + <div class="title">${item.title || ""}</div>
  105 + ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ""}
  106 + <div class="options">
  107 + <div>${(item.option_values || [])[0] || ""}</div>
  108 + <div>${(item.option_values || [])[1] || ""}</div>
  109 + <div>${(item.option_values || [])[2] || ""}</div>
  110 + </div>
  111 + </div>`;
  112 + mount.appendChild(box);
  113 + });
  114 + if (!(results || []).length) {
  115 + mount.innerHTML = '<div class="muted">None.</div>';
  116 + }
  117 +}
  118 +
  119 +function renderTips(data) {
  120 + const root = document.getElementById("tips");
  121 + const tips = [...(data.tips || [])];
  122 + const stats = data.label_stats || {};
  123 + tips.unshift(
  124 + `Cached labels: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed judged useful results: ${stats.missing_relevant_count || 0} (Exact ${stats.missing_exact_count || 0}, High ${stats.missing_high_count || 0}, Low ${stats.missing_low_count || 0}).`
  125 + );
  126 + root.innerHTML = tips.map((text) => `<div class="tip">${text}</div>`).join("");
  127 +}
  128 +
  129 +async function loadQueries() {
  130 + const data = await fetchJSON("/api/queries");
  131 + const root = document.getElementById("queryList");
  132 + root.innerHTML = "";
  133 + data.queries.forEach((query) => {
  134 + const btn = document.createElement("button");
  135 + btn.className = "query-item";
  136 + btn.textContent = query;
  137 + btn.onclick = () => {
  138 + document.getElementById("queryInput").value = query;
  139 + runSingle();
  140 + };
  141 + root.appendChild(btn);
  142 + });
  143 +}
  144 +
  145 +function historySummaryHtml(meta) {
  146 + const m = meta && meta.aggregate_metrics;
  147 + const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
  148 + const parts = [];
  149 + if (nq != null) parts.push(`<span>Queries</span> ${nq}`);
  150 + if (m && m["NDCG@10"] != null) parts.push(`<span>NDCG@10</span> ${fmtNumber(m["NDCG@10"])}`);
  151 + if (m && m["Strong_Precision@10"] != null) parts.push(`<span>Strong@10</span> ${fmtNumber(m["Strong_Precision@10"])}`);
  152 + if (m && m["Gain_Recall@50"] != null) parts.push(`<span>Gain Recall@50</span> ${fmtNumber(m["Gain_Recall@50"])}`);
  153 + if (!parts.length) return "";
  154 + return `<div class="hstats">${parts.join(" · ")}</div>`;
  155 +}
  156 +
  157 +async function loadHistory() {
  158 + const data = await fetchJSON("/api/history");
  159 + const root = document.getElementById("history");
  160 + root.classList.remove("muted");
  161 + const items = data.history || [];
  162 + if (!items.length) {
  163 + root.innerHTML = '<span class="muted">No history yet.</span>';
  164 + return;
  165 + }
  166 + root.innerHTML = `<div class="history-list"></div>`;
  167 + const list = root.querySelector(".history-list");
  168 + items.forEach((item) => {
  169 + const btn = document.createElement("button");
  170 + btn.type = "button";
  171 + btn.className = "history-item";
  172 + btn.setAttribute("aria-label", `Open report ${item.batch_id}`);
  173 + const sum = historySummaryHtml(item.metadata);
  174 + btn.innerHTML = `<div class="hid">${item.batch_id}</div>
  175 + <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`;
  176 + btn.onclick = () => openBatchReport(item.batch_id);
  177 + list.appendChild(btn);
  178 + });
  179 +}
  180 +
  181 +let _lastReportPath = "";
  182 +
  183 +function closeReportModal() {
  184 + const el = document.getElementById("reportModal");
  185 + el.classList.remove("is-open");
  186 + el.setAttribute("aria-hidden", "true");
  187 + document.getElementById("reportModalBody").innerHTML = "";
  188 + document.getElementById("reportModalMeta").textContent = "";
  189 +}
  190 +
  191 +async function openBatchReport(batchId) {
  192 + const el = document.getElementById("reportModal");
  193 + const body = document.getElementById("reportModalBody");
  194 + const metaEl = document.getElementById("reportModalMeta");
  195 + const titleEl = document.getElementById("reportModalTitle");
  196 + el.classList.add("is-open");
  197 + el.setAttribute("aria-hidden", "false");
  198 + titleEl.textContent = batchId;
  199 + metaEl.textContent = "";
  200 + body.className = "report-modal-body batch-report-md report-modal-loading";
  201 + body.textContent = "Loading report…";
  202 + try {
  203 + const rep = await fetchJSON("/api/history/" + encodeURIComponent(batchId) + "/report");
  204 + _lastReportPath = rep.report_markdown_path || "";
  205 + metaEl.textContent = rep.report_markdown_path || "";
  206 + const raw = marked.parse(rep.markdown || "", { gfm: true });
  207 + const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } });
  208 + body.className = "report-modal-body batch-report-md";
  209 + body.innerHTML = safe;
  210 + } catch (e) {
  211 + body.className = "report-modal-body report-modal-error";
  212 + body.textContent = e && e.message ? e.message : String(e);
  213 + }
  214 +}
  215 +
  216 +document.getElementById("reportModal").addEventListener("click", (ev) => {
  217 + if (ev.target && ev.target.getAttribute("data-close-report") === "1") closeReportModal();
  218 +});
  219 +
  220 +document.addEventListener("keydown", (ev) => {
  221 + if (ev.key === "Escape") closeReportModal();
  222 +});
  223 +
  224 +document.getElementById("reportCopyPath").addEventListener("click", async () => {
  225 + if (!_lastReportPath) return;
  226 + try {
  227 + await navigator.clipboard.writeText(_lastReportPath);
  228 + } catch (_) {}
  229 +});
  230 +
  231 +async function runSingle() {
  232 + const query = document.getElementById("queryInput").value.trim();
  233 + if (!query) return;
  234 + document.getElementById("status").textContent = `Evaluating "${query}"...`;
  235 + const data = await fetchJSON("/api/search-eval", {
  236 + method: "POST",
  237 + headers: { "Content-Type": "application/json" },
  238 + body: JSON.stringify({ query, top_k: 100, auto_annotate: false }),
  239 + });
  240 + document.getElementById("status").textContent = `Done. total=${data.total}`;
  241 + renderMetrics(data.metrics, data.metric_context);
  242 + renderResults(data.results, "results", true);
  243 + renderResults(data.missing_relevant, "missingRelevant", false);
  244 + renderTips(data);
  245 + loadHistory();
  246 +}
  247 +
  248 +async function runBatch() {
  249 + document.getElementById("status").textContent = "Running batch evaluation...";
  250 + const data = await fetchJSON("/api/batch-eval", {
  251 + method: "POST",
  252 + headers: { "Content-Type": "application/json" },
  253 + body: JSON.stringify({ top_k: 100, auto_annotate: false }),
  254 + });
  255 + document.getElementById("status").textContent = `Batch done. report=${data.batch_id}`;
  256 + renderMetrics(data.aggregate_metrics, data.metric_context);
  257 + renderResults([], "results", true);
  258 + renderResults([], "missingRelevant", false);
  259 + document.getElementById("tips").innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>';
  260 + loadHistory();
  261 +}
  262 +
  263 +loadQueries();
  264 +loadHistory();
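All the handlers above depend on a `fetchJSON` helper defined earlier in `eval_web.js` (outside this diff). A hypothetical minimal shape consistent with how it is used here — resolving to parsed JSON and rejecting with an `Error` whose `message` the UI displays — might look like this; the real helper may differ:

```javascript
// Hypothetical sketch of the fetchJSON helper assumed by the code above.
// Assumptions: a global fetch (browser or Node 18+), and that non-2xx
// responses should reject with a human-readable Error message.
async function fetchJSON(url, options) {
  const resp = await fetch(url, options);
  if (!resp.ok) {
    throw new Error(`${resp.status} ${resp.statusText}: ${await resp.text()}`);
  }
  return resp.json();
}
```

The important contract for the callers is only that failures throw (so `openBatchReport` can render `e.message`) and that successes yield already-parsed JSON.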
... ...
scripts/evaluation/eval_framework/static/index.html
... ... @@ -30,6 +30,7 @@
30 30 <div id="status" class="muted section"></div>
31 31 <section class="section">
32 32 <h2>Metrics</h2>
  33 + <p id="metricContext" class="muted metric-context"></p>
33 34 <div id="metrics" class="grid"></div>
34 35 </section>
35 36 <section class="section">
... ... @@ -37,7 +38,7 @@
37 38 <div id="results" class="results"></div>
38 39 </section>
39 40 <section class="section">
40   - <h2>Missed non-irrelevant (cached)</h2>
  41 + <h2>Missed judged useful results</h2>
41 42 <div id="missingRelevant" class="results"></div>
42 43 </section>
43 44 <section class="section">
... ... @@ -67,4 +68,4 @@
67 68 <script src="https://cdn.jsdelivr.net/npm/dompurify@3.1.6/dist/purify.min.js"></script>
68 69 <script src="/static/eval_web.js"></script>
69 70 </body>
70   -</html>
71 71 \ No newline at end of file
  72 +</html>
... ...
scripts/evaluation/tune_fusion.py
... ... @@ -150,7 +150,7 @@ def render_markdown(summary: Dict[str, Any]) -> str:
150 150 "",
151 151 "## Experiments",
152 152 "",
153   - "| Rank | Name | Score | MAP_3 | MAP_2_3 | P@5 | P@10 | Config |",
  153 + "| Rank | Name | Score | NDCG@10 | NDCG@20 | Strong@10 | Gain Recall@50 | Config |",
154 154 "|---|---|---:|---:|---:|---:|---:|---|",
155 155 ]
156 156 for idx, item in enumerate(summary["experiments"], start=1):
... ... @@ -162,10 +162,10 @@ def render_markdown(summary: Dict[str, Any]) -&gt; str:
162 162 str(idx),
163 163 item["name"],
164 164 str(item["score"]),
165   - str(metrics.get("MAP_3", "")),
166   - str(metrics.get("MAP_2_3", "")),
167   - str(metrics.get("P@5", "")),
168   - str(metrics.get("P@10", "")),
  165 + str(metrics.get("NDCG@10", "")),
  166 + str(metrics.get("NDCG@20", "")),
  167 + str(metrics.get("Strong_Precision@10", "")),
  168 + str(metrics.get("Gain_Recall@50", "")),
169 169 item["config_snapshot_path"],
170 170 ]
171 171 )
... ... @@ -206,7 +206,7 @@ def build_parser() -> argparse.ArgumentParser:
206 206 parser.add_argument("--language", default="en")
207 207 parser.add_argument("--experiments-file", required=True)
208 208 parser.add_argument("--search-base-url", default="http://127.0.0.1:6002")
209   - parser.add_argument("--score-metric", default="MAP_3")
  209 + parser.add_argument("--score-metric", default="NDCG@10")
210 210 parser.add_argument("--apply-best", action="store_true")
211 211 parser.add_argument("--force-refresh-labels-first-pass", action="store_true")
212 212 return parser
... ...