  # Local Translation Model Benchmark Report (CTranslate2)
  
Test script:
  - [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
  
CT2 results for this run:
- Markdown: [`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md)
- JSON: [`translation_local_models_ct2_extended_233253.json`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.json)
  
Comparison baseline:
- Baseline README: [`../translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
- Baseline Markdown: [`../translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- Baseline JSON: [`../translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
- Comparison analysis: [`comparison_vs_hf_baseline.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/comparison_vs_hf_baseline.md)
  
Test date:
  - `2026-03-18`
  
Environment:
- GPU: `Tesla T4 16GB`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- CTranslate2: `4.7.1`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
  
  ## Method
  
This run uses the same parameters as the baseline, so the results are directly comparable:

- `suite=extended`
- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per batch-size setting
- `concurrency_sweep`: `32` requests per concurrency level
- `matrix`: `32` requests per cell
- `concurrency_batch_size=1`
- `batch_size * concurrency <= 128`
- Warmup: `1` batch
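
The `batch_size * concurrency <= 128` cap means the sweep grid is filtered by the product of the two axes. A minimal sketch of that filtering, where the candidate value lists are illustrative assumptions rather than the script's actual defaults:

```python
# Hypothetical sweep grid filtered by the product cap used in this run.
# The candidate value lists below are illustrative assumptions.
BATCH_SIZES = [1, 4, 16, 32, 64]
CONCURRENCY_LEVELS = [1, 8, 32, 64]
MAX_PRODUCT = 128  # mirrors max_batch_concurrency_product=128

# Keep only (batch_size, concurrency) pairs whose product stays within the cap.
matrix = [
    (b, c)
    for b in BATCH_SIZES
    for c in CONCURRENCY_LEVELS
    if b * c <= MAX_PRODUCT
]
```

Under these lists, extreme corners such as `(64, 64)` are skipped while `(64, 1)` and `(16, 8)` survive, which keeps the matrix runtime bounded.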
  
Reproduction command:
  
  ```bash
  cd /data/saas-search
  ./.venv-translator/bin/python - <<'PY'
  import json
  from datetime import datetime
  from pathlib import Path
  from types import SimpleNamespace
  
  from benchmarks.translation.benchmark_translation_local_models import (
      SCENARIOS,
      benchmark_extended_scenario,
      build_environment_info,
      render_markdown_report,
  )
  
  output_dir = Path("perf_reports/20260318/translation_local_models_ct2")
  output_dir.mkdir(parents=True, exist_ok=True)
  
  common = dict(
      csv_path="products_analyzed.csv",
      limit=0,
      output_dir=str(output_dir),
      single=True,
      scene="sku_name",
      batch_size=0,
      device_override="",
      torch_dtype_override="",
      max_new_tokens=0,
      num_beams=0,
      attn_implementation="",
      warmup_batches=1,
      disable_cache=True,
      suite="extended",
      batch_size_list="",
      concurrency_list="",
      serial_items_per_case=256,
      concurrency_requests_per_case=32,
      concurrency_batch_size=1,
      max_batch_concurrency_product=128,
  )
  
  report = {
      "generated_at": datetime.now().isoformat(timespec="seconds"),
      "suite": "extended",
      "environment": build_environment_info(),
      "scenarios": [],
  }
  
  for scenario in SCENARIOS:
      args = SimpleNamespace(
          **common,
          model=scenario["model"],
          source_lang=scenario["source_lang"],
          target_lang=scenario["target_lang"],
          column=scenario["column"],
      )
      result = benchmark_extended_scenario(args)
      result["scenario"]["name"] = scenario["name"]
      report["scenarios"].append(result)
  
  stamp = datetime.now().strftime("%H%M%S")
  (output_dir / f"translation_local_models_ct2_extended_{stamp}.json").write_text(
      json.dumps(report, ensure_ascii=False, indent=2),
      encoding="utf-8",
  )
  (output_dir / f"translation_local_models_ct2_extended_{stamp}.md").write_text(
      render_markdown_report(report),
      encoding="utf-8",
  )
  PY
  ```
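
The generated JSON can be inspected quickly. The top-level structure below mirrors the `report` dict built in the script (a `"scenarios"` list where each entry carries `scenario.name`); any deeper metric fields come from `benchmark_extended_scenario` and are not shown here:

```python
import json

def scenario_names(report_json: str) -> list[str]:
    """Extract scenario names from a report JSON produced by the script above."""
    report = json.loads(report_json)
    return [entry["scenario"]["name"] for entry in report["scenarios"]]

# Minimal example mirroring the report structure built in the repro script:
sample = json.dumps({
    "suite": "extended",
    "scenarios": [{"scenario": {"name": "nllb zh->en"}}],
})
print(scenario_names(sample))  # -> ['nllb zh->en']
```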
  
  ## Key Results
  
### 1. Single-stream batch sweep
  
  | Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
  |---|---|---:|---:|---:|---:|
  | `nllb-200-distilled-600m` | `zh -> en` | `64` | `104.61` | `55.68` | `371.36` |
  | `nllb-200-distilled-600m` | `en -> zh` | `64` | `91.26` | `42.42` | `408.81` |
  | `opus-mt-zh-en` | `zh -> en` | `64` | `218.5` | `111.61` | `257.18` |
  | `opus-mt-en-zh` | `en -> zh` | `32` | `145.12` | `102.05` | `396.16` |
  
Interpretation:
- Bulk throughput in all 4 directions is clearly higher than the original Hugging Face / PyTorch baseline.
- The biggest gain is `nllb en->zh`, whose batch-16 throughput rose from `13.52` to `42.42 items/s`.
- In the CT2 build, `opus-mt-en-zh`'s best batch size dropped from `64` to `32`, meaning it no longer needs an extreme batch size to saturate throughput.
  
### 2. Concurrency sweep with single-item requests
  
  | Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
  |---|---|---:|---:|---:|---:|
  | `nllb-200-distilled-600m` | `zh -> en` | `8.97` | `163.53` | `1039.32` | `3031.64` |
  | `nllb-200-distilled-600m` | `en -> zh` | `5.83` | `259.52` | `2193.01` | `5611.21` |
  | `opus-mt-zh-en` | `zh -> en` | `27.85` | `60.61` | `390.32` | `1061.35` |
  | `opus-mt-en-zh` | `en -> zh` | `11.02` | `351.74` | `863.08` | `2459.49` |
  
Interpretation:
- Online query metrics improved dramatically, especially `p95` and `items/s` at `batch_size=1`.
- Under CT2, rising concurrency still pushes up tail latency, but the degradation is far smaller than with the baseline.
- `opus-mt-zh-en` remains the most stable local model for online serving; `nllb` has now also moved into a more usable range.
  
### 3. Did the results meet expectations?
  
Conclusion:
- **Yes, and by a large margin.**
- This CT2 build already meets the goal of significantly improved online performance; no further urgent throughput/latency optimization is needed.
  
Supporting evidence:
- `items/s` at `concurrency=1` improved to `2.0x-3.1x` of the baseline in all 4 directions.
- `p95` at `concurrency=1` dropped to `29%-44%` of the baseline in all 4 directions.
- NLLB's `batch_size=16` throughput improved by `2.04x` and `3.14x` in its two directions.
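
As a sanity check, these ratios can be recomputed from the raw numbers in the two reports. A minimal sketch, with the example values taken from the batch-16 row of this README and the quoted baseline figure:

```python
# Minimal sketch for recomputing the speedup ratios quoted above.
def throughput_speedup(baseline_items_s: float, ct2_items_s: float) -> float:
    """How many times faster CT2 is than the baseline, rounded to 2 decimals."""
    return round(ct2_items_s / baseline_items_s, 2)

# Example: nllb en->zh at batch_size=16 (13.52 items/s baseline -> 42.42 items/s CT2).
print(throughput_speedup(13.52, 42.42))  # -> 3.14
```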
  
  ## Notes
  
- `peak_gpu_memory_gb` is reported as `0.0` almost everywhere this round. This does not mean CT2 uses no GPU memory; the current script collects the metric via `torch.cuda` statistics, which cannot observe CT2's native CUDA allocations.
- If a GPU-memory comparison dimension is added later, consider sampling `nvidia-smi` or collecting NVML metrics.
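
For the `nvidia-smi` route, one lightweight option is to sample `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` (which prints one MiB value per GPU, one per line) while scenarios run. The helper below is a hypothetical sketch of parsing that output; it is not part of the benchmark script:

```python
# Sketch: convert the output of
#   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
# (one MiB value per GPU, one per line) into GiB floats.
def parse_memory_used_gb(smi_output: str) -> list[float]:
    return [int(line.strip()) / 1024 for line in smi_output.splitlines() if line.strip()]

# Example with captured output for two GPUs:
print(parse_memory_used_gb("2048\n1024\n"))  # -> [2.0, 1.0]
```

Polling this a few times per second in a background thread and recording the max would give a process-agnostic `peak_gpu_memory_gb` that also sees CT2's native allocations.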