  # Local Translation Model Benchmark Report
  
Test script:
- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
  
Full results:
- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)

Test date:
- `2026-03-18`
  
Environment:
- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
  
  ## Method
  
This round splits the results into three categories:

- `batch_sweep`
  Fixes `concurrency=1` and compares `batch_size=1/4/8/16/32/64`
- `concurrency_sweep`
  Fixes `batch_size=1` and compares `concurrency=1/2/4/8/16/64`
- `batch x concurrency matrix`
  Combined sweep, keeping only combinations with `batch_size * concurrency <= 128`
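
The matrix's admission rule can be enumerated with a short sketch (a hypothetical helper illustrating the constraint, not code from the benchmark script), assuming the batch sizes and concurrency levels listed above:

```python
from itertools import product

BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 64]

def matrix_cases(limit: int = 128) -> list[tuple[int, int]]:
    """Enumerate the (batch_size, concurrency) pairs kept by the matrix sweep,
    dropping combinations where batch_size * concurrency exceeds the limit."""
    return [
        (b, c)
        for b, c in product(BATCH_SIZES, CONCURRENCY_LEVELS)
        if b * c <= limit
    ]

cases = matrix_cases()
print(len(cases), "cases, e.g.", cases[:3])
```

With these levels the constraint admits 25 of the 36 combinations; `(64, 2)` is the largest batch that still allows concurrency above `1`.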
  
Shared settings:
- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per case
- `concurrency_sweep`: `32` requests per case
- `matrix`: `32` requests per case
- Warmup: `1` batch
  
Reproduction command:
  
```bash
cd /data/saas-search
./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
  --suite extended \
  --disable-cache \
  --serial-items-per-case 256 \
  --concurrency-requests-per-case 32 \
  --concurrency-batch-size 1 \
  --output-dir perf_reports/20260318/translation_local_models
```
  
  ## Key Results
  
### 1. Single-stream batch sweep
  
  | Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
  |---|---|---:|---:|---:|---:|
  | `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
  | `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
  | `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
  | `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |
  
Interpretation:
- On pure throughput, all four directions peak at `batch_size=64`
- If per-batch latency also matters, `batch_size=16` is the better default candidate for bulk workloads
- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`
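
As a consistency check, single-stream throughput and batch latency are two views of the same number: average items/s ≈ batch_size / average batch latency. A small sketch backing out the implied batch-16 latencies from the table (assuming the reported items/s is a per-batch average):

```python
def avg_batch_latency_ms(batch_size: int, items_per_s: float) -> float:
    """Back out the average per-batch latency implied by a throughput figure."""
    return batch_size / items_per_s * 1000.0

# Batch-16 throughput figures from the table above.
for model, items_s in [("nllb zh->en", 27.28), ("opus-mt-zh-en", 41.44)]:
    print(f"{model}: ~{avg_batch_latency_ms(16, items_s):.0f} ms per batch of 16")
```

The implied averages (~587 ms for `nllb zh->en`, ~386 ms for `opus-mt-zh-en`) sit below the reported batch-16 p95 values (769 ms, 798 ms), as expected for an average versus a tail figure.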
  
### 2. Single-request concurrency sweep
  
  | Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
  |---|---|---:|---:|---:|---:|
  | `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
  | `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
  | `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
  | `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |
  
Interpretation:
- With `batch_size=1`, raising client concurrency barely improves throughput; it mostly converts queueing time into request latency
- Online query translation should run at low concurrency; beyond `8` concurrent requests, p95 degrades sharply in all four directions
- `opus-mt-zh-en` is the most stable option for online use; `nllb-200-distilled-600m en->zh` is the heaviest
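
The flat-throughput, rising-latency pattern matches what Little's law predicts for a closed loop over a fixed-rate server: mean time in system ≈ in-flight requests / throughput. A back-of-envelope check against the `opus-mt-zh-en` row (a rough model, not a measured queueing analysis; it predicts means, while the table reports p95):

```python
def littles_law_latency_ms(concurrency: int, throughput_items_s: float) -> float:
    """Little's law for a closed loop: mean time in system = N / X,
    with N in-flight requests and throughput X (one item per request here)."""
    return concurrency / throughput_items_s * 1000.0

# opus-mt-zh-en sustained ~9.21 items/s at c=1 (from the table above).
for c in (1, 8, 64):
    print(f"c={c}: predicted mean latency ~{littles_law_latency_ms(c, 9.21):.0f} ms")
```

Assuming throughput stays pinned at the c=1 level, the model predicts ~869 ms at c=8, in the neighborhood of the observed 1043 ms p95; at c=64 the observed 3382 ms p95 sits well below the naive ~6949 ms prediction, suggesting throughput does creep up somewhat at high concurrency.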
  
  ### 3. batch x concurrency matrix
  
Best-throughput combinations (under the `batch_size * concurrency <= 128` constraint):
  
  | Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
  |---|---|---:|---:|---:|---:|---:|
  | `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
  | `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
  | `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
  | `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |
  
Interpretation:
- In the current implementation, throughput is driven by `batch_size`, not by `concurrency`
- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly inflates request latency
- This suggests the local translation service behaves like a single worker draining the GPU serially; capacity planning cannot rely on client concurrency to buy throughput
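
If throughput really is set by `batch_size` alone, capacity planning reduces to dividing the target rate by per-GPU throughput. A sketch with a hypothetical target (the per-GPU figure is the `opus-mt-en-zh` matrix result above; the headroom factor and target are assumptions for illustration):

```python
import math

def replicas_needed(target_items_s: float, per_gpu_items_s: float,
                    headroom: float = 0.7) -> int:
    """GPUs needed to sustain a target rate, derating each GPU to `headroom`
    of its benchmarked throughput to leave room for traffic spikes."""
    return math.ceil(target_items_s / (per_gpu_items_s * headroom))

# Hypothetical target: 100 items/s of en->zh bulk translation on opus-mt-en-zh,
# which benchmarked at ~34.94 items/s per T4 in the matrix sweep.
print(replicas_needed(100, 34.94))
```

At 70% headroom this calls for 5 T4s; with no derating it would be 3. The point stands either way: more GPUs, not more client concurrency, is what buys throughput here.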
  
  ## Recommendation
  
- For online query translation, read the `concurrency_sweep` results, with `batch_size=1` as the primary metric baseline
- For offline bulk translation, read the `batch_sweep` results; start from `batch_size=16` and move up to `32/64` only if throughput targets require it
- If the current single-worker architecture stays, treat the allowed concurrency as a latency-budget question, not a throughput-scaling knob