# Local Translation Model Benchmark Report

Test script:
- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)

Full results:
- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)

Test date:
- `2026-03-18`

Environment:
- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
  
## Method

This round splits the results into three categories:

- `batch_sweep`
  fixes `concurrency=1` and compares `batch_size=1/4/8/16/32/64`
- `concurrency_sweep`
  fixes `batch_size=1` and compares `concurrency=1/2/4/8/16/64`
- `batch x concurrency matrix`
  combined sweep, keeping only combinations with `batch_size * concurrency <= 128`

Common settings:
- cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per case
- `concurrency_sweep`: `32` requests per case
- `matrix`: `32` requests per case
- warmup: `1` batch
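The three sweep shapes above can be sketched as case generators. This is only an illustration of the case grid; the benchmark script's actual internals and flag handling may differ:

```python
from itertools import product

# Sweep values from the Method section above
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCIES = [1, 2, 4, 8, 16, 64]

def batch_sweep_cases():
    """Fix concurrency=1, vary batch_size."""
    return [(b, 1) for b in BATCH_SIZES]

def concurrency_sweep_cases():
    """Fix batch_size=1, vary concurrency."""
    return [(1, c) for c in CONCURRENCIES]

def matrix_cases(limit=128):
    """Combined sweep, keeping batch_size * concurrency <= limit."""
    return [(b, c) for b, c in product(BATCH_SIZES, CONCURRENCIES)
            if b * c <= limit]

for b, c in matrix_cases():
    print(f"batch_size={b}, concurrency={c}")
```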
  
Command to reproduce:
  
```bash
cd /data/saas-search
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --suite extended \
  --disable-cache \
  --serial-items-per-case 256 \
  --concurrency-requests-per-case 32 \
  --concurrency-batch-size 1 \
  --output-dir perf_reports/20260318/translation_local_models
```
  
## Key Results

### 1. Single-stream batch sweep

| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |

Takeaways:
- On pure throughput, all four directions peak at `batch_size=64`
- If more balanced per-batch latency also matters, `batch_size=16` is the better default candidate for bulk workloads
- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`
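One way to read the "Batch 16 p95 ms" column: dividing the batch p95 by the batch size gives a rough per-item latency at that operating point. This is a back-of-envelope ratio derived from the table, not a metric the script reports:

```python
# "Batch 16 p95 ms" values from the batch sweep table above
batch16_p95_ms = {
    ("nllb-200-distilled-600m", "zh->en"): 769.18,
    ("nllb-200-distilled-600m", "en->zh"): 1649.65,
    ("opus-mt-zh-en", "zh->en"): 797.93,
    ("opus-mt-en-zh", "en->zh"): 2098.54,
}

for (model, direction), p95 in batch16_p95_ms.items():
    # Rough per-item cost when 16 items share one batch
    per_item_ms = p95 / 16
    print(f"{model} {direction}: ~{per_item_ms:.1f} ms/item at batch 16")
```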
  
### 2. Single-request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |

Takeaways:
- At `batch_size=1`, raising client concurrency barely improves throughput; it mostly converts queue wait into request latency
- Online query translation should run at low concurrency; beyond `8` concurrent requests, p95 degrades sharply in all four directions
- For the online scenario, the steadiest model is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh`
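The "wait turns into latency" pattern is what a single serial worker predicts: with `c` single-item requests in flight, a request can sit behind up to `c - 1` others, so its latency is bounded by roughly `c` service times. A crude upper-bound check against the `opus-mt-zh-en` row (upper bound because the c=1 p95 overstates the typical service time):

```python
def serial_p95_upper_bound(c1_p95_ms: float, concurrency: int) -> float:
    """Crude bound for one serial worker: a request queued behind
    (concurrency - 1) others finishes after at most `concurrency`
    service times, each taken as the c=1 p95."""
    return concurrency * c1_p95_ms

# opus-mt-zh-en zh->en: c=1 p95 = 179.12 ms; measured c=8 p95 = 1043.06 ms
bound = serial_p95_upper_bound(179.12, 8)
print(f"serial-worker bound at c=8: {bound:.2f} ms (measured p95: 1043.06 ms)")
```

The measured c=8 p95 (1043 ms) sits below the 1433 ms bound, which is consistent with requests queuing on a single worker rather than running in parallel.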
  
### 3. batch x concurrency matrix

Highest-throughput combination per direction (under the `batch_size * concurrency <= 128` constraint):

| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
|---|---|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |

Takeaways:
- In the current implementation, throughput is driven mainly by `batch_size`, not by `concurrency`
- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly inflates request latency
- This suggests the local translation service behaves like a single worker with serial GPU processing; capacity planning cannot count on client concurrency to buy throughput
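A toy model of a single GPU worker illustrates why the matrix looks this way: if the server processes one batch at a time, the makespan for N requests is N × batch_time no matter how many clients are in flight, so throughput is flat in concurrency while per-request latency grows with it. The numbers below are illustrative, not measurements:

```python
def simulate_serial_server(n_requests: int, concurrency: int,
                           batch_size: int, batch_time_s: float):
    """Idealized single worker: clients keep `concurrency` requests
    in flight, but the GPU still runs one batch at a time."""
    makespan = n_requests * batch_time_s              # strictly serial batches
    throughput = n_requests * batch_size / makespan   # items/s, independent of c
    # A request waits, on average, for half the other in-flight requests
    avg_latency = ((concurrency + 1) / 2) * batch_time_s
    return throughput, avg_latency

for c in (1, 2, 4, 8):
    tp, lat = simulate_serial_server(32, c, 64, 1.2)
    print(f"c={c}: {tp:.1f} items/s, avg request latency {lat:.2f} s")
```

In this model throughput stays constant across all concurrency levels while average latency scales with `(c + 1) / 2`, matching the flat-throughput, rising-latency shape of the measured matrix.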
  
## Recommendation

- For online query translation, rely on the `concurrency_sweep` numbers and treat `batch_size=1` as the primary metric baseline
- For offline bulk translation, rely on `batch_sweep`; start from `batch_size=16` by default, then move up to `32/64` as throughput targets require
- As long as the current single-worker architecture stays, treat the allowed concurrency as a latency-budget question, not a throughput-scaling lever
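The last point can be made concrete: under the serial-worker model, a request at concurrency `c` can wait for up to `c` service times, so the usable concurrency is capped at roughly `p95 budget / single-request p95`. The 1000 ms budget below is an assumed example, not a product requirement; the per-request p95 values come from the c=1 column of the concurrency sweep:

```python
import math

def max_concurrency(p95_budget_ms: float, single_request_p95_ms: float) -> int:
    """Serial-worker cap: at concurrency c a request may wait for up to
    c service times, so c is bounded by budget / single-request p95.
    Always allow at least 1 in-flight request."""
    return max(1, math.floor(p95_budget_ms / single_request_p95_ms))

# c=1 p95 values from the concurrency sweep table
print(max_concurrency(1000, 179.12))   # opus-mt-zh-en zh->en
print(max_concurrency(1000, 670.78))   # nllb-200-distilled-600m en->zh
```

Read this way, even a generous 1 s budget only leaves room for a handful of concurrent requests on the fastest direction, and essentially none beyond one on the slower ones.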