Commit 2a6d9d76f52556f65e4c3291fc402660f6a21817

Authored by tangwang
1 parent cd4ce66d

Updated the benchmark script and docs so that "single request / single-stream batch / concurrency / batch×concurrency matrix" results are presented fully separately.

The changes touch:

scripts/benchmark_translation_local_models.py: added --suite extended, which
sweeps batch_size=1,4,8,16,32,64 and concurrency=1,2,4,8,16,64 and runs the
combined matrix constrained to batch_size * concurrency <= 128. Single-scenario
mode now loads only the target model, so load_seconds is cleaner, and it also
supports --disable-cache.
translation/README.md: split the performance section into
batch_sweep, concurrency_sweep, and batch x concurrency matrix,
and added this rerun's parameters, repro commands, and summary tables.
perf_reports/20260318/translation_local_models/README.md: added the summary for this follow-up round.
Full results are in translation_local_models_extended_221846.md and
translation_local_models_extended_221846.json.
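The matrix pruning described above can be sketched as a quick sanity check. The grid values and the product cap come from this commit; the helper name below is illustrative, not the script's actual function:

```python
from itertools import product

# Sweep grids used by --suite extended (per the commit description).
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCIES = [1, 2, 4, 8, 16, 64]

def matrix_cases(max_product: int = 128):
    """Keep combinations with batch_size * concurrency within the cap.
    A cap of 0 disables the limit, mirroring --max-batch-concurrency-product."""
    return [
        (bs, c)
        for bs, c in product(BATCH_SIZES, CONCURRENCIES)
        if max_product == 0 or bs * c <= max_product
    ]

cases = matrix_cases()  # 25 combinations survive the <= 128 cap
```

The 25 surviving combinations match the 25 rows in each per-model matrix table of the full report: `(64, 2)` is the largest case kept, while `(64, 4)` is pruned.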
The key conclusions from this rerun are clear:

For online single requests, use concurrency_sweep, i.e., the table with batch_size fixed at 1.
For offline bulk throughput, use batch_sweep. All four directions hit their highest
raw throughput at batch_size=64, but batch_size=16 remains the more balanced default.
The current local seq2seq backend holds a single per-model lock, so raising client
concurrency barely increases throughput; it mostly converts queueing time into higher
p95. Concurrency is therefore a latency-budget question, not a throughput-scaling lever.
In this round, opus-mt-zh-en was the fastest for online single requests; the slowest,
with the strongest latency amplification under concurrency, was nllb-200-distilled-600m en->zh.
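The lock behavior above can be illustrated with a toy closed-queue model (this is not the benchmark code; the 250 ms service time is an arbitrary placeholder): with a single serialized worker, throughput is pinned at 1/service_time no matter how many requests are in flight, while p95 grows roughly linearly with concurrency.

```python
import math

def simulate(concurrency: int, service_ms: float = 250.0) -> tuple[float, float]:
    """Single worker, `concurrency` simultaneous single-item requests.
    The k-th request completes only after (k + 1) full service times."""
    completions = [(k + 1) * service_ms for k in range(concurrency)]
    throughput = concurrency / (completions[-1] / 1000.0)  # requests/s
    p95 = completions[math.ceil(0.95 * concurrency) - 1]   # ms
    return throughput, p95

for c in (1, 8, 64):
    rps, p95 = simulate(c)
    print(f"concurrency={c:>2}  req/s={rps:.1f}  p95={p95:.0f} ms")
```

Throughput stays at 4.0 req/s at every concurrency level while p95 climbs from 250 ms to 15250 ms, the same flat-throughput, rising-p95 shape the concurrency_sweep tables show.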
docs/TODO.txt
1 1  
2 2  
3   -product_enrich : Partial Mode
  3 +
  4 +
  5 +nllb-200-distilled-600M performance optimization
  6 +Research which performance-optimization options exist for seq2seq, transformer-architecture models like nllb-200-distilled-600M to raise the throughput and cut the latency of the online translation service; also survey online inference serving solutions and find a high-performance way to serve these models
  7 +
  8 +cnclip performance optimization
  9 +
  10 +rerank performance optimization
  11 +
  12 +
  13 +Timeouts
  14 +1) Hard timeout while the query-analysis stage waits on translation/embedding
  15 +Config file: config/config.yaml
  16 +Config key: query_config.async_wait_timeout_ms: 80
  17 +Where it takes effect: query/query_parser.py converts this value to seconds and passes it to wait(...)
  18 +2) Embedding HTTP call timeout (Text/Image)
  19 +No environment-variable override anymore (the previously mentioned EMBEDDING_HTTP_TIMEOUT_SEC is no longer used)
  20 +Config file: config/config.yaml
  21 +Config key: services.embedding.providers.http.timeout_sec (an example default of 60 has been added to the YAML)
  22 +Where it takes effect:
  23 +embeddings/text_encoder.py: requests.post(..., timeout=self.timeout_sec)
  24 +embeddings/image_encoder.py: requests.post(..., timeout=self.timeout_sec)
  25 +
  26 +
  27 +
  28 +
  29 +product_enrich : Partial Mode : done
4 30 https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR
5 31 Set the role of the last message in the messages array to assistant, provide the prefix in its content, and set the parameter "partial": true on that message. The messages format is:
6 32 [
... ... @@ -15,7 +41,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-men
15 41 }
16 42 ]
17 43 The model starts generating from the prefix content.
18   -
19 44 Supports non-thinking mode.
20 45  
21 46  
... ... @@ -41,12 +66,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-men
41 66  
42 67  
43 68  
44   -The translation cache needs a refactor
45   -
46   -
47   -
48   -
49   -
50 69  
51 70 suggest index: currently a full-rebuild script, to be handed over to 金伟
52 71  
... ...
perf_reports/20260318/translation_local_models/README.md 0 → 100644
... ... @@ -0,0 +1,101 @@
  1 +# Local Translation Model Benchmark Report
  2 +
  3 +Benchmark script:
  4 +- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  5 +
  6 +Full results:
  7 +- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
  8 +- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
  9 +
  10 +Test date:
  11 +- `2026-03-18`
  12 +
  13 +Environment:
  14 +- GPU: `Tesla T4 16GB`
  15 +- Driver / CUDA: `570.158.01 / 12.8`
  16 +- Python env: `.venv-translator`
  17 +- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
  18 +- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
  19 +
  20 +## Method
  21 +
  22 +This round splits the results into three categories:
  23 +
  24 +- `batch_sweep`
  25 + fixed `concurrency=1`, comparing `batch_size=1/4/8/16/32/64`
  26 +- `concurrency_sweep`
  27 + fixed `batch_size=1`, comparing `concurrency=1/2/4/8/16/64`
  28 +- `batch x concurrency matrix`
  29 + combined sweep, keeping only `batch_size * concurrency <= 128`
  30 +
  31 +Common settings:
  32 +- Cache disabled: `--disable-cache`
  33 +- `batch_sweep`: `256` items per case
  34 +- `concurrency_sweep`: `32` requests per case
  35 +- `matrix`: `32` requests per case
  36 +- Warmup: `1` batch
  37 +
  38 +Repro command:
  39 +
  40 +```bash
  41 +cd /data/saas-search
  42 +./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  43 + --suite extended \
  44 + --disable-cache \
  45 + --serial-items-per-case 256 \
  46 + --concurrency-requests-per-case 32 \
  47 + --concurrency-batch-size 1 \
  48 + --output-dir perf_reports/20260318/translation_local_models
  49 +```
  50 +
  51 +## Key Results
  52 +
  53 +### 1. Single-stream batch sweep
  54 +
  55 +| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
  56 +|---|---|---:|---:|---:|---:|
  57 +| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
  58 +| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
  59 +| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
  60 +| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |
  61 +
  62 +Interpretation:
  63 +- On raw throughput, all four directions peak at `batch_size=64`
  64 +- If more balanced per-batch latency also matters, `batch_size=16` is the better default bulk candidate
  65 +- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`
  66 +
  67 +### 2. Single-request concurrency sweep
  68 +
  69 +| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
  70 +|---|---|---:|---:|---:|---:|
  71 +| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
  72 +| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
  73 +| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
  74 +| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |
  75 +
  76 +Interpretation:
  77 +- With `batch_size=1`, raising client concurrency barely improves throughput; it mostly turns waiting time into request latency
  78 +- Online query translation is better served by low concurrency; beyond `8` concurrent clients, p95 degrades clearly in all four directions
  79 +- The most stable model online is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh`
  80 +
  81 +### 3. batch x concurrency matrix
  82 +
  83 +Highest-throughput combinations (under the `batch_size * concurrency <= 128` constraint):
  84 +
  85 +| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
  86 +|---|---|---:|---:|---:|---:|---:|
  87 +| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
  88 +| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
  89 +| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
  90 +| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |
  91 +
  92 +Interpretation:
  93 +- In the current implementation, throughput is driven by `batch_size`, not by `concurrency`
  94 +- At the same `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly inflates request latency
  95 +- This suggests the local translation service behaves like a single worker with serialized GPU processing; capacity planning cannot rely on client concurrency to buy throughput
  96 +
  97 +## Recommendation
  98 +
  99 +- For online query translation, read `concurrency_sweep` and treat `batch_size=1` as the primary metric
  100 +- For offline bulk translation, read `batch_sweep`; start from `batch_size=16` by default and move to `32/64` only if throughput targets require it
  101 +- As long as the current single-worker architecture stays, treat the allowed concurrency as a latency budget, not a throughput-scaling knob
... ...
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs1.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs16.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs32.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs4.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs64.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs8.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs1.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs16.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs32.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs4.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs64.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs8.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md 0 → 100644
... ... @@ -0,0 +1,263 @@
  1 +# Local Translation Model Extended Benchmark
  2 +
  3 +- Generated at: `2026-03-18T21:28:09`
  4 +- Suite: `extended`
  5 +- Python: `3.12.3`
  6 +- Torch: `2.10.0+cu128`
  7 +- Transformers: `5.3.0`
  8 +- CUDA: `True`
  9 +- GPU: `Tesla T4` (15.56 GiB)
  10 +
  11 +## Reading Guide
  12 +
  13 +- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.
  14 +- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.
  15 +- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.
  16 +
  17 +## nllb-200-distilled-600m zh->en
  18 +
  19 +- Direction: `zh -> en`
  20 +- Column: `title_cn`
  21 +- Loaded rows: `2048`
  22 +- Load time: `6.118 s`
  23 +- Device: `cuda`
  24 +- DType: `torch.float16`
  25 +- Cache disabled: `True`
  26 +
  27 +### Batch Sweep (`concurrency=1`)
  28 +
  29 +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  30 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  31 +| 1 | 256 | 256 | 2.91 | 2.91 | 343.48 | 315.98 | 616.27 | 2.633 |
  32 +| 4 | 256 | 64 | 8.44 | 2.11 | 474.17 | 474.75 | 722.95 | 2.659 |
  33 +| 8 | 256 | 32 | 14.85 | 1.86 | 538.68 | 564.97 | 728.47 | 2.699 |
  34 +| 16 | 256 | 16 | 27.28 | 1.7 | 586.59 | 633.16 | 769.18 | 2.765 |
  35 +| 32 | 256 | 8 | 38.6 | 1.21 | 829.04 | 761.74 | 1369.88 | 2.987 |
  36 +| 64 | 256 | 4 | 58.3 | 0.91 | 1097.73 | 919.35 | 1659.9 | 3.347 |
  37 +
  38 +### Concurrency Sweep (`batch_size=1`)
  39 +
  40 +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  41 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  42 +| 1 | 32 | 32 | 4.17 | 4.17 | 239.99 | 226.34 | 373.27 | 3.347 |
  43 +| 2 | 32 | 32 | 4.1 | 4.1 | 477.99 | 459.36 | 703.96 | 3.347 |
  44 +| 4 | 32 | 32 | 4.1 | 4.1 | 910.74 | 884.71 | 1227.01 | 3.347 |
  45 +| 8 | 32 | 32 | 4.04 | 4.04 | 1697.73 | 1818.48 | 2383.8 | 3.347 |
  46 +| 16 | 32 | 32 | 4.07 | 4.07 | 2801.91 | 3473.63 | 4145.92 | 3.347 |
  47 +| 64 | 32 | 32 | 4.04 | 4.04 | 3714.49 | 3610.08 | 7337.3 | 3.347 |
  48 +
  49 +### Batch x Concurrency Matrix
  50 +
  51 +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  52 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  53 +| 1 | 1 | 32 | 32 | 3.94 | 3.94 | 253.96 | 231.8 | 378.76 | 3.347 |
  54 +| 1 | 2 | 32 | 32 | 3.92 | 3.92 | 500.35 | 480.18 | 747.3 | 3.347 |
  55 +| 1 | 4 | 32 | 32 | 4.05 | 4.05 | 922.53 | 894.57 | 1235.7 | 3.347 |
  56 +| 1 | 8 | 32 | 32 | 4.08 | 4.08 | 1679.71 | 1806.95 | 2346.11 | 3.347 |
  57 +| 1 | 16 | 32 | 32 | 4.05 | 4.05 | 2816.51 | 3485.68 | 4181.39 | 3.347 |
  58 +| 1 | 64 | 32 | 32 | 4.06 | 4.06 | 3711.9 | 3614.34 | 7319.52 | 3.347 |
  59 +| 4 | 1 | 128 | 32 | 9.12 | 2.28 | 438.42 | 386.12 | 724.47 | 3.347 |
  60 +| 4 | 2 | 128 | 32 | 9.2 | 2.3 | 852.37 | 756.47 | 1366.7 | 3.347 |
  61 +| 4 | 4 | 128 | 32 | 9.17 | 2.29 | 1642.9 | 1524.56 | 2662.49 | 3.347 |
  62 +| 4 | 8 | 128 | 32 | 9.18 | 2.29 | 2974.7 | 3168.19 | 4594.99 | 3.347 |
  63 +| 4 | 16 | 128 | 32 | 9.23 | 2.31 | 4897.43 | 5815.91 | 8210.23 | 3.347 |
  64 +| 8 | 1 | 256 | 32 | 14.73 | 1.84 | 543.03 | 556.97 | 734.67 | 3.347 |
  65 +| 8 | 2 | 256 | 32 | 14.73 | 1.84 | 1064.16 | 1132.78 | 1405.53 | 3.347 |
  66 +| 8 | 4 | 256 | 32 | 14.79 | 1.85 | 2035.46 | 2237.69 | 2595.53 | 3.347 |
  67 +| 8 | 8 | 256 | 32 | 14.76 | 1.84 | 3763.14 | 4345.6 | 5025.02 | 3.347 |
  68 +| 8 | 16 | 256 | 32 | 14.77 | 1.85 | 6383.08 | 8053.36 | 9380.87 | 3.347 |
  69 +| 16 | 1 | 512 | 32 | 24.88 | 1.55 | 643.11 | 661.91 | 814.26 | 3.347 |
  70 +| 16 | 2 | 512 | 32 | 24.91 | 1.56 | 1265.17 | 1326.85 | 1576.96 | 3.347 |
  71 +| 16 | 4 | 512 | 32 | 24.94 | 1.56 | 2446.55 | 2594.26 | 3027.37 | 3.347 |
  72 +| 16 | 8 | 512 | 32 | 24.8 | 1.55 | 4548.92 | 5035.56 | 5852.87 | 3.347 |
  73 +| 32 | 1 | 1024 | 32 | 34.9 | 1.09 | 916.78 | 823.68 | 1637.59 | 3.347 |
  74 +| 32 | 2 | 1024 | 32 | 34.98 | 1.09 | 1807.39 | 1717.89 | 2514.27 | 3.347 |
  75 +| 32 | 4 | 1024 | 32 | 34.85 | 1.09 | 3513.42 | 3779.99 | 4750.34 | 3.347 |
  76 +| 64 | 1 | 2048 | 32 | 53.24 | 0.83 | 1202.07 | 993.59 | 1838.72 | 3.642 |
  77 +| 64 | 2 | 2048 | 32 | 53.95 | 0.84 | 2344.92 | 2465.53 | 3562.04 | 3.642 |
  78 +
  79 +## nllb-200-distilled-600m en->zh
  80 +
  81 +- Direction: `en -> zh`
  82 +- Column: `title`
  83 +- Loaded rows: `2048`
  84 +- Load time: `6.137 s`
  85 +- Device: `cuda`
  86 +- DType: `torch.float16`
  87 +- Cache disabled: `True`
  88 +
  89 +### Batch Sweep (`concurrency=1`)
  90 +
  91 +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  92 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  93 +| 1 | 256 | 256 | 1.91 | 1.91 | 524.91 | 497.81 | 866.33 | 2.636 |
  94 +| 4 | 256 | 64 | 4.94 | 1.23 | 809.89 | 743.87 | 1599.74 | 2.684 |
  95 +| 8 | 256 | 32 | 8.25 | 1.03 | 969.5 | 796.64 | 1632.29 | 2.723 |
  96 +| 16 | 256 | 16 | 13.52 | 0.85 | 1183.29 | 1169.28 | 1649.65 | 2.817 |
  97 +| 32 | 256 | 8 | 21.27 | 0.66 | 1504.54 | 1647.92 | 1827.16 | 3.007 |
  98 +| 64 | 256 | 4 | 32.64 | 0.51 | 1961.02 | 1985.8 | 2031.25 | 3.408 |
  99 +
  100 +### Concurrency Sweep (`batch_size=1`)
  101 +
  102 +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  103 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  104 +| 1 | 32 | 32 | 2.16 | 2.16 | 463.18 | 439.54 | 670.78 | 3.408 |
  105 +| 2 | 32 | 32 | 2.15 | 2.15 | 920.48 | 908.27 | 1213.3 | 3.408 |
  106 +| 4 | 32 | 32 | 2.16 | 2.16 | 1759.87 | 1771.58 | 2158.04 | 3.408 |
  107 +| 8 | 32 | 32 | 2.15 | 2.15 | 3284.44 | 3658.45 | 3971.01 | 3.408 |
  108 +| 16 | 32 | 32 | 2.14 | 2.14 | 5669.15 | 7117.7 | 7522.48 | 3.408 |
  109 +| 64 | 32 | 32 | 2.14 | 2.14 | 7631.14 | 7510.97 | 14139.03 | 3.408 |
  110 +
  111 +### Batch x Concurrency Matrix
  112 +
  113 +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  114 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  115 +| 1 | 1 | 32 | 32 | 2.17 | 2.17 | 461.18 | 439.28 | 673.59 | 3.408 |
  116 +| 1 | 2 | 32 | 32 | 2.15 | 2.15 | 919.12 | 904.83 | 1209.83 | 3.408 |
  117 +| 1 | 4 | 32 | 32 | 2.14 | 2.14 | 1771.34 | 1810.99 | 2171.71 | 3.408 |
  118 +| 1 | 8 | 32 | 32 | 2.12 | 2.12 | 3324.85 | 3704.57 | 3999.29 | 3.408 |
  119 +| 1 | 16 | 32 | 32 | 2.15 | 2.15 | 5651.09 | 7121.8 | 7495.67 | 3.408 |
  120 +| 1 | 64 | 32 | 32 | 2.15 | 2.15 | 7619.16 | 7505.14 | 14111.53 | 3.408 |
  121 +| 4 | 1 | 128 | 32 | 5.13 | 1.28 | 780.0 | 661.74 | 1586.9 | 3.408 |
  122 +| 4 | 2 | 128 | 32 | 5.09 | 1.27 | 1551.81 | 1368.81 | 2775.92 | 3.408 |
  123 +| 4 | 4 | 128 | 32 | 5.13 | 1.28 | 2995.57 | 2836.77 | 4532.67 | 3.408 |
  124 +| 4 | 8 | 128 | 32 | 5.16 | 1.29 | 5619.47 | 6046.81 | 7767.22 | 3.408 |
  125 +| 4 | 16 | 128 | 32 | 5.15 | 1.29 | 9546.3 | 12146.68 | 14537.41 | 3.408 |
  126 +| 8 | 1 | 256 | 32 | 8.36 | 1.04 | 957.45 | 784.85 | 1591.01 | 3.408 |
  127 +| 8 | 2 | 256 | 32 | 8.29 | 1.04 | 1906.56 | 1847.53 | 2799.1 | 3.408 |
  128 +| 8 | 4 | 256 | 32 | 8.3 | 1.04 | 3691.59 | 3773.67 | 4849.58 | 3.408 |
  129 +| 8 | 8 | 256 | 32 | 8.29 | 1.04 | 6954.22 | 7681.7 | 8904.0 | 3.408 |
  130 +| 8 | 16 | 256 | 32 | 8.31 | 1.04 | 11923.83 | 14903.17 | 17072.19 | 3.408 |
  131 +| 16 | 1 | 512 | 32 | 13.83 | 0.86 | 1156.84 | 904.4 | 1609.17 | 3.408 |
  132 +| 16 | 2 | 512 | 32 | 13.91 | 0.87 | 2276.35 | 2356.15 | 3121.45 | 3.408 |
  133 +| 16 | 4 | 512 | 32 | 13.89 | 0.87 | 4440.77 | 4733.34 | 5488.65 | 3.408 |
  134 +| 16 | 8 | 512 | 32 | 13.83 | 0.86 | 8428.17 | 9407.5 | 10409.59 | 3.408 |
  135 +| 32 | 1 | 1024 | 32 | 23.28 | 0.73 | 1374.39 | 1603.35 | 1806.99 | 3.408 |
  136 +| 32 | 2 | 1024 | 32 | 23.26 | 0.73 | 2721.28 | 2607.91 | 3437.08 | 3.408 |
  137 +| 32 | 4 | 1024 | 32 | 23.18 | 0.72 | 5297.48 | 5234.53 | 6758.76 | 3.408 |
  138 +| 64 | 1 | 2048 | 32 | 34.97 | 0.55 | 1829.91 | 1899.54 | 2039.18 | 3.69 |
  139 +| 64 | 2 | 2048 | 32 | 34.78 | 0.54 | 3615.7 | 3795.36 | 4014.19 | 3.69 |
  140 +
  141 +## opus-mt-zh-en zh->en
  142 +
  143 +- Direction: `zh -> en`
  144 +- Column: `title_cn`
  145 +- Loaded rows: `2048`
  146 +- Load time: `3.2561 s`
  147 +- Device: `cuda`
  148 +- DType: `torch.float16`
  149 +- Cache disabled: `True`
  150 +
  151 +### Batch Sweep (`concurrency=1`)
  152 +
  153 +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  154 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  155 +| 1 | 256 | 256 | 6.15 | 6.15 | 162.53 | 145.96 | 274.74 | 0.285 |
  156 +| 4 | 256 | 64 | 15.34 | 3.83 | 260.77 | 231.11 | 356.0 | 0.306 |
  157 +| 8 | 256 | 32 | 25.51 | 3.19 | 313.61 | 271.09 | 379.84 | 0.324 |
  158 +| 16 | 256 | 16 | 41.44 | 2.59 | 386.06 | 286.4 | 797.93 | 0.366 |
  159 +| 32 | 256 | 8 | 54.36 | 1.7 | 588.7 | 330.13 | 1693.31 | 0.461 |
  160 +| 64 | 256 | 4 | 70.15 | 1.1 | 912.33 | 447.93 | 2161.59 | 0.642 |
  161 +
  162 +### Concurrency Sweep (`batch_size=1`)
  163 +
  164 +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  165 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  166 +| 1 | 32 | 32 | 9.21 | 9.21 | 108.53 | 91.7 | 179.12 | 0.642 |
  167 +| 2 | 32 | 32 | 8.92 | 8.92 | 219.19 | 212.29 | 305.34 | 0.642 |
  168 +| 4 | 32 | 32 | 9.09 | 9.09 | 411.76 | 420.08 | 583.97 | 0.642 |
  169 +| 8 | 32 | 32 | 8.85 | 8.85 | 784.14 | 835.73 | 1043.06 | 0.642 |
  170 +| 16 | 32 | 32 | 9.01 | 9.01 | 1278.4 | 1483.34 | 1994.56 | 0.642 |
  171 +| 64 | 32 | 32 | 8.82 | 8.82 | 1687.08 | 1563.48 | 3381.58 | 0.642 |
  172 +
  173 +### Batch x Concurrency Matrix
  174 +
  175 +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  176 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  177 +| 1 | 1 | 32 | 32 | 9.01 | 9.01 | 110.96 | 97.37 | 191.91 | 0.642 |
  178 +| 1 | 2 | 32 | 32 | 9.2 | 9.2 | 212.43 | 191.27 | 299.65 | 0.642 |
  179 +| 1 | 4 | 32 | 32 | 8.86 | 8.86 | 423.8 | 423.61 | 596.78 | 0.642 |
  180 +| 1 | 8 | 32 | 32 | 9.04 | 9.04 | 764.39 | 810.05 | 1035.74 | 0.642 |
  181 +| 1 | 16 | 32 | 32 | 9.06 | 9.06 | 1272.52 | 1468.35 | 1998.99 | 0.642 |
  182 +| 1 | 64 | 32 | 32 | 9.04 | 9.04 | 1663.7 | 1556.13 | 3296.1 | 0.642 |
  183 +| 4 | 1 | 128 | 32 | 15.17 | 3.79 | 263.61 | 201.11 | 369.93 | 0.642 |
  184 +| 4 | 2 | 128 | 32 | 15.25 | 3.81 | 517.25 | 397.13 | 1319.45 | 0.642 |
  185 +| 4 | 4 | 128 | 32 | 15.27 | 3.82 | 899.61 | 746.76 | 1914.27 | 0.642 |
  186 +| 4 | 8 | 128 | 32 | 15.39 | 3.85 | 1539.87 | 1466.41 | 2918.85 | 0.642 |
  187 +| 4 | 16 | 128 | 32 | 15.48 | 3.87 | 2455.89 | 2728.17 | 4644.99 | 0.642 |
  188 +| 8 | 1 | 256 | 32 | 25.44 | 3.18 | 314.41 | 272.16 | 386.05 | 0.642 |
  189 +| 8 | 2 | 256 | 32 | 25.42 | 3.18 | 620.41 | 541.88 | 1418.79 | 0.642 |
  190 +| 8 | 4 | 256 | 32 | 24.96 | 3.12 | 1226.44 | 1070.41 | 2940.56 | 0.642 |
  191 +| 8 | 8 | 256 | 32 | 25.44 | 3.18 | 2252.81 | 2148.42 | 4143.66 | 0.642 |
  192 +| 8 | 16 | 256 | 32 | 25.44 | 3.18 | 3961.06 | 5028.31 | 6340.79 | 0.642 |
  193 +| 16 | 1 | 512 | 32 | 31.66 | 1.98 | 505.38 | 321.69 | 1963.58 | 0.653 |
  194 +| 16 | 2 | 512 | 32 | 31.41 | 1.96 | 1009.12 | 677.84 | 2341.01 | 0.653 |
  195 +| 16 | 4 | 512 | 32 | 31.17 | 1.95 | 2006.54 | 1562.68 | 4299.85 | 0.653 |
  196 +| 16 | 8 | 512 | 32 | 31.51 | 1.97 | 3483.96 | 4085.09 | 5902.32 | 0.653 |
  197 +| 32 | 1 | 1024 | 32 | 38.66 | 1.21 | 827.7 | 409.78 | 2162.48 | 0.748 |
  198 +| 32 | 2 | 1024 | 32 | 38.75 | 1.21 | 1613.28 | 916.1 | 3228.1 | 0.748 |
  199 +| 32 | 4 | 1024 | 32 | 38.66 | 1.21 | 3162.05 | 3239.36 | 4992.07 | 0.748 |
  200 +| 64 | 1 | 2048 | 32 | 52.44 | 0.82 | 1220.35 | 533.4 | 2508.12 | 0.939 |
  201 +| 64 | 2 | 2048 | 32 | 52.04 | 0.81 | 2443.4 | 2800.35 | 4765.45 | 0.939 |
  202 +
  203 +## opus-mt-en-zh en->zh
  204 +
  205 +- Direction: `en -> zh`
  206 +- Column: `title`
  207 +- Loaded rows: `2048`
  208 +- Load time: `3.1612 s`
  209 +- Device: `cuda`
  210 +- DType: `torch.float16`
  211 +- Cache disabled: `True`
  212 +
  213 +### Batch Sweep (`concurrency=1`)
  214 +
  215 +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  216 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  217 +| 1 | 256 | 256 | 4.53 | 4.53 | 220.59 | 177.37 | 411.57 | 0.285 |
  218 +| 4 | 256 | 64 | 10.12 | 2.53 | 395.37 | 319.23 | 761.49 | 0.307 |
  219 +| 8 | 256 | 32 | 14.63 | 1.83 | 546.88 | 390.81 | 1930.85 | 0.326 |
  220 +| 16 | 256 | 16 | 24.33 | 1.52 | 657.6 | 451.22 | 2098.54 | 0.37 |
  221 +| 32 | 256 | 8 | 33.91 | 1.06 | 943.59 | 603.45 | 2152.28 | 0.462 |
  222 +| 64 | 256 | 4 | 42.47 | 0.66 | 1507.01 | 1531.43 | 2371.85 | 0.644 |
  223 +
  224 +### Concurrency Sweep (`batch_size=1`)
  225 +
  226 +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  227 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  228 +| 1 | 32 | 32 | 3.6 | 3.6 | 277.73 | 145.85 | 1180.37 | 0.644 |
  229 +| 2 | 32 | 32 | 3.55 | 3.55 | 559.38 | 346.71 | 1916.96 | 0.644 |
  230 +| 4 | 32 | 32 | 3.53 | 3.53 | 997.71 | 721.04 | 2944.17 | 0.644 |
  231 +| 8 | 32 | 32 | 3.51 | 3.51 | 1644.28 | 1590.93 | 3632.99 | 0.644 |
  232 +| 16 | 32 | 32 | 3.5 | 3.5 | 2600.18 | 2586.34 | 5554.04 | 0.644 |
  233 +| 64 | 32 | 32 | 3.52 | 3.52 | 3366.52 | 2780.0 | 7950.41 | 0.644 |
  234 +
  235 +### Batch x Concurrency Matrix
  236 +
  237 +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
  238 +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
  239 +| 1 | 1 | 32 | 32 | 3.54 | 3.54 | 282.22 | 148.26 | 1200.93 | 0.644 |
  240 +| 1 | 2 | 32 | 32 | 3.52 | 3.52 | 563.85 | 346.8 | 1957.63 | 0.644 |
  241 +| 1 | 4 | 32 | 32 | 3.52 | 3.52 | 1002.79 | 728.36 | 2949.99 | 0.644 |
  242 +| 1 | 8 | 32 | 32 | 3.55 | 3.55 | 1611.52 | 1462.84 | 3653.34 | 0.644 |
  243 +| 1 | 16 | 32 | 32 | 3.49 | 3.49 | 2582.77 | 2594.49 | 5504.07 | 0.644 |
  244 +| 1 | 64 | 32 | 32 | 3.53 | 3.53 | 3327.65 | 2753.11 | 7907.54 | 0.644 |
  245 +| 4 | 1 | 128 | 32 | 10.31 | 2.58 | 387.75 | 329.42 | 675.07 | 0.644 |
  246 +| 4 | 2 | 128 | 32 | 10.12 | 2.53 | 784.49 | 713.61 | 1561.27 | 0.644 |
  247 +| 4 | 4 | 128 | 32 | 10.08 | 2.52 | 1543.42 | 1438.42 | 2735.85 | 0.644 |
  248 +| 4 | 8 | 128 | 32 | 10.09 | 2.52 | 2918.06 | 2954.39 | 4327.47 | 0.644 |
  249 +| 4 | 16 | 128 | 32 | 10.16 | 2.54 | 5087.62 | 5734.85 | 7343.34 | 0.644 |
  250 +| 8 | 1 | 256 | 32 | 14.49 | 1.81 | 552.16 | 402.18 | 1963.8 | 0.644 |
  251 +| 8 | 2 | 256 | 32 | 14.54 | 1.82 | 1088.96 | 845.04 | 2565.11 | 0.644 |
  252 +| 8 | 4 | 256 | 32 | 14.48 | 1.81 | 2143.18 | 1832.69 | 4487.97 | 0.644 |
  253 +| 8 | 8 | 256 | 32 | 14.51 | 1.81 | 4051.04 | 3734.16 | 5976.18 | 0.644 |
  254 +| 8 | 16 | 256 | 32 | 14.55 | 1.82 | 6626.6 | 6868.91 | 9395.9 | 0.644 |
  255 +| 16 | 1 | 512 | 32 | 19.11 | 1.19 | 837.36 | 457.51 | 2030.2 | 0.661 |
  256 +| 16 | 2 | 512 | 32 | 19.26 | 1.2 | 1599.87 | 1141.06 | 2802.44 | 0.661 |
  257 +| 16 | 4 | 512 | 32 | 19.1 | 1.19 | 3130.81 | 3166.45 | 4987.49 | 0.661 |
  258 +| 16 | 8 | 512 | 32 | 19.27 | 1.2 | 5735.13 | 5351.82 | 8417.85 | 0.661 |
  259 +| 32 | 1 | 1024 | 32 | 26.04 | 0.81 | 1228.66 | 648.06 | 2167.39 | 0.751 |
  260 +| 32 | 2 | 1024 | 32 | 26.21 | 0.82 | 2372.52 | 2593.74 | 4206.96 | 0.751 |
  261 +| 32 | 4 | 1024 | 32 | 26.11 | 0.82 | 4499.96 | 3977.65 | 6920.57 | 0.751 |
  262 +| 64 | 1 | 2048 | 32 | 34.94 | 0.55 | 1831.48 | 2366.24 | 2473.74 | 0.942 |
  263 +| 64 | 2 | 2048 | 32 | 34.75 | 0.54 | 3603.99 | 3281.6 | 4948.72 | 0.942 |
... ...
scripts/benchmark_translation_local_models.py
... ... @@ -4,6 +4,7 @@
4 4 from __future__ import annotations
5 5  
6 6 import argparse
  7 +import concurrent.futures
7 8 import copy
8 9 import csv
9 10 import json
... ... @@ -16,7 +17,7 @@ import sys
16 17 import time
17 18 from datetime import datetime
18 19 from pathlib import Path
19   -from typing import Any, Dict, Iterable, List
  20 +from typing import Any, Dict, Iterable, List, Sequence
20 21  
21 22 import torch
22 23 import transformers
... ... @@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402
30 31 from translation.settings import get_translation_capability # noqa: E402
31 32  
32 33  
  34 +DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64]
  35 +DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64]
  36 +
33 37 SCENARIOS: List[Dict[str, str]] = [
34 38 {
35 39 "name": "nllb-200-distilled-600m zh->en",
... ... @@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [
69 73 def parse_args() -> argparse.Namespace:
70 74 parser = argparse.ArgumentParser(description="Benchmark local translation models")
71 75 parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path")
72   - parser.add_argument("--limit", type=int, default=0, help="Limit rows for faster experiments; 0 means all")
  76 + parser.add_argument("--limit", type=int, default=0, help="Limit rows for baseline or single-case run; 0 means all")
73 77 parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports")
74 78 parser.add_argument("--single", action="store_true", help="Run a single scenario in-process")
75 79 parser.add_argument("--model", default="", help="Model name for --single mode")
... ... @@ -84,9 +88,67 @@ def parse_args() -> argparse.Namespace:
84 88 parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams")
85 89 parser.add_argument("--attn-implementation", default="", help="Override attention implementation, for example sdpa")
86 90 parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring")
  91 + parser.add_argument("--disable-cache", action="store_true", help="Disable translation cache during benchmarks")
  92 + parser.add_argument(
  93 + "--suite",
  94 + choices=["baseline", "extended"],
  95 + default="baseline",
  96 + help="baseline keeps the previous all-scenarios summary; extended adds batch/concurrency/matrix sweeps",
  97 + )
  98 + parser.add_argument(
  99 + "--batch-size-list",
  100 + default="",
  101 + help="Comma-separated batch sizes for extended suite; default 1,4,8,16,32,64",
  102 + )
  103 + parser.add_argument(
  104 + "--concurrency-list",
  105 + default="",
  106 + help="Comma-separated concurrency levels for extended suite; default 1,2,4,8,16,64",
  107 + )
  108 + parser.add_argument(
  109 + "--serial-items-per-case",
  110 + type=int,
  111 + default=512,
  112 + help="Items per batch-size case in extended suite",
  113 + )
  114 + parser.add_argument(
  115 + "--concurrency-requests-per-case",
  116 + type=int,
  117 + default=128,
  118 + help="Requests per concurrency or matrix case in extended suite",
  119 + )
  120 + parser.add_argument(
  121 + "--concurrency-batch-size",
  122 + type=int,
  123 + default=1,
  124 + help="Batch size used by the dedicated concurrency sweep",
  125 + )
  126 + parser.add_argument(
  127 + "--max-batch-concurrency-product",
  128 + type=int,
  129 + default=128,
  130 + help="Skip matrix cases where batch_size * concurrency exceeds this value; 0 disables the limit",
  131 + )
87 132 return parser.parse_args()
88 133  
89 134  
  135 +def parse_csv_ints(raw: str, fallback: Sequence[int]) -> List[int]:
  136 + if not raw.strip():
  137 + return list(fallback)
  138 + values: List[int] = []
  139 + for item in raw.split(","):
  140 + stripped = item.strip()
  141 + if not stripped:
  142 + continue
  143 + value = int(stripped)
  144 + if value <= 0:
  145 + raise ValueError(f"Expected positive integer, got {value}")
  146 + values.append(value)
  147 + if not values:
  148 + raise ValueError("Parsed empty integer list")
  149 + return values
  150 +
  151 +
90 152 def load_texts(csv_path: Path, column: str, limit: int) -> List[str]:
91 153 texts: List[str] = []
92 154 with csv_path.open("r", encoding="utf-8") as handle:
... ... @@ -102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -> List[str]:
102 164 return texts
103 165  
104 166  
105   -def batched(values: List[str], batch_size: int) -> Iterable[List[str]]:
  167 +def batched(values: Sequence[str], batch_size: int) -> Iterable[List[str]]:
106 168 for start in range(0, len(values), batch_size):
107   - yield values[start:start + batch_size]
  169 + yield list(values[start:start + batch_size])
108 170  
109 171  
110 172 def percentile(values: List[float], p: float) -> float:
... ... @@ -148,15 +210,34 @@ def build_environment_info() -> Dict[str, Any]:
148 210 }
149 211  
150 212  
151   -def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
152   - csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
  213 +def scenario_from_args(args: argparse.Namespace) -> Dict[str, str]:
  214 + return {
  215 + "name": f"{args.model} {args.source_lang}->{args.target_lang}",
  216 + "model": args.model,
  217 + "source_lang": args.source_lang,
  218 + "target_lang": args.target_lang,
  219 + "column": args.column,
  220 + "scene": args.scene,
  221 + }
  222 +
  223 +
  224 +def build_config_and_capability(
  225 + args: argparse.Namespace,
  226 + *,
  227 + batch_size_override: int | None = None,
  228 +) -> tuple[Dict[str, Any], Dict[str, Any]]:
153 229 config = copy.deepcopy(get_translation_config())
  230 + for name, cfg in config["capabilities"].items():
  231 + cfg["enabled"] = name == args.model
  232 + config["default_model"] = args.model
154 233 capability = get_translation_capability(config, args.model, require_enabled=False)
155 234 if args.device_override:
156 235 capability["device"] = args.device_override
157 236 if args.torch_dtype_override:
158 237 capability["torch_dtype"] = args.torch_dtype_override
159   - if args.batch_size:
  238 + if batch_size_override is not None:
  239 + capability["batch_size"] = batch_size_override
  240 + elif args.batch_size:
160 241 capability["batch_size"] = args.batch_size
161 242 if args.max_new_tokens:
162 243 capability["max_new_tokens"] = args.max_new_tokens
... ... @@ -164,28 +245,59 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
164 245 capability["num_beams"] = args.num_beams
165 246 if args.attn_implementation:
166 247 capability["attn_implementation"] = args.attn_implementation
  248 + if args.disable_cache:
  249 + capability["use_cache"] = False
167 250 config["capabilities"][args.model] = capability
168   - configured_batch_size = int(capability.get("batch_size") or 1)
169   - batch_size = configured_batch_size
170   - texts = load_texts(csv_path, args.column, args.limit)
  251 + return config, capability
171 252  
172   - service = TranslationService(config)
  253 +
  254 +def ensure_cuda_stats_reset() -> None:
173 255 if torch.cuda.is_available():
174 256 torch.cuda.empty_cache()
175 257 torch.cuda.reset_peak_memory_stats()
176 258  
177   - load_start = time.perf_counter()
178   - backend = service.get_backend(args.model)
179   - load_seconds = time.perf_counter() - load_start
180 259  
181   - warmup_batches = min(max(args.warmup_batches, 0), max(1, math.ceil(len(texts) / batch_size)))
182   - for batch in list(batched(texts, batch_size))[:warmup_batches]:
  260 +def build_memory_metrics() -> Dict[str, Any]:
  261 + peak_gpu_mem_gb = None
  262 + peak_gpu_reserved_gb = None
  263 + if torch.cuda.is_available():
  264 + peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3)
  265 + peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3)
  266 + max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2)
  267 + return {
  268 + "max_rss_mb": max_rss_mb,
  269 + "peak_gpu_memory_gb": peak_gpu_mem_gb,
  270 + "peak_gpu_reserved_gb": peak_gpu_reserved_gb,
  271 + }
  272 +
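Note that `build_memory_metrics` divides `ru_maxrss` by 1024, which is correct on Linux (where `getrusage` reports kilobytes) but not on macOS (where it reports bytes). A portable variant could normalize the unit; `max_rss_mb` is a hypothetical helper name, not part of this commit:

```python
import resource
import sys


def max_rss_mb() -> float:
    """Peak resident set size in MB, normalizing the platform difference:
    Linux reports ru_maxrss in kilobytes, macOS in bytes."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return round(rss / (1024 ** 2), 2)
    return round(rss / 1024, 2)


print(max_rss_mb())
```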
  273 +
  274 +def make_request_payload(batch: Sequence[str]) -> str | List[str]:
  275 + if len(batch) == 1:
  276 + return batch[0]
  277 + return list(batch)
  278 +
  279 +
  280 +def benchmark_serial_case(
  281 + *,
  282 + service: TranslationService,
  283 + backend: Any,
  284 + scenario: Dict[str, str],
  285 + capability: Dict[str, Any],
  286 + texts: List[str],
  287 + batch_size: int,
  288 + warmup_batches: int,
  289 +) -> Dict[str, Any]:
  290 + backend.batch_size = batch_size
  291 + measured_batches = list(batched(texts, batch_size))
  292 + warmup_count = min(max(warmup_batches, 0), len(measured_batches))
  293 +
  294 + for batch in measured_batches[:warmup_count]:
183 295 service.translate(
184   - text=batch,
185   - source_lang=args.source_lang,
186   - target_lang=args.target_lang,
187   - model=args.model,
188   - scene=args.scene,
  296 + text=make_request_payload(batch),
  297 + source_lang=scenario["source_lang"],
  298 + target_lang=scenario["target_lang"],
  299 + model=scenario["model"],
  300 + scene=scenario["scene"],
189 301 )
190 302  
191 303 batch_latencies_ms: List[float] = []
... ... @@ -193,86 +305,318 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
193 305 failure_count = 0
194 306 output_chars = 0
195 307 total_input_chars = sum(len(text) for text in texts)
196   - measured_batches = list(batched(texts, batch_size))
197 308  
198 309 start = time.perf_counter()
199 310 for batch in measured_batches:
200 311 batch_start = time.perf_counter()
201 312 outputs = service.translate(
202   - text=batch,
203   - source_lang=args.source_lang,
204   - target_lang=args.target_lang,
205   - model=args.model,
206   - scene=args.scene,
  313 + text=make_request_payload(batch),
  314 + source_lang=scenario["source_lang"],
  315 + target_lang=scenario["target_lang"],
  316 + model=scenario["model"],
  317 + scene=scenario["scene"],
207 318 )
208 319 elapsed_ms = (time.perf_counter() - batch_start) * 1000
209 320 batch_latencies_ms.append(elapsed_ms)
210 321  
211   - if not isinstance(outputs, list):
212   - raise RuntimeError(f"Expected list output for batch translation, got {type(outputs)!r}")
213   - for item in outputs:
  322 + if isinstance(outputs, list):
  323 + result_items = outputs
  324 + else:
  325 + result_items = [outputs]
  326 + for item in result_items:
214 327 if item is None:
215 328 failure_count += 1
216 329 else:
217 330 success_count += 1
218 331 output_chars += len(item)
219 332 translate_seconds = time.perf_counter() - start
  333 + total_items = len(texts)
  334 + memory = build_memory_metrics()
220 335  
221   - peak_gpu_mem_gb = None
222   - peak_gpu_reserved_gb = None
223   - if torch.cuda.is_available():
224   - peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3)
225   - peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3)
  336 + return {
  337 + "mode": "serial_batch",
  338 + "batch_size": batch_size,
  339 + "concurrency": 1,
  340 + "rows": total_items,
  341 + "requests": len(measured_batches),
  342 + "input_chars": total_input_chars,
  343 + "load_seconds": 0.0,
  344 + "translate_seconds": round(translate_seconds, 4),
  345 + "total_seconds": round(translate_seconds, 4),
  346 + "batch_count": len(batch_latencies_ms),
  347 + "request_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
  348 + "request_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
  349 + "request_latency_max_ms": round(max(batch_latencies_ms), 2),
  350 + "avg_request_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
  351 + "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3),
  352 + "requests_per_second": round(len(measured_batches) / translate_seconds, 2),
  353 + "items_per_second": round(total_items / translate_seconds, 2),
  354 + "input_chars_per_second": round(total_input_chars / translate_seconds, 2),
  355 + "output_chars_per_second": round(output_chars / translate_seconds, 2),
  356 + "success_count": success_count,
  357 + "failure_count": failure_count,
  358 + "success_rate": round(success_count / total_items, 6),
  359 + "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
  360 + "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
  361 + "configured_batch_size": int(capability.get("batch_size") or batch_size),
  362 + "used_batch_size": batch_size,
  363 + "warmup_batches": warmup_count,
  364 + **memory,
  365 + }
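`benchmark_serial_case` leans on two helpers defined earlier in the script, outside this hunk: `batched` and `percentile`. Plausible reconstructions, assuming chunked iteration and linear-interpolation percentiles (the real definitions may differ):

```python
import math
from itertools import islice
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between adjacent ranks."""
    data = sorted(values)
    if not data:
        raise ValueError("percentile of empty sequence")
    pos = (len(data) - 1) * q
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:
        return data[lo]
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


print(list(batched(["a", "b", "c"], 2)))          # [['a', 'b'], ['c']]
print(percentile([10.0, 20.0, 30.0, 40.0], 0.5))  # 25.0
```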
226 366  
227   - max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2)
228   - total_items = len(texts)
  367 +
  368 +def benchmark_concurrency_case(
  369 + *,
  370 + service: TranslationService,
  371 + backend: Any,
  372 + scenario: Dict[str, str],
  373 + capability: Dict[str, Any],
  374 + texts: List[str],
  375 + batch_size: int,
  376 + concurrency: int,
  377 + requests_per_case: int,
  378 + warmup_batches: int,
  379 +) -> Dict[str, Any]:
  380 + backend.batch_size = batch_size
  381 + required_items = batch_size * requests_per_case
  382 + case_texts = texts[:required_items]
  383 + request_batches = list(batched(case_texts, batch_size))
  384 + if not request_batches:
  385 + raise ValueError("No request batches prepared for concurrency benchmark")
  386 + warmup_count = min(max(warmup_batches, 0), len(request_batches))
  387 +
  388 + for batch in request_batches[:warmup_count]:
  389 + service.translate(
  390 + text=make_request_payload(batch),
  391 + source_lang=scenario["source_lang"],
  392 + target_lang=scenario["target_lang"],
  393 + model=scenario["model"],
  394 + scene=scenario["scene"],
  395 + )
  396 +
  397 + request_latencies_ms: List[float] = []
  398 + success_count = 0
  399 + failure_count = 0
  400 + output_chars = 0
  401 + total_input_chars = sum(len(text) for text in case_texts)
  402 +
  403 + def worker(batch: List[str]) -> tuple[float, int, int, int]:
  404 + started = time.perf_counter()
  405 + outputs = service.translate(
  406 + text=make_request_payload(batch),
  407 + source_lang=scenario["source_lang"],
  408 + target_lang=scenario["target_lang"],
  409 + model=scenario["model"],
  410 + scene=scenario["scene"],
  411 + )
  412 + elapsed_ms = (time.perf_counter() - started) * 1000
  413 + if isinstance(outputs, list):
  414 + result_items = outputs
  415 + else:
  416 + result_items = [outputs]
  417 + local_success = 0
  418 + local_failure = 0
  419 + local_output_chars = 0
  420 + for item in result_items:
  421 + if item is None:
  422 + local_failure += 1
  423 + else:
  424 + local_success += 1
  425 + local_output_chars += len(item)
  426 + return elapsed_ms, local_success, local_failure, local_output_chars
  427 +
  428 + wall_start = time.perf_counter()
  429 + with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
  430 + futures = [executor.submit(worker, batch) for batch in request_batches]
  431 + for future in concurrent.futures.as_completed(futures):
  432 + latency_ms, local_success, local_failure, local_output_chars = future.result()
  433 + request_latencies_ms.append(latency_ms)
  434 + success_count += local_success
  435 + failure_count += local_failure
  436 + output_chars += local_output_chars
  437 + wall_seconds = time.perf_counter() - wall_start
  438 + total_items = len(case_texts)
  439 + memory = build_memory_metrics()
229 440  
230 441 return {
231   - "scenario": {
232   - "name": f"{args.model} {args.source_lang}->{args.target_lang}",
233   - "model": args.model,
234   - "source_lang": args.source_lang,
235   - "target_lang": args.target_lang,
236   - "column": args.column,
237   - "scene": args.scene,
  442 + "mode": "concurrency",
  443 + "batch_size": batch_size,
  444 + "concurrency": concurrency,
  445 + "rows": total_items,
  446 + "requests": len(request_batches),
  447 + "input_chars": total_input_chars,
  448 + "load_seconds": 0.0,
  449 + "translate_seconds": round(wall_seconds, 4),
  450 + "total_seconds": round(wall_seconds, 4),
  451 + "batch_count": len(request_latencies_ms),
  452 + "request_latency_p50_ms": round(percentile(request_latencies_ms, 0.50), 2),
  453 + "request_latency_p95_ms": round(percentile(request_latencies_ms, 0.95), 2),
  454 + "request_latency_max_ms": round(max(request_latencies_ms), 2),
  455 + "avg_request_latency_ms": round(statistics.fmean(request_latencies_ms), 2),
  456 + "avg_item_latency_ms": round((wall_seconds / total_items) * 1000, 3),
  457 + "requests_per_second": round(len(request_batches) / wall_seconds, 2),
  458 + "items_per_second": round(total_items / wall_seconds, 2),
  459 + "input_chars_per_second": round(total_input_chars / wall_seconds, 2),
  460 + "output_chars_per_second": round(output_chars / wall_seconds, 2),
  461 + "success_count": success_count,
  462 + "failure_count": failure_count,
  463 + "success_rate": round(success_count / total_items, 6),
  464 + "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
  465 + "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
  466 + "configured_batch_size": int(capability.get("batch_size") or batch_size),
  467 + "used_batch_size": batch_size,
  468 + "warmup_batches": warmup_count,
  469 + **memory,
  470 + }
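The commit's conclusion (with a single-model lock in the backend, client concurrency mostly converts queueing into higher p95 rather than more throughput) can be demonstrated with a toy stand-in. This is a sketch under assumed conditions: a lock-serialized 20 ms service, all names hypothetical:

```python
import concurrent.futures
import threading
import time

SERVICE_MS = 20                 # hypothetical per-request model time
model_lock = threading.Lock()   # stand-in for the backend's single-model lock


def fake_translate() -> float:
    """Return client-observed latency in ms, including time spent waiting on the lock."""
    start = time.perf_counter()
    with model_lock:            # requests serialize here, like the real backend
        time.sleep(SERVICE_MS / 1000)
    return (time.perf_counter() - start) * 1000


def run(concurrency: int, requests: int = 16) -> tuple:
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = [f.result() for f in concurrent.futures.as_completed(
            [pool.submit(fake_translate) for _ in range(requests)])]
    wall = time.perf_counter() - start
    p95 = sorted(latencies)[int(0.95 * (len(latencies) - 1))]
    return round(requests / wall, 1), round(p95, 1)


for c in (1, 4, 16):
    rps, p95 = run(c)
    print(f"concurrency={c}: {rps} req/s, p95={p95} ms")
```

Throughput stays pinned near `1000 / SERVICE_MS` requests per second at every concurrency level, while p95 grows roughly linearly with concurrency, which matches the "latency budget, not capacity" reading of the `concurrency_sweep` tables.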
  471 +
  472 +
  473 +def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
  474 + csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
  475 + scenario = scenario_from_args(args)
  476 + config, capability = build_config_and_capability(args)
  477 + configured_batch_size = int(capability.get("batch_size") or 1)
  478 + batch_size = configured_batch_size
  479 + texts = load_texts(csv_path, args.column, args.limit)
  480 +
  481 + ensure_cuda_stats_reset()
  482 + load_start = time.perf_counter()
  483 + service = TranslationService(config)
  484 + backend = service.get_backend(args.model)
  485 + load_seconds = time.perf_counter() - load_start
  486 +
  487 + runtime = benchmark_serial_case(
  488 + service=service,
  489 + backend=backend,
  490 + scenario=scenario,
  491 + capability=capability,
  492 + texts=texts,
  493 + batch_size=batch_size,
  494 + warmup_batches=args.warmup_batches,
  495 + )
  496 + runtime["load_seconds"] = round(load_seconds, 4)
  497 + runtime["total_seconds"] = round(runtime["load_seconds"] + runtime["translate_seconds"], 4)
  498 +
  499 + return {
  500 + "scenario": scenario,
  501 + "dataset": {
  502 + "csv_path": str(csv_path),
  503 + "rows": len(texts),
  504 + "input_chars": sum(len(text) for text in texts),
238 505 },
  506 + "runtime": runtime,
  507 + }
  508 +
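`benchmark_extended_scenario` below parses its sweep lists with `parse_csv_ints` plus the `DEFAULT_BATCH_SIZES` / `DEFAULT_CONCURRENCIES` constants, all defined earlier in the script and outside this hunk. A plausible reconstruction, assuming comma-separated flag values like `--batch-size-list 1,4,8` and the defaults named in the commit message:

```python
from typing import List, Optional, Sequence

# Assumed defaults, per the commit message; the script's actual constants may differ.
DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64]
DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64]


def parse_csv_ints(raw: Optional[str], default: Sequence[int]) -> List[int]:
    """Parse a comma-separated flag value into ints, falling back to the defaults."""
    if not raw:
        return list(default)
    values = [int(part) for part in raw.split(",") if part.strip()]
    return values or list(default)


print(parse_csv_ints("1,4,8,16,32,64", DEFAULT_BATCH_SIZES))
print(parse_csv_ints(None, DEFAULT_CONCURRENCIES))
```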
  509 +
  510 +def benchmark_extended_scenario(args: argparse.Namespace) -> Dict[str, Any]:
  511 + csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
  512 + scenario = scenario_from_args(args)
  513 + batch_sizes = parse_csv_ints(args.batch_size_list, DEFAULT_BATCH_SIZES)
  514 + concurrencies = parse_csv_ints(args.concurrency_list, DEFAULT_CONCURRENCIES)
  515 + largest_batch = max(batch_sizes + [args.concurrency_batch_size])
  516 + largest_concurrency = max(concurrencies)
  517 + max_product = args.max_batch_concurrency_product
  518 + required_items = max(
  519 + args.limit or 0,
  520 + max(args.serial_items_per_case, largest_batch),
  521 + args.concurrency_requests_per_case * args.concurrency_batch_size,
  522 + largest_batch * args.concurrency_requests_per_case,
  523 + )
  524 + texts = load_texts(csv_path, args.column, required_items)
  525 + config, capability = build_config_and_capability(args)
  526 +
  527 + ensure_cuda_stats_reset()
  528 + load_start = time.perf_counter()
  529 + service = TranslationService(config)
  530 + backend = service.get_backend(args.model)
  531 + load_seconds = time.perf_counter() - load_start
  532 +
  533 + batch_sweep: List[Dict[str, Any]] = []
  534 + concurrency_sweep: List[Dict[str, Any]] = []
  535 + matrix_results: List[Dict[str, Any]] = []
  536 +
  537 + for batch_size in batch_sizes:
  538 + case_texts = texts[: max(batch_size, args.serial_items_per_case)]
  539 + batch_sweep.append(
  540 + benchmark_serial_case(
  541 + service=service,
  542 + backend=backend,
  543 + scenario=scenario,
  544 + capability=capability,
  545 + texts=case_texts,
  546 + batch_size=batch_size,
  547 + warmup_batches=args.warmup_batches,
  548 + )
  549 + )
  550 +
  551 + for concurrency in concurrencies:
  552 + concurrency_sweep.append(
  553 + benchmark_concurrency_case(
  554 + service=service,
  555 + backend=backend,
  556 + scenario=scenario,
  557 + capability=capability,
  558 + texts=texts,
  559 + batch_size=args.concurrency_batch_size,
  560 + concurrency=concurrency,
  561 + requests_per_case=args.concurrency_requests_per_case,
  562 + warmup_batches=args.warmup_batches,
  563 + )
  564 + )
  565 +
  566 + for batch_size in batch_sizes:
  567 + for concurrency in concurrencies:
  568 + if max_product > 0 and batch_size * concurrency > max_product:
  569 + continue
  570 + matrix_results.append(
  571 + benchmark_concurrency_case(
  572 + service=service,
  573 + backend=backend,
  574 + scenario=scenario,
  575 + capability=capability,
  576 + texts=texts,
  577 + batch_size=batch_size,
  578 + concurrency=concurrency,
  579 + requests_per_case=args.concurrency_requests_per_case,
  580 + warmup_batches=args.warmup_batches,
  581 + )
  582 + )
  583 +
  584 + for collection in (batch_sweep, concurrency_sweep, matrix_results):
  585 + for idx, item in enumerate(collection):
  586 + item["load_seconds"] = round(load_seconds if idx == 0 else 0.0, 4)
  587 + item["total_seconds"] = round(item["load_seconds"] + item["translate_seconds"], 4)
  588 +
  589 + return {
  590 + "scenario": scenario,
239 591 "dataset": {
240 592 "csv_path": str(csv_path),
241   - "rows": total_items,
242   - "input_chars": total_input_chars,
  593 + "rows_loaded": len(texts),
  594 + },
  595 + "config": {
  596 + "batch_sizes": batch_sizes,
  597 + "concurrencies": concurrencies,
  598 + "serial_items_per_case": args.serial_items_per_case,
  599 + "concurrency_requests_per_case": args.concurrency_requests_per_case,
  600 + "concurrency_batch_size": args.concurrency_batch_size,
  601 + "max_batch_concurrency_product": max_product,
  602 + "cache_disabled": bool(args.disable_cache),
243 603 },
244   - "runtime": {
  604 + "runtime_defaults": {
245 605 "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
246 606 "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
247   - "configured_batch_size": configured_batch_size,
248   - "used_batch_size": batch_size,
249   - "warmup_batches": warmup_batches,
  607 + "configured_batch_size": int(capability.get("batch_size") or 1),
250 608 "load_seconds": round(load_seconds, 4),
251   - "translate_seconds": round(translate_seconds, 4),
252   - "total_seconds": round(load_seconds + translate_seconds, 4),
253   - "batch_count": len(batch_latencies_ms),
254   - "first_batch_ms": round(batch_latencies_ms[0], 2),
255   - "batch_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
256   - "batch_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
257   - "batch_latency_max_ms": round(max(batch_latencies_ms), 2),
258   - "avg_batch_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
259   - "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3),
260   - "items_per_second": round(total_items / translate_seconds, 2),
261   - "input_chars_per_second": round(total_input_chars / translate_seconds, 2),
262   - "output_chars_per_second": round(output_chars / translate_seconds, 2),
263   - "success_count": success_count,
264   - "failure_count": failure_count,
265   - "success_rate": round(success_count / total_items, 6),
266   - "max_rss_mb": max_rss_mb,
267   - "peak_gpu_memory_gb": peak_gpu_mem_gb,
268   - "peak_gpu_reserved_gb": peak_gpu_reserved_gb,
269 609 },
  610 + "batch_sweep": batch_sweep,
  611 + "concurrency_sweep": concurrency_sweep,
  612 + "matrix": matrix_results,
270 613 }
271 614  
272 615  
273 616 def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
274 617 report = {
275 618 "generated_at": datetime.now().isoformat(timespec="seconds"),
  619 + "suite": args.suite,
276 620 "environment": build_environment_info(),
277 621 "scenarios": [],
278 622 }
... ... @@ -296,11 +640,25 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
296 640 scenario["scene"],
297 641 "--warmup-batches",
298 642 str(args.warmup_batches),
  643 + "--suite",
  644 + args.suite,
  645 + "--serial-items-per-case",
  646 + str(args.serial_items_per_case),
  647 + "--concurrency-requests-per-case",
  648 + str(args.concurrency_requests_per_case),
  649 + "--concurrency-batch-size",
  650 + str(args.concurrency_batch_size),
  651 + "--max-batch-concurrency-product",
  652 + str(args.max_batch_concurrency_product),
299 653 ]
300 654 if args.limit:
301 655 cmd.extend(["--limit", str(args.limit)])
302 656 if args.batch_size:
303 657 cmd.extend(["--batch-size", str(args.batch_size)])
  658 + if args.batch_size_list:
  659 + cmd.extend(["--batch-size-list", args.batch_size_list])
  660 + if args.concurrency_list:
  661 + cmd.extend(["--concurrency-list", args.concurrency_list])
304 662 if args.device_override:
305 663 cmd.extend(["--device-override", args.device_override])
306 664 if args.torch_dtype_override:
... ... @@ -311,6 +669,8 @@ def run_all_scenarios(args: argparse.Namespace) -&gt; Dict[str, Any]:
311 669 cmd.extend(["--num-beams", str(args.num_beams)])
312 670 if args.attn_implementation:
313 671 cmd.extend(["--attn-implementation", args.attn_implementation])
  672 + if args.disable_cache:
  673 + cmd.append("--disable-cache")
314 674  
315 675 completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
316 676 result_line = ""
... ... @@ -327,11 +687,12 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
327 687 return report
328 688  
329 689  
330   -def render_markdown_report(report: Dict[str, Any]) -> str:
  690 +def render_baseline_markdown_report(report: Dict[str, Any]) -> str:
331 691 lines = [
332 692 "# Local Translation Model Benchmark",
333 693 "",
334 694 f"- Generated at: `{report['generated_at']}`",
  695 + f"- Suite: `{report['suite']}`",
335 696 f"- Python: `{report['environment']['python']}`",
336 697 f"- Torch: `{report['environment']['torch']}`",
337 698 f"- Transformers: `{report['environment']['transformers']}`",
... ... @@ -342,19 +703,19 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
342 703 lines.extend(
343 704 [
344 705 "",
345   - "| Scenario | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Load s | Peak GPU GiB | Success |",
  706 + "| Scenario | Items/s | Avg item ms | Req p50 ms | Req p95 ms | Load s | Peak GPU GiB | Success |",
346 707 "|---|---:|---:|---:|---:|---:|---:|---:|",
347 708 ]
348 709 )
349 710 for item in report["scenarios"]:
350 711 runtime = item["runtime"]
351 712 lines.append(
352   - "| {name} | {items_per_second} | {avg_item_latency_ms} | {batch_latency_p50_ms} | {batch_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format(
  713 + "| {name} | {items_per_second} | {avg_item_latency_ms} | {request_latency_p50_ms} | {request_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format(
353 714 name=item["scenario"]["name"],
354 715 items_per_second=runtime["items_per_second"],
355 716 avg_item_latency_ms=runtime["avg_item_latency_ms"],
356   - batch_latency_p50_ms=runtime["batch_latency_p50_ms"],
357   - batch_latency_p95_ms=runtime["batch_latency_p95_ms"],
  717 + request_latency_p50_ms=runtime["request_latency_p50_ms"],
  718 + request_latency_p95_ms=runtime["request_latency_p95_ms"],
358 719 load_seconds=runtime["load_seconds"],
359 720 peak_gpu_memory_gb=runtime["peak_gpu_memory_gb"],
360 721 success_rate=runtime["success_rate"],
... ... @@ -375,7 +736,7 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
375 736 f"- Load time: `{runtime['load_seconds']} s`",
376 737 f"- Translate time: `{runtime['translate_seconds']} s`",
377 738 f"- Throughput: `{runtime['items_per_second']} items/s`, `{runtime['input_chars_per_second']} input chars/s`",
378   - f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, batch p50 `{runtime['batch_latency_p50_ms']} ms`, batch p95 `{runtime['batch_latency_p95_ms']} ms`, batch max `{runtime['batch_latency_max_ms']} ms`",
  739 + f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, req p50 `{runtime['request_latency_p50_ms']} ms`, req p95 `{runtime['request_latency_p95_ms']} ms`, req max `{runtime['request_latency_max_ms']} ms`",
379 740 f"- Memory: max RSS `{runtime['max_rss_mb']} MB`, peak GPU allocated `{runtime['peak_gpu_memory_gb']} GiB`, peak GPU reserved `{runtime['peak_gpu_reserved_gb']} GiB`",
380 741 f"- Success: `{runtime['success_count']}/{dataset['rows']}`",
381 742 "",
... ... @@ -384,32 +745,145 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
384 745 return "\n".join(lines)
385 746  
386 747  
  748 +def render_case_table(
  749 + title: str,
  750 + rows: Sequence[Dict[str, Any]],
  751 + *,
  752 + include_batch: bool,
  753 + include_concurrency: bool,
  754 +) -> List[str]:
  755 + headers = ["Rows", "Requests", "Items/s", "Req/s", "Avg req ms", "Req p50 ms", "Req p95 ms", "Peak GPU GiB"]
  756 + prefix_headers: List[str] = []
  757 + if include_batch:
  758 + prefix_headers.append("Batch")
  759 + if include_concurrency:
  760 + prefix_headers.append("Concurrency")
  761 + headers = prefix_headers + headers
  762 + lines = [f"### {title}", ""]
  763 + lines.append("| " + " | ".join(headers) + " |")
  764 + lines.append("|" + "|".join(["---:"] * len(headers)) + "|")
  765 + for item in rows:
  766 + values: List[str] = []
  767 + if include_batch:
  768 + values.append(str(item["batch_size"]))
  769 + if include_concurrency:
  770 + values.append(str(item["concurrency"]))
  771 + values.extend(
  772 + [
  773 + str(item["rows"]),
  774 + str(item["requests"]),
  775 + str(item["items_per_second"]),
  776 + str(item["requests_per_second"]),
  777 + str(item["avg_request_latency_ms"]),
  778 + str(item["request_latency_p50_ms"]),
  779 + str(item["request_latency_p95_ms"]),
  780 + str(item["peak_gpu_memory_gb"]),
  781 + ]
  782 + )
  783 + lines.append("| " + " | ".join(values) + " |")
  784 + lines.append("")
  785 + return lines
  786 +
  787 +
  788 +def render_extended_markdown_report(report: Dict[str, Any]) -> str:
  789 + lines = [
  790 + "# Local Translation Model Extended Benchmark",
  791 + "",
  792 + f"- Generated at: `{report['generated_at']}`",
  793 + f"- Suite: `{report['suite']}`",
  794 + f"- Python: `{report['environment']['python']}`",
  795 + f"- Torch: `{report['environment']['torch']}`",
  796 + f"- Transformers: `{report['environment']['transformers']}`",
  797 + f"- CUDA: `{report['environment']['cuda_available']}`",
  798 + ]
  799 + if report["environment"]["gpu_name"]:
  800 + lines.append(f"- GPU: `{report['environment']['gpu_name']}` ({report['environment']['gpu_total_mem_gb']} GiB)")
  801 +
  802 + lines.extend(
  803 + [
  804 + "",
  805 + "## Reading Guide",
  806 + "",
  807 + "- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.",
  808 + "- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.",
  809 + "- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.",
  810 + "",
  811 + ]
  812 + )
  813 +
  814 + for item in report["scenarios"]:
  815 + lines.extend(
  816 + [
  817 + f"## {item['scenario']['name']}",
  818 + "",
  819 + f"- Direction: `{item['scenario']['source_lang']} -> {item['scenario']['target_lang']}`",
  820 + f"- Column: `{item['scenario']['column']}`",
  821 + f"- Loaded rows: `{item['dataset']['rows_loaded']}`",
  822 + f"- Load time: `{item['runtime_defaults']['load_seconds']} s`",
  823 + f"- Device: `{item['runtime_defaults']['device']}`",
  824 + f"- DType: `{item['runtime_defaults']['torch_dtype']}`",
  825 + f"- Cache disabled: `{item['config']['cache_disabled']}`",
  826 + "",
  827 + ]
  828 + )
  829 + lines.extend(render_case_table("Batch Sweep (`concurrency=1`)", item["batch_sweep"], include_batch=True, include_concurrency=False))
  830 + lines.extend(
  831 + render_case_table(
  832 + f"Concurrency Sweep (`batch_size={item['config']['concurrency_batch_size']}`)",
  833 + item["concurrency_sweep"],
  834 + include_batch=False,
  835 + include_concurrency=True,
  836 + )
  837 + )
  838 + lines.extend(render_case_table("Batch x Concurrency Matrix", item["matrix"], include_batch=True, include_concurrency=True))
  839 + return "\n".join(lines)
  840 +
  841 +
  842 +def render_markdown_report(report: Dict[str, Any]) -> str:
  843 + if report["suite"] == "extended":
  844 + return render_extended_markdown_report(report)
  845 + return render_baseline_markdown_report(report)
  846 +
  847 +
387 848 def main() -> None:
388 849 args = parse_args()
389 850 if args.single:
390   - result = benchmark_single_scenario(args)
  851 + if args.suite == "extended":
  852 + result = benchmark_extended_scenario(args)
  853 + else:
  854 + result = benchmark_single_scenario(args)
391 855 print("JSON_RESULT=" + json.dumps(result, ensure_ascii=False))
392 856 return
393 857  
394 858 report = run_all_scenarios(args)
395 859 output_dir = resolve_output_dir(args.output_dir)
396 860 timestamp = datetime.now().strftime("%H%M%S")
397   - json_path = output_dir / f"translation_local_models_{timestamp}.json"
398   - md_path = output_dir / f"translation_local_models_{timestamp}.md"
  861 + suffix = "extended" if args.suite == "extended" else "baseline"
  862 + json_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.json"
  863 + md_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.md"
399 864 json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
400 865 md_path.write_text(render_markdown_report(report), encoding="utf-8")
401 866  
402 867 print(f"JSON report: {json_path}")
403 868 print(f"Markdown report: {md_path}")
404 869 for item in report["scenarios"]:
405   - runtime = item["runtime"]
406   - print(
407   - f"{item['scenario']['name']}: "
408   - f"{runtime['items_per_second']} items/s | "
409   - f"avg_item={runtime['avg_item_latency_ms']} ms | "
410   - f"p95_batch={runtime['batch_latency_p95_ms']} ms | "
411   - f"load={runtime['load_seconds']} s"
412   - )
  870 + if args.suite == "extended":
  871 + best_batch = max(item["batch_sweep"], key=lambda x: x["items_per_second"])
  872 + best_concurrency = max(item["concurrency_sweep"], key=lambda x: x["items_per_second"])
  873 + print(
  874 + f"{item['scenario']['name']}: "
  875 + f"best_batch={best_batch['batch_size']} ({best_batch['items_per_second']} items/s) | "
  876 + f"best_concurrency={best_concurrency['concurrency']} ({best_concurrency['items_per_second']} items/s @ batch={best_concurrency['batch_size']})"
  877 + )
  878 + else:
  879 + runtime = item["runtime"]
  880 + print(
  881 + f"{item['scenario']['name']}: "
  882 + f"{runtime['items_per_second']} items/s | "
  883 + f"avg_item={runtime['avg_item_latency_ms']} ms | "
  884 + f"p95_req={runtime['request_latency_p95_ms']} ms | "
  885 + f"load={runtime['load_seconds']} s"
  886 + )
413 887  
414 888  
415 889 if __name__ == "__main__":
... ...
translation/README.md
... ... @@ -13,7 +13,7 @@
13 13 - Virtual env: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh)
14 14 - Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py)
15 15 - Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
16   -- Performance report: [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md)
  16 +- Performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
17 17  
18 18 ## 1. 设计目标
19 19  
... ... @@ -530,6 +530,44 @@ curl -X POST http://127.0.0.1:6006/translate \
530 530 Dataset:
531 531 - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
532 532  
  533 +Latest reports:
  534 +- Summary: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
  535 +- Full Markdown: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
  536 +- Full JSON: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
  537 +
  538 +### 10.1 Which results to read first
  539 +
  540 +The results are split into three groups instead of being mixed into one table:
  541 +
  542 +- `batch_sweep`
  543 +  Fixed `concurrency=1`; compares single-stream batch performance across `batch_size` values
  544 +- `concurrency_sweep`
  545 +  Fixed `batch_size=1`; shows single-request latency and throughput at different concurrency levels
  546 +- `batch x concurrency matrix`
  547 +  Shows the interaction of `batch_size` and `concurrency`; this round is capped at `batch_size * concurrency <= 128`
  548 +
  549 +Recommendations:
  550 +
  551 +- For online query translation latency, start with `concurrency_sweep`
  552 +- For offline bulk translation throughput, start with `batch_sweep`
  553 +- For the capacity envelope of a single worker, then check the `batch x concurrency matrix`
  554 +
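Once a run finishes, the three groups land in the `batch_sweep`, `concurrency_sweep`, and `matrix` arrays of each scenario entry in the JSON report. A minimal sketch of pulling the headline numbers out of one entry; the structure is assumed from the benchmark script in this commit, the `batch_sweep` numbers are copied from the zh -> en table in this section, and the `concurrency_sweep` row is illustrative:

```python
# One scenario entry of the extended JSON report, abridged.
scenario = {
    "scenario": {"name": "nllb-200-distilled-600m zh->en"},
    "batch_sweep": [
        {"batch_size": 1, "items_per_second": 2.91},
        {"batch_size": 64, "items_per_second": 58.3},
    ],
    "concurrency_sweep": [
        {"concurrency": 1, "request_latency_p95_ms": 616.27},
    ],
}

# Offline throughput: best batch size in the single-stream sweep.
best = max(scenario["batch_sweep"], key=lambda row: row["items_per_second"])
# Online latency: the concurrency=1 row is the single-request baseline.
single = next(row for row in scenario["concurrency_sweep"] if row["concurrency"] == 1)

print(f'best batch_size={best["batch_size"]} ({best["items_per_second"]} items/s)')
print(f'single-request p95={single["request_latency_p95_ms"]} ms')
```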
  555 +### 10.2 Parameters for this follow-up run
  556 +
  557 +Test date: `2026-03-18`
  558 +
  559 +Environment:
  560 +- GPU: `Tesla T4 16GB`
  561 +- Python env: `.venv-translator`
  562 +- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
  563 +
  564 +Common parameters:
  565 +- cache: disabled (`--disable-cache`) so cache hits do not skew the numbers
  566 +- `batch_sweep`: `256` items per step
  567 +- `concurrency_sweep`: fixed `batch_size=1`, `32` requests per step
  568 +- `batch x concurrency matrix`: `32` requests per step, keeping only `batch_size * concurrency <= 128`
  569 +- warmup: `1` batch
  570 +
533 571 Reproduction command:
534 572  
535 573 ```bash
... ... @@ -537,16 +575,36 @@ cd /data/saas-search
537 575 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py
538 576 ```
539 577  
540   -Single-model reproduction example:
  578 +Reproduction command for this round's extended benchmark:
  579 +
  580 +```bash
  581 +cd /data/saas-search
  582 +./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  583 + --suite extended \
  584 + --disable-cache \
  585 + --serial-items-per-case 256 \
  586 + --concurrency-requests-per-case 32 \
  587 + --concurrency-batch-size 1 \
  588 + --output-dir perf_reports/20260318/translation_local_models
  589 +```
  590 +
  591 +Single-model extended benchmark example:
541 592  
542 593 ```bash
543 594 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
544 595 --single \
  596 + --suite extended \
545 597 --model opus-mt-zh-en \
546 598 --source-lang zh \
547 599 --target-lang en \
548 600 --column title_cn \
549   - --scene sku_name
  601 + --scene sku_name \
  602 + --disable-cache \
  603 + --batch-size-list 1,4,8,16,32,64 \
  604 + --concurrency-list 1,2,4,8,16,64 \
  605 + --serial-items-per-case 256 \
  606 + --concurrency-requests-per-case 32 \
  607 + --concurrency-batch-size 1
550 608 ```
551 609  
552 610 Single-request latency reproduction:
... ... @@ -554,37 +612,143 @@ cd /data/saas-search
554 612 ```bash
555 613 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
556 614 --single \
  615 + --suite extended \
557 616 --model nllb-200-distilled-600m \
558 617 --source-lang zh \
559 618 --target-lang en \
560 619 --column title_cn \
561 620 --scene sku_name \
562   - --batch-size 1 \
563   - --limit 100
  621 + --disable-cache \
  622 + --batch-size-list 1 \
  623 + --concurrency-list 1,2,4,8,16,64 \
  624 + --serial-items-per-case 256 \
  625 + --concurrency-requests-per-case 32 \
  626 + --concurrency-batch-size 1
564 627 ```
565 628  
566   -Notes:
567   -- For the current script and local backend, a "single request" is exactly equivalent to `batch_size=1`
568   -- In that case the script's `batch_latency_*` metrics can be read directly as single-request latency
569   -- Online search query translation should focus on these numbers rather than large-batch throughput
  629 +### 10.3 Single-Stream Batch Results
  630 +
  631 +This group covers `concurrency=1` only; do not read the `request p95` here as the p95 of concurrent online requests.
  632 +
  633 +`nllb-200-distilled-600m zh -> en`
  634 +
  635 +| Batch | Items/s | Avg item ms | Req p95 ms |
  636 +|---:|---:|---:|---:|
  637 +| 1 | 2.91 | 343.488 | 616.27 |
  638 +| 4 | 8.44 | 118.545 | 722.95 |
  639 +| 8 | 14.85 | 67.335 | 728.47 |
  640 +| 16 | 27.28 | 36.662 | 769.18 |
  641 +| 32 | 38.6 | 25.908 | 1369.88 |
  642 +| 64 | 58.3 | 17.152 | 1659.9 |
  643 +
  644 +`nllb-200-distilled-600m en -> zh`
  645 +
  646 +| Batch | Items/s | Avg item ms | Req p95 ms |
  647 +|---:|---:|---:|---:|
  648 +| 1 | 1.91 | 524.917 | 866.33 |
  649 +| 4 | 4.94 | 202.473 | 1599.74 |
  650 +| 8 | 8.25 | 121.188 | 1632.29 |
  651 +| 16 | 13.52 | 73.956 | 1649.65 |
  652 +| 32 | 21.27 | 47.017 | 1827.16 |
  653 +| 64 | 32.64 | 30.641 | 2031.25 |
570 654  
571   -Measured single-request latency (`Tesla T4`, `limit=100`):
572   -- `nllb-200-distilled-600m zh->en`: p50 ~`292.54 ms`, p95 ~`624.12 ms`, mean ~`321.91 ms`
573   -- `nllb-200-distilled-600m en->zh`: p50 ~`481.61 ms`, p95 ~`1171.71 ms`, mean ~`542.47 ms`
  655 +`opus-mt-zh-en zh -> en`
574 656  
575   -Benchmark environment:
576   -- GPU: `Tesla T4 16GB`
577   -- Python env: `.venv-translator`
578   -- Data volume: `18,576` product titles
  657 +| Batch | Items/s | Avg item ms | Req p95 ms |
  658 +|---:|---:|---:|---:|
  659 +| 1 | 6.15 | 162.536 | 274.74 |
  660 +| 4 | 15.34 | 65.192 | 356.0 |
  661 +| 8 | 25.51 | 39.202 | 379.84 |
  662 +| 16 | 41.44 | 24.129 | 797.93 |
  663 +| 32 | 54.36 | 18.397 | 1693.31 |
  664 +| 64 | 70.15 | 14.255 | 2161.59 |
  665 +
  666 +`opus-mt-en-zh en -> zh`
  667 +
  668 +| Batch | Items/s | Avg item ms | Req p95 ms |
  669 +|---:|---:|---:|---:|
  670 +| 1 | 4.53 | 220.598 | 411.57 |
  671 +| 4 | 10.12 | 98.844 | 761.49 |
  672 +| 8 | 14.63 | 68.361 | 1930.85 |
  673 +| 16 | 24.33 | 41.1 | 2098.54 |
  674 +| 32 | 33.91 | 29.487 | 2152.28 |
  675 +| 64 | 42.47 | 23.547 | 2371.85 |
  676 +
  677 +Batch conclusions:
  678 +
  679 +- On raw throughput alone, all 4 directions peak at `batch_size=64`
  680 +- If per-batch tail latency also matters, `batch_size=16` is usually the better balance
  681 +- `opus-mt-zh-en` is the fastest model in this round's bulk scenario; `nllb en->zh` is the slowest
  682 +
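One way to turn a batch_sweep table into a default setting is to pick the largest batch size whose per-batch p95 still fits a latency budget. A minimal sketch, using the measured `nllb-200-distilled-600m zh -> en` numbers copied from the table above; the helper name and the budget values are made up for illustration:

```python
# batch_size -> (items_per_s, req_p95_ms), from the batch_sweep table above.
BATCH_SWEEP = {
    1: (2.91, 616.27),
    4: (8.44, 722.95),
    8: (14.85, 728.47),
    16: (27.28, 769.18),
    32: (38.60, 1369.88),
    64: (58.30, 1659.90),
}

def best_batch_under_p95(budget_ms: float) -> int:
    """Largest batch size whose per-batch p95 stays within the budget."""
    candidates = [b for b, (_, p95) in BATCH_SWEEP.items() if p95 <= budget_ms]
    return max(candidates) if candidates else 1

print(best_batch_under_p95(1000.0))  # -> 16, matching the "balanced default" above
```

With a `1000 ms` per-batch budget this lands on `batch_size=16`, which is consistent with the recommendation above; relaxing the budget to `2000 ms` lands on `64`.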
  683 +### 10.4 Single-Request Concurrency Results
  684 +
  685 +This group fixes `batch_size=1`, so it reads directly as "single-request behavior under different concurrency levels".
  686 +
  687 +`nllb-200-distilled-600m zh -> en`
  688 +
  689 +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
  690 +|---:|---:|---:|---:|---:|
  691 +| 1 | 4.17 | 239.99 | 226.34 | 373.27 |
  692 +| 2 | 4.1 | 477.99 | 459.36 | 703.96 |
  693 +| 4 | 4.1 | 910.74 | 884.71 | 1227.01 |
  694 +| 8 | 4.04 | 1697.73 | 1818.48 | 2383.8 |
  695 +| 16 | 4.07 | 2801.91 | 3473.63 | 4145.92 |
  696 +| 64 | 4.04 | 3714.49 | 3610.08 | 7337.3 |
  697 +
  698 +`nllb-200-distilled-600m en -> zh`
  699 +
  700 +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
  701 +|---:|---:|---:|---:|---:|
  702 +| 1 | 2.16 | 463.18 | 439.54 | 670.78 |
  703 +| 2 | 2.15 | 920.48 | 908.27 | 1213.3 |
  704 +| 4 | 2.16 | 1759.87 | 1771.58 | 2158.04 |
  705 +| 8 | 2.15 | 3284.44 | 3658.45 | 3971.01 |
  706 +| 16 | 2.14 | 5669.15 | 7117.7 | 7522.48 |
  707 +| 64 | 2.14 | 7631.14 | 7510.97 | 14139.03 |
  708 +
  709 +`opus-mt-zh-en zh -> en`
  710 +
  711 +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
  712 +|---:|---:|---:|---:|---:|
  713 +| 1 | 9.21 | 108.53 | 91.7 | 179.12 |
  714 +| 2 | 8.92 | 219.19 | 212.29 | 305.34 |
  715 +| 4 | 9.09 | 411.76 | 420.08 | 583.97 |
  716 +| 8 | 8.85 | 784.14 | 835.73 | 1043.06 |
  717 +| 16 | 9.01 | 1278.4 | 1483.34 | 1994.56 |
  718 +| 64 | 8.82 | 1687.08 | 1563.48 | 3381.58 |
  719 +
  720 +`opus-mt-en-zh en -> zh`
  721 +
  722 +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
  723 +|---:|---:|---:|---:|---:|
  724 +| 1 | 3.6 | 277.73 | 145.85 | 1180.37 |
  725 +| 2 | 3.55 | 559.38 | 346.71 | 1916.96 |
  726 +| 4 | 3.53 | 997.71 | 721.04 | 2944.17 |
  727 +| 8 | 3.51 | 1644.28 | 1590.93 | 3632.99 |
  728 +| 16 | 3.5 | 2600.18 | 2586.34 | 5554.04 |
  729 +| 64 | 3.52 | 3366.52 | 2780.0 | 7950.41 |
579 730  
580   -Final performance results:
  731 +Concurrency conclusions:
  732 +
  733 +- The local seq2seq backend holds a single per-model lock, so raising client concurrency on one worker barely improves throughput; it mostly turns queueing time into higher request latency
  734 +- If online query translation needs stable latency, keep concurrency low; beyond `8` concurrent requests, p95 degrades noticeably in all 4 directions
  735 +- In the online scenario, `opus-mt-zh-en` has the most stable latency; `nllb en->zh` is the slowest and shows the strongest tail-latency amplification under concurrency
  736 +
  737 +### 10.5 How to Read the batch x concurrency Matrix
  738 +
  739 +Full matrix:
  740 +- [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
  741 +
  742 +This table mainly answers two questions:
581 743  
582   -| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
583   -|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
584   -| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
585   -| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
586   -| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
587   -| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
  744 +- If you already know you are running offline batch jobs: once `batch_size` is large, does throughput keep climbing at different concurrency levels
  745 +- If you plan to serve requests with a single worker: at which `batch_size x concurrency` combination does queueing become obvious
  746 +
  747 +Common patterns across this round's matrix:
  748 +
  749 +- Throughput is determined mainly by `batch_size`; `concurrency` is not a meaningful source of gains
  750 +- With `batch_size` fixed, raising `concurrency` from `1` to `2/4/8/...` barely changes `items/s`, while `avg req ms / p95` keeps climbing
  751 +- The current implementation therefore behaves like a single worker with internally serialized GPU inference, not a service whose throughput scales with client concurrency
588 752  
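Given that serialization, a matrix cell can be roughly predicted from the batch_sweep alone: throughput stays near the single-stream `items/s` for that batch size, and request latency is roughly `concurrency x` per-batch time. A sketch using the `nllb zh -> en` batch_sweep values; this is a sanity model only, the measured matrix in the report is authoritative:

```python
# batch_size -> single-stream items/s, from the nllb zh -> en batch_sweep table.
NLLB_ZH_EN_ITEMS_PER_S = {1: 2.91, 4: 8.44, 8: 14.85, 16: 27.28, 32: 38.60, 64: 58.30}

def estimate_cell(batch_size: int, concurrency: int):
    """Predict (items/s, request latency ms) for a matrix cell under serialization."""
    items_per_s = NLLB_ZH_EN_ITEMS_PER_S[batch_size]
    batch_seconds = batch_size / items_per_s        # time the GPU holds one batch
    est_req_ms = concurrency * batch_seconds * 1000.0  # queueing inflates latency
    return items_per_s, est_req_ms

tput, req_ms = estimate_cell(8, 4)
print(tput, round(req_ms, 1))
```

Large deviations between this estimate and a measured cell are the interesting part: they indicate where the backend stops behaving like a simple serialized queue.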
589 753 NLLB performance-optimization experience:
590 754  
... ... @@ -632,7 +796,7 @@ NLLB 性能优化经验:
632 796 - Run mode: single worker, to avoid loading the model multiple times
633 797  
634 798 More detailed performance notes:
635   -- [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md)
  799 +- [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
636 800  
637 801 ## 11. Development Notes
638 802  
... ...