Commit 2a6d9d76f52556f65e4c3291fc402660f6a21817

Authored by tangwang
1 parent cd4ce66d

Updated the benchmark script and docs so that "single request / single-stream batch / concurrency / batch×concurrency matrix" results are presented fully separately.

The changes touch:

- scripts/benchmark_translation_local_models.py: added `--suite extended`, supporting batch_size=1,4,8,16,32,64, concurrency=1,2,4,8,16,64, and the combination matrix constrained by batch_size * concurrency <= 128. Single-scenario mode now loads only the target model, so load_seconds is cleaner, and `--disable-cache` is supported.
- translation/README.md: split the performance section into batch_sweep, concurrency_sweep, and batch x concurrency matrix, and added the parameters, reproduction commands, and summary tables for this re-run.
- perf_reports/20260318/translation_local_models/README.md: added the summary for this follow-up round. Full results are in translation_local_models_extended_221846.md and translation_local_models_extended_221846.json.

The core conclusions of this round are clear:

- For online single requests, read concurrency_sweep, i.e. the table with batch_size=1 held fixed.
- For offline bulk throughput, read batch_sweep. The highest raw throughput in all 4 directions occurs at batch_size=64, but batch_size=16 still looks like the more balanced default.
- The current local seq2seq backend holds a per-model lock, so raising client concurrency barely increases throughput; it mainly converts queueing time into higher p95. Concurrency is therefore a latency-budget question, not a throughput-scaling lever.
- The fastest direction for online single requests this round is opus-mt-zh-en; the slowest, and the one where concurrency amplifies latency most, is nllb-200-distilled-600m en->zh.
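The extended suite described above can be sketched as follows. The default lists and the `batch_size * concurrency <= 128` constraint come from the commit message; the helper name is illustrative, not the committed implementation.

```python
from itertools import product

DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64]
DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64]

def matrix_cases(batch_sizes, concurrencies, max_product=128):
    """Enumerate (batch_size, concurrency) matrix cases, skipping any
    combination whose product exceeds the configured limit."""
    cases = []
    for batch_size, concurrency in product(batch_sizes, concurrencies):
        if max_product and batch_size * concurrency > max_product:
            continue  # over the batch_size * concurrency budget
        cases.append((batch_size, concurrency))
    return cases
```

With the defaults this keeps combinations like (64, 2) but drops (64, 4), which is why the matrix tables below stop at concurrency 2 for batch 64.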
 
 
-product_enrich : Partial Mode
+
+
+nllb-200-distilled-600M performance optimization
+Research performance-optimization options for seq2seq, transformer-architecture models like nllb-200-distilled-600M: ways to raise the throughput of the online translation service and cut latency; also look into online inference serving solutions and find a high-performance way to run it as a service
+
+cnclip performance optimization
+
+rerank performance optimization
+
+
+Timeouts
+Hard timeout while the query-analysis stage waits for translation/embedding
+Config file: config/config.yaml
+Config key: query_config.async_wait_timeout_ms: 80
+Effective in code: query/query_parser.py converts this value to seconds and passes it to wait(...)
+2) Embedding HTTP call timeout (Text/Image)
+No environment-variable override is used any more (the previously mentioned EMBEDDING_HTTP_TIMEOUT_SEC is no longer honored)
+Config file: config/config.yaml
+Config key: services.embedding.providers.http.timeout_sec (a sample default of 60 has been added to the YAML)
+Effective in code:
+embeddings/text_encoder.py: requests.post(..., timeout=self.timeout_sec)
+embeddings/image_encoder.py: requests.post(..., timeout=self.timeout_sec)
+
+
+
+
+product_enrich : Partial Mode : done
 https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR
 In the messages array, set the role of the last message to assistant, provide the prefix in its content, and set "partial": true on that message. The messages format is:
 [
@@ -15,7 +41,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-men
 }
 ]
 The model starts generating from the prefix content.
-
 Non-thinking mode is supported.
 
 
@@ -41,12 +66,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-men
 
 
 
-The translation cache needs refactoring
-
-
-
-
-
 
 suggest index: currently a full-rebuild script; to be handed over to 金伟
 
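The two timeout knobs in the notes above can be read as small config lookups. The helper names here are hypothetical; only the config keys and the ms-to-seconds conversion come from the notes.

```python
def async_wait_timeout_seconds(config) -> float:
    # query_config.async_wait_timeout_ms (e.g. 80) is converted to seconds
    # before being handed to wait(...) in query/query_parser.py.
    return config["query_config"]["async_wait_timeout_ms"] / 1000.0

def embedding_http_timeout_seconds(config) -> float:
    # services.embedding.providers.http.timeout_sec, with the YAML sample
    # default of 60; passed as requests.post(..., timeout=...) in the encoders.
    return float(config["services"]["embedding"]["providers"]["http"].get("timeout_sec", 60))
```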
perf_reports/20260318/translation_local_models/README.md 0 → 100644
@@ -0,0 +1,101 @@
# Local Translation Model Benchmark Report

Benchmark script:
- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)

Full results:
- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)

Run date:
- `2026-03-18`

Environment:
- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)

## Method

This round splits the results into 3 categories:

- `batch_sweep`: fixed `concurrency=1`, comparing `batch_size=1/4/8/16/32/64`
- `concurrency_sweep`: fixed `batch_size=1`, comparing `concurrency=1/2/4/8/16/64`
- `batch x concurrency matrix`: combined runs, keeping only `batch_size * concurrency <= 128`

Common settings:
- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per case
- `concurrency_sweep`: `32` requests per case
- `matrix`: `32` requests per case
- Warmup: `1` batch

Reproduction command:

```bash
cd /data/saas-search
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --suite extended \
  --disable-cache \
  --serial-items-per-case 256 \
  --concurrency-requests-per-case 32 \
  --concurrency-batch-size 1 \
  --output-dir perf_reports/20260318/translation_local_models
```

## Key Results

### 1. Single-stream batch sweep

| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |

Interpretation:
- On raw throughput, all 4 directions peak at `batch_size=64`
- If balanced per-batch latency also matters, `batch_size=16` is the better default bulk candidate
- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`

### 2. Single-request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |

Interpretation:
- At `batch_size=1`, raising client concurrency barely raises throughput; it mostly converts waiting time into request latency
- Online query translation suits low concurrency; beyond `8` concurrent clients, p95 degrades clearly in all 4 directions
- The most stable model online is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh`

### 3. batch x concurrency matrix

Highest-throughput combinations (under the `batch_size * concurrency <= 128` constraint):

| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
|---|---|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |

Interpretation:
- In the current implementation, throughput is driven by `batch_size`, not by `concurrency`
- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly raises request latency
- This indicates the local translation service behaves like a "single worker + serial GPU processing" model; capacity planning cannot rely on client concurrency to buy throughput

## Recommendation

- For online query translation, read `concurrency_sweep` and treat `batch_size=1` as the primary metric baseline
- For offline bulk translation, read `batch_sweep`; start from `batch_size=16` and move up to `32/64` depending on the throughput target
- As long as the current single-worker architecture stays, treat the allowed concurrency as a latency-budget decision, not a throughput-scaling lever
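The `Req p50 ms` / `Req p95 ms` columns quoted throughout imply a percentile over per-request latencies. A sketch of one common nearest-rank definition; the benchmark script's exact formula may differ.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile over measured request latencies (ms)."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Rank of the p-th percentile, clamped to at least the first sample.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```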
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs1.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs16.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs32.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs4.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs64.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs8.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs1.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs16.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs32.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs4.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs64.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs8.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md 0 → 100644
@@ -0,0 +1,263 @@
# Local Translation Model Extended Benchmark

- Generated at: `2026-03-18T21:28:09`
- Suite: `extended`
- Python: `3.12.3`
- Torch: `2.10.0+cu128`
- Transformers: `5.3.0`
- CUDA: `True`
- GPU: `Tesla T4` (15.56 GiB)

## Reading Guide

- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.
- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.
- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.

## nllb-200-distilled-600m zh->en

- Direction: `zh -> en`
- Column: `title_cn`
- Loaded rows: `2048`
- Load time: `6.118 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 2.91 | 2.91 | 343.48 | 315.98 | 616.27 | 2.633 |
| 4 | 256 | 64 | 8.44 | 2.11 | 474.17 | 474.75 | 722.95 | 2.659 |
| 8 | 256 | 32 | 14.85 | 1.86 | 538.68 | 564.97 | 728.47 | 2.699 |
| 16 | 256 | 16 | 27.28 | 1.7 | 586.59 | 633.16 | 769.18 | 2.765 |
| 32 | 256 | 8 | 38.6 | 1.21 | 829.04 | 761.74 | 1369.88 | 2.987 |
| 64 | 256 | 4 | 58.3 | 0.91 | 1097.73 | 919.35 | 1659.9 | 3.347 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 4.17 | 4.17 | 239.99 | 226.34 | 373.27 | 3.347 |
| 2 | 32 | 32 | 4.1 | 4.1 | 477.99 | 459.36 | 703.96 | 3.347 |
| 4 | 32 | 32 | 4.1 | 4.1 | 910.74 | 884.71 | 1227.01 | 3.347 |
| 8 | 32 | 32 | 4.04 | 4.04 | 1697.73 | 1818.48 | 2383.8 | 3.347 |
| 16 | 32 | 32 | 4.07 | 4.07 | 2801.91 | 3473.63 | 4145.92 | 3.347 |
| 64 | 32 | 32 | 4.04 | 4.04 | 3714.49 | 3610.08 | 7337.3 | 3.347 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 3.94 | 3.94 | 253.96 | 231.8 | 378.76 | 3.347 |
| 1 | 2 | 32 | 32 | 3.92 | 3.92 | 500.35 | 480.18 | 747.3 | 3.347 |
| 1 | 4 | 32 | 32 | 4.05 | 4.05 | 922.53 | 894.57 | 1235.7 | 3.347 |
| 1 | 8 | 32 | 32 | 4.08 | 4.08 | 1679.71 | 1806.95 | 2346.11 | 3.347 |
| 1 | 16 | 32 | 32 | 4.05 | 4.05 | 2816.51 | 3485.68 | 4181.39 | 3.347 |
| 1 | 64 | 32 | 32 | 4.06 | 4.06 | 3711.9 | 3614.34 | 7319.52 | 3.347 |
| 4 | 1 | 128 | 32 | 9.12 | 2.28 | 438.42 | 386.12 | 724.47 | 3.347 |
| 4 | 2 | 128 | 32 | 9.2 | 2.3 | 852.37 | 756.47 | 1366.7 | 3.347 |
| 4 | 4 | 128 | 32 | 9.17 | 2.29 | 1642.9 | 1524.56 | 2662.49 | 3.347 |
| 4 | 8 | 128 | 32 | 9.18 | 2.29 | 2974.7 | 3168.19 | 4594.99 | 3.347 |
| 4 | 16 | 128 | 32 | 9.23 | 2.31 | 4897.43 | 5815.91 | 8210.23 | 3.347 |
| 8 | 1 | 256 | 32 | 14.73 | 1.84 | 543.03 | 556.97 | 734.67 | 3.347 |
| 8 | 2 | 256 | 32 | 14.73 | 1.84 | 1064.16 | 1132.78 | 1405.53 | 3.347 |
| 8 | 4 | 256 | 32 | 14.79 | 1.85 | 2035.46 | 2237.69 | 2595.53 | 3.347 |
| 8 | 8 | 256 | 32 | 14.76 | 1.84 | 3763.14 | 4345.6 | 5025.02 | 3.347 |
| 8 | 16 | 256 | 32 | 14.77 | 1.85 | 6383.08 | 8053.36 | 9380.87 | 3.347 |
| 16 | 1 | 512 | 32 | 24.88 | 1.55 | 643.11 | 661.91 | 814.26 | 3.347 |
| 16 | 2 | 512 | 32 | 24.91 | 1.56 | 1265.17 | 1326.85 | 1576.96 | 3.347 |
| 16 | 4 | 512 | 32 | 24.94 | 1.56 | 2446.55 | 2594.26 | 3027.37 | 3.347 |
| 16 | 8 | 512 | 32 | 24.8 | 1.55 | 4548.92 | 5035.56 | 5852.87 | 3.347 |
| 32 | 1 | 1024 | 32 | 34.9 | 1.09 | 916.78 | 823.68 | 1637.59 | 3.347 |
| 32 | 2 | 1024 | 32 | 34.98 | 1.09 | 1807.39 | 1717.89 | 2514.27 | 3.347 |
| 32 | 4 | 1024 | 32 | 34.85 | 1.09 | 3513.42 | 3779.99 | 4750.34 | 3.347 |
| 64 | 1 | 2048 | 32 | 53.24 | 0.83 | 1202.07 | 993.59 | 1838.72 | 3.642 |
| 64 | 2 | 2048 | 32 | 53.95 | 0.84 | 2344.92 | 2465.53 | 3562.04 | 3.642 |

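For quick reading of the batch sweep above: single-stream throughput grows from 2.91 items/s at batch 1 to 58.3 at batch 64, roughly a 20x speedup. A small helper (illustrative, not part of the committed script) normalizes a sweep against its batch=1 baseline:

```python
def batch_scaling(items_per_s_by_batch):
    """Return throughput relative to the batch_size=1 baseline,
    e.g. {1: 1.0, 16: 9.37, 64: 20.03} for the zh->en sweep above."""
    base = items_per_s_by_batch[1]
    return {batch: round(value / base, 2) for batch, value in items_per_s_by_batch.items()}
```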
## nllb-200-distilled-600m en->zh

- Direction: `en -> zh`
- Column: `title`
- Loaded rows: `2048`
- Load time: `6.137 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 1.91 | 1.91 | 524.91 | 497.81 | 866.33 | 2.636 |
| 4 | 256 | 64 | 4.94 | 1.23 | 809.89 | 743.87 | 1599.74 | 2.684 |
| 8 | 256 | 32 | 8.25 | 1.03 | 969.5 | 796.64 | 1632.29 | 2.723 |
| 16 | 256 | 16 | 13.52 | 0.85 | 1183.29 | 1169.28 | 1649.65 | 2.817 |
| 32 | 256 | 8 | 21.27 | 0.66 | 1504.54 | 1647.92 | 1827.16 | 3.007 |
| 64 | 256 | 4 | 32.64 | 0.51 | 1961.02 | 1985.8 | 2031.25 | 3.408 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 2.16 | 2.16 | 463.18 | 439.54 | 670.78 | 3.408 |
| 2 | 32 | 32 | 2.15 | 2.15 | 920.48 | 908.27 | 1213.3 | 3.408 |
| 4 | 32 | 32 | 2.16 | 2.16 | 1759.87 | 1771.58 | 2158.04 | 3.408 |
| 8 | 32 | 32 | 2.15 | 2.15 | 3284.44 | 3658.45 | 3971.01 | 3.408 |
| 16 | 32 | 32 | 2.14 | 2.14 | 5669.15 | 7117.7 | 7522.48 | 3.408 |
| 64 | 32 | 32 | 2.14 | 2.14 | 7631.14 | 7510.97 | 14139.03 | 3.408 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 2.17 | 2.17 | 461.18 | 439.28 | 673.59 | 3.408 |
| 1 | 2 | 32 | 32 | 2.15 | 2.15 | 919.12 | 904.83 | 1209.83 | 3.408 |
| 1 | 4 | 32 | 32 | 2.14 | 2.14 | 1771.34 | 1810.99 | 2171.71 | 3.408 |
| 1 | 8 | 32 | 32 | 2.12 | 2.12 | 3324.85 | 3704.57 | 3999.29 | 3.408 |
| 1 | 16 | 32 | 32 | 2.15 | 2.15 | 5651.09 | 7121.8 | 7495.67 | 3.408 |
| 1 | 64 | 32 | 32 | 2.15 | 2.15 | 7619.16 | 7505.14 | 14111.53 | 3.408 |
| 4 | 1 | 128 | 32 | 5.13 | 1.28 | 780.0 | 661.74 | 1586.9 | 3.408 |
| 4 | 2 | 128 | 32 | 5.09 | 1.27 | 1551.81 | 1368.81 | 2775.92 | 3.408 |
| 4 | 4 | 128 | 32 | 5.13 | 1.28 | 2995.57 | 2836.77 | 4532.67 | 3.408 |
| 4 | 8 | 128 | 32 | 5.16 | 1.29 | 5619.47 | 6046.81 | 7767.22 | 3.408 |
| 4 | 16 | 128 | 32 | 5.15 | 1.29 | 9546.3 | 12146.68 | 14537.41 | 3.408 |
| 8 | 1 | 256 | 32 | 8.36 | 1.04 | 957.45 | 784.85 | 1591.01 | 3.408 |
| 8 | 2 | 256 | 32 | 8.29 | 1.04 | 1906.56 | 1847.53 | 2799.1 | 3.408 |
| 8 | 4 | 256 | 32 | 8.3 | 1.04 | 3691.59 | 3773.67 | 4849.58 | 3.408 |
| 8 | 8 | 256 | 32 | 8.29 | 1.04 | 6954.22 | 7681.7 | 8904.0 | 3.408 |
| 8 | 16 | 256 | 32 | 8.31 | 1.04 | 11923.83 | 14903.17 | 17072.19 | 3.408 |
| 16 | 1 | 512 | 32 | 13.83 | 0.86 | 1156.84 | 904.4 | 1609.17 | 3.408 |
| 16 | 2 | 512 | 32 | 13.91 | 0.87 | 2276.35 | 2356.15 | 3121.45 | 3.408 |
| 16 | 4 | 512 | 32 | 13.89 | 0.87 | 4440.77 | 4733.34 | 5488.65 | 3.408 |
| 16 | 8 | 512 | 32 | 13.83 | 0.86 | 8428.17 | 9407.5 | 10409.59 | 3.408 |
| 32 | 1 | 1024 | 32 | 23.28 | 0.73 | 1374.39 | 1603.35 | 1806.99 | 3.408 |
| 32 | 2 | 1024 | 32 | 23.26 | 0.73 | 2721.28 | 2607.91 | 3437.08 | 3.408 |
| 32 | 4 | 1024 | 32 | 23.18 | 0.72 | 5297.48 | 5234.53 | 6758.76 | 3.408 |
| 64 | 1 | 2048 | 32 | 34.97 | 0.55 | 1829.91 | 1899.54 | 2039.18 | 3.69 |
| 64 | 2 | 2048 | 32 | 34.78 | 0.54 | 3615.7 | 3795.36 | 4014.19 | 3.69 |

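The concurrency sweep above fits a simple serial-worker model: with one GPU worker, `c` concurrent clients mostly queue, so mean request latency grows roughly linearly in `c` (for this direction, 463 ms x 8 ≈ 3.7 s predicted vs the measured 3.28 s average at c=8). A back-of-envelope sketch of that prediction, not a claim about the script:

```python
def expected_latency_ms(single_request_ms: float, concurrency: int) -> float:
    """Rough queueing estimate under one serial worker: each request waits
    for roughly (concurrency) service times on average."""
    return single_request_ms * concurrency
```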
## opus-mt-zh-en zh->en

- Direction: `zh -> en`
- Column: `title_cn`
- Loaded rows: `2048`
- Load time: `3.2561 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 6.15 | 6.15 | 162.53 | 145.96 | 274.74 | 0.285 |
| 4 | 256 | 64 | 15.34 | 3.83 | 260.77 | 231.11 | 356.0 | 0.306 |
| 8 | 256 | 32 | 25.51 | 3.19 | 313.61 | 271.09 | 379.84 | 0.324 |
| 16 | 256 | 16 | 41.44 | 2.59 | 386.06 | 286.4 | 797.93 | 0.366 |
| 32 | 256 | 8 | 54.36 | 1.7 | 588.7 | 330.13 | 1693.31 | 0.461 |
| 64 | 256 | 4 | 70.15 | 1.1 | 912.33 | 447.93 | 2161.59 | 0.642 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 9.21 | 9.21 | 108.53 | 91.7 | 179.12 | 0.642 |
| 2 | 32 | 32 | 8.92 | 8.92 | 219.19 | 212.29 | 305.34 | 0.642 |
| 4 | 32 | 32 | 9.09 | 9.09 | 411.76 | 420.08 | 583.97 | 0.642 |
| 8 | 32 | 32 | 8.85 | 8.85 | 784.14 | 835.73 | 1043.06 | 0.642 |
| 16 | 32 | 32 | 9.01 | 9.01 | 1278.4 | 1483.34 | 1994.56 | 0.642 |
| 64 | 32 | 32 | 8.82 | 8.82 | 1687.08 | 1563.48 | 3381.58 | 0.642 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 9.01 | 9.01 | 110.96 | 97.37 | 191.91 | 0.642 |
| 1 | 2 | 32 | 32 | 9.2 | 9.2 | 212.43 | 191.27 | 299.65 | 0.642 |
| 1 | 4 | 32 | 32 | 8.86 | 8.86 | 423.8 | 423.61 | 596.78 | 0.642 |
| 1 | 8 | 32 | 32 | 9.04 | 9.04 | 764.39 | 810.05 | 1035.74 | 0.642 |
| 1 | 16 | 32 | 32 | 9.06 | 9.06 | 1272.52 | 1468.35 | 1998.99 | 0.642 |
| 1 | 64 | 32 | 32 | 9.04 | 9.04 | 1663.7 | 1556.13 | 3296.1 | 0.642 |
| 4 | 1 | 128 | 32 | 15.17 | 3.79 | 263.61 | 201.11 | 369.93 | 0.642 |
| 4 | 2 | 128 | 32 | 15.25 | 3.81 | 517.25 | 397.13 | 1319.45 | 0.642 |
| 4 | 4 | 128 | 32 | 15.27 | 3.82 | 899.61 | 746.76 | 1914.27 | 0.642 |
| 4 | 8 | 128 | 32 | 15.39 | 3.85 | 1539.87 | 1466.41 | 2918.85 | 0.642 |
| 4 | 16 | 128 | 32 | 15.48 | 3.87 | 2455.89 | 2728.17 | 4644.99 | 0.642 |
| 8 | 1 | 256 | 32 | 25.44 | 3.18 | 314.41 | 272.16 | 386.05 | 0.642 |
| 8 | 2 | 256 | 32 | 25.42 | 3.18 | 620.41 | 541.88 | 1418.79 | 0.642 |
| 8 | 4 | 256 | 32 | 24.96 | 3.12 | 1226.44 | 1070.41 | 2940.56 | 0.642 |
| 8 | 8 | 256 | 32 | 25.44 | 3.18 | 2252.81 | 2148.42 | 4143.66 | 0.642 |
| 8 | 16 | 256 | 32 | 25.44 | 3.18 | 3961.06 | 5028.31 | 6340.79 | 0.642 |
| 16 | 1 | 512 | 32 | 31.66 | 1.98 | 505.38 | 321.69 | 1963.58 | 0.653 |
| 16 | 2 | 512 | 32 | 31.41 | 1.96 | 1009.12 | 677.84 | 2341.01 | 0.653 |
| 16 | 4 | 512 | 32 | 31.17 | 1.95 | 2006.54 | 1562.68 | 4299.85 | 0.653 |
| 16 | 8 | 512 | 32 | 31.51 | 1.97 | 3483.96 | 4085.09 | 5902.32 | 0.653 |
| 32 | 1 | 1024 | 32 | 38.66 | 1.21 | 827.7 | 409.78 | 2162.48 | 0.748 |
| 32 | 2 | 1024 | 32 | 38.75 | 1.21 | 1613.28 | 916.1 | 3228.1 | 0.748 |
| 32 | 4 | 1024 | 32 | 38.66 | 1.21 | 3162.05 | 3239.36 | 4992.07 | 0.748 |
| 64 | 1 | 2048 | 32 | 52.44 | 0.82 | 1220.35 | 533.4 | 2508.12 | 0.939 |
| 64 | 2 | 2048 | 32 | 52.04 | 0.81 | 2443.4 | 2800.35 | 4765.45 | 0.939 |

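Each matrix section above reports many (batch, concurrency) rows, and the summary README quotes only the highest-throughput combination per direction. A trivial helper that performs that selection; the row format here is hypothetical:

```python
def best_combo(rows):
    """Pick the row with the highest items/s from a list of matrix rows,
    each a dict with "batch", "concurrency", and "items_per_s" keys."""
    return max(rows, key=lambda row: row["items_per_s"])
```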
## opus-mt-en-zh en->zh

- Direction: `en -> zh`
- Column: `title`
- Loaded rows: `2048`
- Load time: `3.1612 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 4.53 | 4.53 | 220.59 | 177.37 | 411.57 | 0.285 |
| 4 | 256 | 64 | 10.12 | 2.53 | 395.37 | 319.23 | 761.49 | 0.307 |
| 8 | 256 | 32 | 14.63 | 1.83 | 546.88 | 390.81 | 1930.85 | 0.326 |
| 16 | 256 | 16 | 24.33 | 1.52 | 657.6 | 451.22 | 2098.54 | 0.37 |
| 32 | 256 | 8 | 33.91 | 1.06 | 943.59 | 603.45 | 2152.28 | 0.462 |
| 64 | 256 | 4 | 42.47 | 0.66 | 1507.01 | 1531.43 | 2371.85 | 0.644 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 3.6 | 3.6 | 277.73 | 145.85 | 1180.37 | 0.644 |
| 2 | 32 | 32 | 3.55 | 3.55 | 559.38 | 346.71 | 1916.96 | 0.644 |
| 4 | 32 | 32 | 3.53 | 3.53 | 997.71 | 721.04 | 2944.17 | 0.644 |
| 8 | 32 | 32 | 3.51 | 3.51 | 1644.28 | 1590.93 | 3632.99 | 0.644 |
| 16 | 32 | 32 | 3.5 | 3.5 | 2600.18 | 2586.34 | 5554.04 | 0.644 |
| 64 | 32 | 32 | 3.52 | 3.52 | 3366.52 | 2780.0 | 7950.41 | 0.644 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 3.54 | 3.54 | 282.22 | 148.26 | 1200.93 | 0.644 |
| 1 | 2 | 32 | 32 | 3.52 | 3.52 | 563.85 | 346.8 | 1957.63 | 0.644 |
| 1 | 4 | 32 | 32 | 3.52 | 3.52 | 1002.79 | 728.36 | 2949.99 | 0.644 |
| 1 | 8 | 32 | 32 | 3.55 | 3.55 | 1611.52 | 1462.84 | 3653.34 | 0.644 |
| 1 | 16 | 32 | 32 | 3.49 | 3.49 | 2582.77 | 2594.49 | 5504.07 | 0.644 |
| 1 | 64 | 32 | 32 | 3.53 | 3.53 | 3327.65 | 2753.11 | 7907.54 | 0.644 |
| 4 | 1 | 128 | 32 | 10.31 | 2.58 | 387.75 | 329.42 | 675.07 | 0.644 |
| 4 | 2 | 128 | 32 | 10.12 | 2.53 | 784.49 | 713.61 | 1561.27 | 0.644 |
| 4 | 4 | 128 | 32 | 10.08 | 2.52 | 1543.42 | 1438.42 | 2735.85 | 0.644 |
| 4 | 8 | 128 | 32 | 10.09 | 2.52 | 2918.06 | 2954.39 | 4327.47 | 0.644 |
| 4 | 16 | 128 | 32 | 10.16 | 2.54 | 5087.62 | 5734.85 | 7343.34 | 0.644 |
| 8 | 1 | 256 | 32 | 14.49 | 1.81 | 552.16 | 402.18 | 1963.8 | 0.644 |
| 8 | 2 | 256 | 32 | 14.54 | 1.82 | 1088.96 | 845.04 | 2565.11 | 0.644 |
| 8 | 4 | 256 | 32 | 14.48 | 1.81 | 2143.18 | 1832.69 | 4487.97 | 0.644 |
| 8 | 8 | 256 | 32 | 14.51 | 1.81 | 4051.04 | 3734.16 | 5976.18 | 0.644 |
| 8 | 16 | 256 | 32 | 14.55 | 1.82 | 6626.6 | 6868.91 | 9395.9 | 0.644 |
| 16 | 1 | 512 | 32 | 19.11 | 1.19 | 837.36 | 457.51 | 2030.2 | 0.661 |
| 16 | 2 | 512 | 32 | 19.26 | 1.2 | 1599.87 | 1141.06 | 2802.44 | 0.661 |
| 16 | 4 | 512 | 32 | 19.1 | 1.19 | 3130.81 | 3166.45 | 4987.49 | 0.661 |
| 16 | 8 | 512 | 32 | 19.27 | 1.2 | 5735.13 | 5351.82 | 8417.85 | 0.661 |
| 32 | 1 | 1024 | 32 | 26.04 | 0.81 | 1228.66 | 648.06 | 2167.39 | 0.751 |
| 32 | 2 | 1024 | 32 | 26.21 | 0.82 | 2372.52 | 2593.74 | 4206.96 | 0.751 |
| 32 | 4 | 1024 | 32 | 26.11 | 0.82 | 4499.96 | 3977.65 | 6920.57 | 0.751 |
| 64 | 1 | 2048 | 32 | 34.94 | 0.55 | 1831.48 | 2366.24 | 2473.74 | 0.942 |
| 64 | 2 | 2048 | 32 | 34.75 | 0.54 | 3603.99 | 3281.6 | 4948.72 | 0.942 |
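The table columns above are related by simple identities (assumed definitions, consistent with the numbers shown): Items/s = Rows / wall seconds, Req/s = Requests / wall seconds, so Items/s ≈ Req/s × Batch. A sketch:

```python
def throughput(rows: int, requests: int, wall_seconds: float):
    """Compute the two throughput columns from a case's row count,
    request count, and wall-clock duration."""
    items_per_s = rows / wall_seconds
    req_per_s = requests / wall_seconds
    return items_per_s, req_per_s
```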
scripts/benchmark_translation_local_models.py
@@ -4,6 +4,7 @@ @@ -4,6 +4,7 @@
4 from __future__ import annotations 4 from __future__ import annotations
5 5
6 import argparse 6 import argparse
  7 +import concurrent.futures
7 import copy 8 import copy
8 import csv 9 import csv
9 import json 10 import json
@@ -16,7 +17,7 @@ import sys @@ -16,7 +17,7 @@ import sys
16 import time 17 import time
17 from datetime import datetime 18 from datetime import datetime
18 from pathlib import Path 19 from pathlib import Path
19 -from typing import Any, Dict, Iterable, List 20 +from typing import Any, Dict, Iterable, List, Sequence
20 21
21 import torch 22 import torch
22 import transformers 23 import transformers
@@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402 @@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402
30 from translation.settings import get_translation_capability # noqa: E402 31 from translation.settings import get_translation_capability # noqa: E402
31 32
32 33
  34 +DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64]
  35 +DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64]
  36 +
33 SCENARIOS: List[Dict[str, str]] = [ 37 SCENARIOS: List[Dict[str, str]] = [
34 { 38 {
35 "name": "nllb-200-distilled-600m zh->en", 39 "name": "nllb-200-distilled-600m zh->en",
@@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [ @@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [
69 def parse_args() -> argparse.Namespace: 73 def parse_args() -> argparse.Namespace:
70 parser = argparse.ArgumentParser(description="Benchmark local translation models") 74 parser = argparse.ArgumentParser(description="Benchmark local translation models")
71 parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path") 75 parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path")
72 - parser.add_argument("--limit", type=int, default=0, help="Limit rows for faster experiments; 0 means all") 76 + parser.add_argument("--limit", type=int, default=0, help="Limit rows for baseline or single-case run; 0 means all")
73 parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports") 77 parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports")
74 parser.add_argument("--single", action="store_true", help="Run a single scenario in-process") 78 parser.add_argument("--single", action="store_true", help="Run a single scenario in-process")
75 parser.add_argument("--model", default="", help="Model name for --single mode") 79 parser.add_argument("--model", default="", help="Model name for --single mode")
@@ -84,9 +88,67 @@ def parse_args() -&gt; argparse.Namespace: @@ -84,9 +88,67 @@ def parse_args() -&gt; argparse.Namespace:
84 parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams") 88 parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams")
85 parser.add_argument("--attn-implementation", default="", help="Override attention implementation, for example sdpa") 89 parser.add_argument("--attn-implementation", default="", help="Override attention implementation, for example sdpa")
86 parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring") 90 parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring")
  91 + parser.add_argument("--disable-cache", action="store_true", help="Disable translation cache during benchmarks")
  92 + parser.add_argument(
  93 + "--suite",
  94 + choices=["baseline", "extended"],
  95 + default="baseline",
  96 + help="baseline keeps the previous all-scenarios summary; extended adds batch/concurrency/matrix sweeps",
  97 + )
  98 + parser.add_argument(
  99 + "--batch-size-list",
  100 + default="",
  101 + help="Comma-separated batch sizes for extended suite; default 1,4,8,16,32,64",
  102 + )
  103 + parser.add_argument(
  104 + "--concurrency-list",
  105 + default="",
  106 + help="Comma-separated concurrency levels for extended suite; default 1,2,4,8,16,64",
  107 + )
  108 + parser.add_argument(
  109 + "--serial-items-per-case",
  110 + type=int,
  111 + default=512,
  112 + help="Items per batch-size case in extended suite",
  113 + )
  114 + parser.add_argument(
  115 + "--concurrency-requests-per-case",
  116 + type=int,
  117 + default=128,
  118 + help="Requests per concurrency or matrix case in extended suite",
  119 + )
  120 + parser.add_argument(
  121 + "--concurrency-batch-size",
  122 + type=int,
  123 + default=1,
  124 + help="Batch size used by the dedicated concurrency sweep",
  125 + )
  126 + parser.add_argument(
  127 + "--max-batch-concurrency-product",
  128 + type=int,
  129 + default=128,
  130 + help="Skip matrix cases where batch_size * concurrency exceeds this value; 0 disables the limit",
  131 + )
87 return parser.parse_args() 132 return parser.parse_args()
88 133
89 134
  135 +def parse_csv_ints(raw: str, fallback: Sequence[int]) -> List[int]:
  136 + if not raw.strip():
  137 + return list(fallback)
  138 + values: List[int] = []
  139 + for item in raw.split(","):
  140 + stripped = item.strip()
  141 + if not stripped:
  142 + continue
  143 + value = int(stripped)
  144 + if value <= 0:
  145 + raise ValueError(f"Expected positive integer, got {value}")
  146 + values.append(value)
  147 + if not values:
  148 + raise ValueError("Parsed empty integer list")
  149 + return values
  150 +
  151 +
90 def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: 152 def load_texts(csv_path: Path, column: str, limit: int) -> List[str]:
91 texts: List[str] = [] 153 texts: List[str] = []
92 with csv_path.open("r", encoding="utf-8") as handle: 154 with csv_path.open("r", encoding="utf-8") as handle:
@@ -102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -&gt; List[str]: @@ -102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -&gt; List[str]:
102 return texts 164 return texts
103 165
104 166
105 -def batched(values: List[str], batch_size: int) -> Iterable[List[str]]: 167 +def batched(values: Sequence[str], batch_size: int) -> Iterable[List[str]]:
106 for start in range(0, len(values), batch_size): 168 for start in range(0, len(values), batch_size):
107 - yield values[start:start + batch_size] 169 + yield list(values[start:start + batch_size])
108 170
109 171
110 def percentile(values: List[float], p: float) -> float: 172 def percentile(values: List[float], p: float) -> float:
@@ -148,15 +210,34 @@ def build_environment_info() -> Dict[str, Any]:
     }
 
 
-def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
-    csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
+def scenario_from_args(args: argparse.Namespace) -> Dict[str, str]:
+    return {
+        "name": f"{args.model} {args.source_lang}->{args.target_lang}",
+        "model": args.model,
+        "source_lang": args.source_lang,
+        "target_lang": args.target_lang,
+        "column": args.column,
+        "scene": args.scene,
+    }
+
+
+def build_config_and_capability(
+    args: argparse.Namespace,
+    *,
+    batch_size_override: int | None = None,
+) -> tuple[Dict[str, Any], Dict[str, Any]]:
     config = copy.deepcopy(get_translation_config())
+    for name, cfg in config["capabilities"].items():
+        cfg["enabled"] = name == args.model
+    config["default_model"] = args.model
     capability = get_translation_capability(config, args.model, require_enabled=False)
     if args.device_override:
         capability["device"] = args.device_override
     if args.torch_dtype_override:
         capability["torch_dtype"] = args.torch_dtype_override
-    if args.batch_size:
+    if batch_size_override is not None:
+        capability["batch_size"] = batch_size_override
+    elif args.batch_size:
         capability["batch_size"] = args.batch_size
     if args.max_new_tokens:
         capability["max_new_tokens"] = args.max_new_tokens
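`build_config_and_capability` resolves the effective batch size with a three-level precedence: an explicit per-case override, then the CLI `--batch-size`, then the capability config. A distilled sketch of just that precedence (the `resolve_batch_size` helper name is hypothetical, introduced only for illustration):

```python
from typing import Optional


def resolve_batch_size(
    capability_batch_size: Optional[int],
    arg_batch_size: Optional[int],
    batch_size_override: Optional[int] = None,
) -> int:
    # Same precedence as in the diff: a per-case override wins even when it
    # differs from the CLI value; 1 is the fallback when nothing is set.
    if batch_size_override is not None:
        return batch_size_override
    if arg_batch_size:
        return arg_batch_size
    return int(capability_batch_size or 1)


assert resolve_batch_size(16, None) == 16
assert resolve_batch_size(16, 8) == 8
assert resolve_batch_size(16, 8, batch_size_override=32) == 32
assert resolve_batch_size(None, None) == 1
```

The per-case override is what lets the extended suite sweep batch sizes without mutating the shared CLI arguments.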
@@ -164,28 +245,59 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
         capability["num_beams"] = args.num_beams
     if args.attn_implementation:
         capability["attn_implementation"] = args.attn_implementation
+    if args.disable_cache:
+        capability["use_cache"] = False
     config["capabilities"][args.model] = capability
-    configured_batch_size = int(capability.get("batch_size") or 1)
-    batch_size = configured_batch_size
-    texts = load_texts(csv_path, args.column, args.limit)
+    return config, capability
 
-    service = TranslationService(config)
+
+def ensure_cuda_stats_reset() -> None:
     if torch.cuda.is_available():
         torch.cuda.empty_cache()
         torch.cuda.reset_peak_memory_stats()
 
-    load_start = time.perf_counter()
-    backend = service.get_backend(args.model)
-    load_seconds = time.perf_counter() - load_start
 
-    warmup_batches = min(max(args.warmup_batches, 0), max(1, math.ceil(len(texts) / batch_size)))
-    for batch in list(batched(texts, batch_size))[:warmup_batches]:
+def build_memory_metrics() -> Dict[str, Any]:
+    peak_gpu_mem_gb = None
+    peak_gpu_reserved_gb = None
+    if torch.cuda.is_available():
+        peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3)
+        peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3)
+    max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2)
+    return {
+        "max_rss_mb": max_rss_mb,
+        "peak_gpu_memory_gb": peak_gpu_mem_gb,
+        "peak_gpu_reserved_gb": peak_gpu_reserved_gb,
+    }
+
+
+def make_request_payload(batch: Sequence[str]) -> str | List[str]:
+    if len(batch) == 1:
+        return batch[0]
+    return list(batch)
+
+
+def benchmark_serial_case(
+    *,
+    service: TranslationService,
+    backend: Any,
+    scenario: Dict[str, str],
+    capability: Dict[str, Any],
+    texts: List[str],
+    batch_size: int,
+    warmup_batches: int,
+) -> Dict[str, Any]:
+    backend.batch_size = batch_size
+    measured_batches = list(batched(texts, batch_size))
+    warmup_count = min(max(warmup_batches, 0), len(measured_batches))
+
+    for batch in measured_batches[:warmup_count]:
         service.translate(
-            text=batch,
-            source_lang=args.source_lang,
-            target_lang=args.target_lang,
-            model=args.model,
-            scene=args.scene,
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
         )
 
     batch_latencies_ms: List[float] = []
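The warmup loop above routes every batch through `make_request_payload`, which collapses a one-element batch to a bare string so that `batch_size=1` runs exercise the same single-text request shape online callers use. A standalone sketch of that helper:

```python
from typing import List, Sequence, Union


def make_request_payload(batch: Sequence[str]) -> Union[str, List[str]]:
    # One item -> bare string (single-request path); otherwise a plain list.
    if len(batch) == 1:
        return batch[0]
    return list(batch)


assert make_request_payload(["你好"]) == "你好"
assert make_request_payload(("a", "b")) == ["a", "b"]
```

This is also why the measurement loop has to accept either a single string or a list back from `service.translate` and normalize it into `result_items`.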
@@ -193,86 +305,318 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
     failure_count = 0
     output_chars = 0
     total_input_chars = sum(len(text) for text in texts)
-    measured_batches = list(batched(texts, batch_size))
 
     start = time.perf_counter()
     for batch in measured_batches:
         batch_start = time.perf_counter()
         outputs = service.translate(
-            text=batch,
-            source_lang=args.source_lang,
-            target_lang=args.target_lang,
-            model=args.model,
-            scene=args.scene,
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
         )
         elapsed_ms = (time.perf_counter() - batch_start) * 1000
         batch_latencies_ms.append(elapsed_ms)
 
-        if not isinstance(outputs, list):
-            raise RuntimeError(f"Expected list output for batch translation, got {type(outputs)!r}")
-        for item in outputs:
+        if isinstance(outputs, list):
+            result_items = outputs
+        else:
+            result_items = [outputs]
+        for item in result_items:
             if item is None:
                 failure_count += 1
             else:
                 success_count += 1
                 output_chars += len(item)
     translate_seconds = time.perf_counter() - start
+    total_items = len(texts)
+    memory = build_memory_metrics()
 
-    peak_gpu_mem_gb = None
-    peak_gpu_reserved_gb = None
-    if torch.cuda.is_available():
-        peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3)
-        peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3)
+    return {
+        "mode": "serial_batch",
+        "batch_size": batch_size,
+        "concurrency": 1,
+        "rows": total_items,
+        "requests": len(measured_batches),
+        "input_chars": total_input_chars,
+        "load_seconds": 0.0,
+        "translate_seconds": round(translate_seconds, 4),
+        "total_seconds": round(translate_seconds, 4),
+        "batch_count": len(batch_latencies_ms),
+        "request_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
+        "request_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
+        "request_latency_max_ms": round(max(batch_latencies_ms), 2),
+        "avg_request_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
+        "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3),
+        "requests_per_second": round(len(measured_batches) / translate_seconds, 2),
+        "items_per_second": round(total_items / translate_seconds, 2),
+        "input_chars_per_second": round(total_input_chars / translate_seconds, 2),
+        "output_chars_per_second": round(output_chars / translate_seconds, 2),
+        "success_count": success_count,
+        "failure_count": failure_count,
+        "success_rate": round(success_count / total_items, 6),
+        "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
+        "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
+        "configured_batch_size": int(capability.get("batch_size") or batch_size),
+        "used_batch_size": batch_size,
+        "warmup_batches": warmup_count,
+        **memory,
+    }
 
-    max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2)
-    total_items = len(texts)
+
+def benchmark_concurrency_case(
+    *,
+    service: TranslationService,
+    backend: Any,
+    scenario: Dict[str, str],
+    capability: Dict[str, Any],
+    texts: List[str],
+    batch_size: int,
+    concurrency: int,
+    requests_per_case: int,
+    warmup_batches: int,
+) -> Dict[str, Any]:
+    backend.batch_size = batch_size
+    required_items = batch_size * requests_per_case
+    case_texts = texts[:required_items]
+    request_batches = list(batched(case_texts, batch_size))
+    if not request_batches:
+        raise ValueError("No request batches prepared for concurrency benchmark")
+    warmup_count = min(max(warmup_batches, 0), len(request_batches))
+
+    for batch in request_batches[:warmup_count]:
+        service.translate(
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
+        )
+
+    request_latencies_ms: List[float] = []
+    success_count = 0
+    failure_count = 0
+    output_chars = 0
+    total_input_chars = sum(len(text) for text in case_texts)
+
+    def worker(batch: List[str]) -> tuple[float, int, int, int]:
+        started = time.perf_counter()
+        outputs = service.translate(
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
+        )
+        elapsed_ms = (time.perf_counter() - started) * 1000
+        if isinstance(outputs, list):
+            result_items = outputs
+        else:
+            result_items = [outputs]
+        local_success = 0
+        local_failure = 0
+        local_output_chars = 0
+        for item in result_items:
+            if item is None:
+                local_failure += 1
+            else:
+                local_success += 1
+                local_output_chars += len(item)
+        return elapsed_ms, local_success, local_failure, local_output_chars
+
+    wall_start = time.perf_counter()
+    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
+        futures = [executor.submit(worker, batch) for batch in request_batches]
+        for future in concurrent.futures.as_completed(futures):
+            latency_ms, local_success, local_failure, local_output_chars = future.result()
+            request_latencies_ms.append(latency_ms)
+            success_count += local_success
+            failure_count += local_failure
+            output_chars += local_output_chars
+    wall_seconds = time.perf_counter() - wall_start
+    total_items = len(case_texts)
+    memory = build_memory_metrics()
 
     return {
-        "scenario": {
-            "name": f"{args.model} {args.source_lang}->{args.target_lang}",
-            "model": args.model,
-            "source_lang": args.source_lang,
-            "target_lang": args.target_lang,
-            "column": args.column,
-            "scene": args.scene,
+        "mode": "concurrency",
+        "batch_size": batch_size,
+        "concurrency": concurrency,
+        "rows": total_items,
+        "requests": len(request_batches),
+        "input_chars": total_input_chars,
+        "load_seconds": 0.0,
+        "translate_seconds": round(wall_seconds, 4),
+        "total_seconds": round(wall_seconds, 4),
+        "batch_count": len(request_latencies_ms),
+        "request_latency_p50_ms": round(percentile(request_latencies_ms, 0.50), 2),
+        "request_latency_p95_ms": round(percentile(request_latencies_ms, 0.95), 2),
+        "request_latency_max_ms": round(max(request_latencies_ms), 2),
+        "avg_request_latency_ms": round(statistics.fmean(request_latencies_ms), 2),
+        "avg_item_latency_ms": round((wall_seconds / total_items) * 1000, 3),
+        "requests_per_second": round(len(request_batches) / wall_seconds, 2),
+        "items_per_second": round(total_items / wall_seconds, 2),
+        "input_chars_per_second": round(total_input_chars / wall_seconds, 2),
+        "output_chars_per_second": round(output_chars / wall_seconds, 2),
+        "success_count": success_count,
+        "failure_count": failure_count,
+        "success_rate": round(success_count / total_items, 6),
+        "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
+        "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
+        "configured_batch_size": int(capability.get("batch_size") or batch_size),
+        "used_batch_size": batch_size,
+        "warmup_batches": warmup_count,
+        **memory,
+    }
+
+
+def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
+    csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
+    scenario = scenario_from_args(args)
+    config, capability = build_config_and_capability(args)
+    configured_batch_size = int(capability.get("batch_size") or 1)
+    batch_size = configured_batch_size
+    texts = load_texts(csv_path, args.column, args.limit)
+
+    ensure_cuda_stats_reset()
+    load_start = time.perf_counter()
+    service = TranslationService(config)
+    backend = service.get_backend(args.model)
+    load_seconds = time.perf_counter() - load_start
+
+    runtime = benchmark_serial_case(
+        service=service,
+        backend=backend,
+        scenario=scenario,
+        capability=capability,
+        texts=texts,
+        batch_size=batch_size,
+        warmup_batches=args.warmup_batches,
+    )
+    runtime["load_seconds"] = round(load_seconds, 4)
+    runtime["total_seconds"] = round(runtime["load_seconds"] + runtime["translate_seconds"], 4)
+
+    return {
+        "scenario": scenario,
+        "dataset": {
+            "csv_path": str(csv_path),
+            "rows": len(texts),
+            "input_chars": sum(len(text) for text in texts),
         },
+        "runtime": runtime,
+    }
+
+
+def benchmark_extended_scenario(args: argparse.Namespace) -> Dict[str, Any]:
+    csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
+    scenario = scenario_from_args(args)
+    batch_sizes = parse_csv_ints(args.batch_size_list, DEFAULT_BATCH_SIZES)
+    concurrencies = parse_csv_ints(args.concurrency_list, DEFAULT_CONCURRENCIES)
+    largest_batch = max(batch_sizes + [args.concurrency_batch_size])
+    largest_concurrency = max(concurrencies)
+    max_product = args.max_batch_concurrency_product
+    required_items = max(
+        args.limit or 0,
+        max(args.serial_items_per_case, largest_batch),
+        args.concurrency_requests_per_case * args.concurrency_batch_size,
+        largest_batch * args.concurrency_requests_per_case,
+    )
+    texts = load_texts(csv_path, args.column, required_items)
+    config, capability = build_config_and_capability(args)
+
+    ensure_cuda_stats_reset()
+    load_start = time.perf_counter()
+    service = TranslationService(config)
+    backend = service.get_backend(args.model)
+    load_seconds = time.perf_counter() - load_start
+
+    batch_sweep: List[Dict[str, Any]] = []
+    concurrency_sweep: List[Dict[str, Any]] = []
+    matrix_results: List[Dict[str, Any]] = []
+
+    for batch_size in batch_sizes:
+        case_texts = texts[: max(batch_size, args.serial_items_per_case)]
+        batch_sweep.append(
+            benchmark_serial_case(
+                service=service,
+                backend=backend,
+                scenario=scenario,
+                capability=capability,
+                texts=case_texts,
+                batch_size=batch_size,
+                warmup_batches=args.warmup_batches,
+            )
+        )
+
+    for concurrency in concurrencies:
+        concurrency_sweep.append(
+            benchmark_concurrency_case(
+                service=service,
+                backend=backend,
+                scenario=scenario,
+                capability=capability,
+                texts=texts,
+                batch_size=args.concurrency_batch_size,
+                concurrency=concurrency,
+                requests_per_case=args.concurrency_requests_per_case,
+                warmup_batches=args.warmup_batches,
+            )
+        )
+
+    for batch_size in batch_sizes:
+        for concurrency in concurrencies:
+            if max_product > 0 and batch_size * concurrency > max_product:
+                continue
+            matrix_results.append(
+                benchmark_concurrency_case(
+                    service=service,
+                    backend=backend,
+                    scenario=scenario,
+                    capability=capability,
+                    texts=texts,
+                    batch_size=batch_size,
+                    concurrency=concurrency,
+                    requests_per_case=args.concurrency_requests_per_case,
+                    warmup_batches=args.warmup_batches,
+                )
+            )
+
+    for collection in (batch_sweep, concurrency_sweep, matrix_results):
+        for idx, item in enumerate(collection):
+            item["load_seconds"] = round(load_seconds if idx == 0 else 0.0, 4)
+            item["total_seconds"] = round(item["load_seconds"] + item["translate_seconds"], 4)
+
+    return {
+        "scenario": scenario,
         "dataset": {
             "csv_path": str(csv_path),
-            "rows": total_items,
-            "input_chars": total_input_chars,
+            "rows_loaded": len(texts),
+        },
+        "config": {
+            "batch_sizes": batch_sizes,
+            "concurrencies": concurrencies,
+            "serial_items_per_case": args.serial_items_per_case,
+            "concurrency_requests_per_case": args.concurrency_requests_per_case,
+            "concurrency_batch_size": args.concurrency_batch_size,
+            "max_batch_concurrency_product": max_product,
+            "cache_disabled": bool(args.disable_cache),
         },
-        "runtime": {
+        "runtime_defaults": {
             "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
             "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
-            "configured_batch_size": configured_batch_size,
-            "used_batch_size": batch_size,
-            "warmup_batches": warmup_batches,
+            "configured_batch_size": int(capability.get("batch_size") or 1),
            "load_seconds": round(load_seconds, 4),
-            "translate_seconds": round(translate_seconds, 4),
-            "total_seconds": round(load_seconds + translate_seconds, 4),
-            "batch_count": len(batch_latencies_ms),
-            "first_batch_ms": round(batch_latencies_ms[0], 2),
-            "batch_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
-            "batch_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
-            "batch_latency_max_ms": round(max(batch_latencies_ms), 2),
-            "avg_batch_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
-            "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3),
-            "items_per_second": round(total_items / translate_seconds, 2),
-            "input_chars_per_second": round(total_input_chars / translate_seconds, 2),
-            "output_chars_per_second": round(output_chars / translate_seconds, 2),
-            "success_count": success_count,
-            "failure_count": failure_count,
-            "success_rate": round(success_count / total_items, 6),
-            "max_rss_mb": max_rss_mb,
-            "peak_gpu_memory_gb": peak_gpu_mem_gb,
-            "peak_gpu_reserved_gb": peak_gpu_reserved_gb,
         },
+        "batch_sweep": batch_sweep,
+        "concurrency_sweep": concurrency_sweep,
+        "matrix": matrix_results,
     }
 
 
 def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
     report = {
         "generated_at": datetime.now().isoformat(timespec="seconds"),
+        "suite": args.suite,
         "environment": build_environment_info(),
         "scenarios": [],
     }
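`benchmark_concurrency_case` takes per-request latency inside each worker and throughput from the wall time around the whole pool. The skeleton of that measurement, with a hypothetical `fake_translate` standing in for `service.translate` (a fixed sleep models the backend serializing work behind its single-model lock):

```python
import concurrent.futures
import statistics
import time


def fake_translate(batch):
    # Stand-in for service.translate; a fixed delay per request.
    time.sleep(0.01)
    return list(batch)


def run_concurrent(batches, concurrency):
    # Per-request latency is measured inside the worker; throughput comes
    # from the wall time around the whole pool, as in the diff.
    def worker(batch):
        started = time.perf_counter()
        fake_translate(batch)
        return (time.perf_counter() - started) * 1000

    latencies_ms = []
    wall_start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(worker, batch) for batch in batches]
        for future in concurrent.futures.as_completed(futures):
            latencies_ms.append(future.result())
    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, round(statistics.fmean(latencies_ms), 2)


wall_seconds, avg_ms = run_concurrent([["x"]] * 8, concurrency=4)
```

With a real backend that holds a per-model lock, raising `concurrency` here mostly stretches the in-worker latencies (queueing) while `wall_seconds` barely improves, which is the p95-inflation effect the commit message describes.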
@@ -296,11 +640,25 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
             scenario["scene"],
             "--warmup-batches",
             str(args.warmup_batches),
+            "--suite",
+            args.suite,
+            "--serial-items-per-case",
+            str(args.serial_items_per_case),
+            "--concurrency-requests-per-case",
+            str(args.concurrency_requests_per_case),
+            "--concurrency-batch-size",
+            str(args.concurrency_batch_size),
+            "--max-batch-concurrency-product",
+            str(args.max_batch_concurrency_product),
         ]
         if args.limit:
             cmd.extend(["--limit", str(args.limit)])
         if args.batch_size:
             cmd.extend(["--batch-size", str(args.batch_size)])
+        if args.batch_size_list:
+            cmd.extend(["--batch-size-list", args.batch_size_list])
+        if args.concurrency_list:
+            cmd.extend(["--concurrency-list", args.concurrency_list])
         if args.device_override:
             cmd.extend(["--device-override", args.device_override])
         if args.torch_dtype_override:
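`benchmark_extended_scenario` reads `--batch-size-list` / `--concurrency-list` through a `parse_csv_ints` helper whose definition falls outside this diff. A plausible minimal sketch, with the fallback-to-defaults behavior inferred from the call sites (an assumption, not the confirmed implementation):

```python
from typing import List, Optional, Sequence


def parse_csv_ints(raw: Optional[str], defaults: Sequence[int]) -> List[int]:
    # Illustrative parser for flags like "--batch-size-list 1,4,8,16":
    # an absent or empty flag falls back to the suite defaults.
    if not raw:
        return list(defaults)
    return [int(part) for part in raw.split(",") if part.strip()]


assert parse_csv_ints(None, [1, 4, 8, 16, 32, 64]) == [1, 4, 8, 16, 32, 64]
assert parse_csv_ints("1,2,4", []) == [1, 2, 4]
```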
@@ -311,6 +669,8 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
             cmd.extend(["--num-beams", str(args.num_beams)])
         if args.attn_implementation:
             cmd.extend(["--attn-implementation", args.attn_implementation])
+        if args.disable_cache:
+            cmd.append("--disable-cache")
 
         completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
         result_line = ""
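After `subprocess.run`, the parent scans the child's stdout for the `JSON_RESULT=` line that `main()` prints in single-scenario mode. A minimal sketch of that extraction (the `extract_json_result` helper name is illustrative; the actual loop in the script may differ):

```python
import json


def extract_json_result(stdout: str) -> dict:
    # Scan from the end so model-loading logs before the marker are skipped.
    for line in reversed(stdout.splitlines()):
        if line.startswith("JSON_RESULT="):
            return json.loads(line[len("JSON_RESULT="):])
    raise RuntimeError("no JSON_RESULT line found in child output")


demo_stdout = 'loading model...\nJSON_RESULT={"items_per_second": 42.5}\n'
result = extract_json_result(demo_stdout)
```

Running each scenario in a child process this way keeps model loads isolated, so `load_seconds` is not polluted by previously loaded backends.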
@@ -327,11 +687,12 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
     return report
 
 
-def render_markdown_report(report: Dict[str, Any]) -> str:
+def render_baseline_markdown_report(report: Dict[str, Any]) -> str:
     lines = [
         "# Local Translation Model Benchmark",
         "",
         f"- Generated at: `{report['generated_at']}`",
+        f"- Suite: `{report['suite']}`",
         f"- Python: `{report['environment']['python']}`",
         f"- Torch: `{report['environment']['torch']}`",
         f"- Transformers: `{report['environment']['transformers']}`",
@@ -342,19 +703,19 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
     lines.extend(
         [
             "",
-            "| Scenario | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Load s | Peak GPU GiB | Success |",
+            "| Scenario | Items/s | Avg item ms | Req p50 ms | Req p95 ms | Load s | Peak GPU GiB | Success |",
             "|---|---:|---:|---:|---:|---:|---:|---:|",
         ]
     )
     for item in report["scenarios"]:
         runtime = item["runtime"]
         lines.append(
-            "| {name} | {items_per_second} | {avg_item_latency_ms} | {batch_latency_p50_ms} | {batch_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format(
+            "| {name} | {items_per_second} | {avg_item_latency_ms} | {request_latency_p50_ms} | {request_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format(
                 name=item["scenario"]["name"],
                 items_per_second=runtime["items_per_second"],
                 avg_item_latency_ms=runtime["avg_item_latency_ms"],
-                batch_latency_p50_ms=runtime["batch_latency_p50_ms"],
-                batch_latency_p95_ms=runtime["batch_latency_p95_ms"],
+                request_latency_p50_ms=runtime["request_latency_p50_ms"],
+                request_latency_p95_ms=runtime["request_latency_p95_ms"],
                 load_seconds=runtime["load_seconds"],
                 peak_gpu_memory_gb=runtime["peak_gpu_memory_gb"],
                 success_rate=runtime["success_rate"],
@@ -375,7 +736,7 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
             f"- Load time: `{runtime['load_seconds']} s`",
             f"- Translate time: `{runtime['translate_seconds']} s`",
             f"- Throughput: `{runtime['items_per_second']} items/s`, `{runtime['input_chars_per_second']} input chars/s`",
-            f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, batch p50 `{runtime['batch_latency_p50_ms']} ms`, batch p95 `{runtime['batch_latency_p95_ms']} ms`, batch max `{runtime['batch_latency_max_ms']} ms`",
+            f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, req p50 `{runtime['request_latency_p50_ms']} ms`, req p95 `{runtime['request_latency_p95_ms']} ms`, req max `{runtime['request_latency_max_ms']} ms`",
             f"- Memory: max RSS `{runtime['max_rss_mb']} MB`, peak GPU allocated `{runtime['peak_gpu_memory_gb']} GiB`, peak GPU reserved `{runtime['peak_gpu_reserved_gb']} GiB`",
             f"- Success: `{runtime['success_count']}/{dataset['rows']}`",
             "",
@@ -384,32 +745,145 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
     return "\n".join(lines)
 
 
+def render_case_table(
+    title: str,
+    rows: Sequence[Dict[str, Any]],
+    *,
+    include_batch: bool,
+    include_concurrency: bool,
+) -> List[str]:
+    headers = ["Rows", "Requests", "Items/s", "Req/s", "Avg req ms", "Req p50 ms", "Req p95 ms", "Peak GPU GiB"]
+    prefix_headers: List[str] = []
+    if include_batch:
+        prefix_headers.append("Batch")
+    if include_concurrency:
+        prefix_headers.append("Concurrency")
+    headers = prefix_headers + headers
+    lines = [f"### {title}", ""]
+    lines.append("| " + " | ".join(headers) + " |")
+    lines.append("|" + "|".join(["---:"] * len(headers)) + "|")
+    for item in rows:
+        values: List[str] = []
+        if include_batch:
+            values.append(str(item["batch_size"]))
+        if include_concurrency:
+            values.append(str(item["concurrency"]))
+        values.extend(
+            [
+                str(item["rows"]),
+                str(item["requests"]),
+                str(item["items_per_second"]),
+                str(item["requests_per_second"]),
+                str(item["avg_request_latency_ms"]),
+                str(item["request_latency_p50_ms"]),
+                str(item["request_latency_p95_ms"]),
+                str(item["peak_gpu_memory_gb"]),
+            ]
+        )
+        lines.append("| " + " | ".join(values) + " |")
+    lines.append("")
+    return lines
+
+
+def render_extended_markdown_report(report: Dict[str, Any]) -> str:
+    lines = [
+        "# Local Translation Model Extended Benchmark",
+        "",
+        f"- Generated at: `{report['generated_at']}`",
+        f"- Suite: `{report['suite']}`",
+        f"- Python: `{report['environment']['python']}`",
+        f"- Torch: `{report['environment']['torch']}`",
+        f"- Transformers: `{report['environment']['transformers']}`",
+        f"- CUDA: `{report['environment']['cuda_available']}`",
+    ]
+    if report["environment"]["gpu_name"]:
+        lines.append(f"- GPU: `{report['environment']['gpu_name']}` ({report['environment']['gpu_total_mem_gb']} GiB)")
+
+    lines.extend(
+        [
+            "",
+            "## Reading Guide",
+            "",
+            "- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.",
+            "- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.",
+            "- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.",
+            "",
+        ]
+    )
+
+    for item in report["scenarios"]:
+        lines.extend(
+            [
+                f"## {item['scenario']['name']}",
+                "",
+                f"- Direction: `{item['scenario']['source_lang']} -> {item['scenario']['target_lang']}`",
+                f"- Column: `{item['scenario']['column']}`",
+                f"- Loaded rows: `{item['dataset']['rows_loaded']}`",
+                f"- Load time: `{item['runtime_defaults']['load_seconds']} s`",
+                f"- Device: `{item['runtime_defaults']['device']}`",
+                f"- DType: `{item['runtime_defaults']['torch_dtype']}`",
+                f"- Cache disabled: `{item['config']['cache_disabled']}`",
+                "",
+            ]
+        )
+        lines.extend(render_case_table("Batch Sweep (`concurrency=1`)", item["batch_sweep"], include_batch=True, include_concurrency=False))
+        lines.extend(
+            render_case_table(
+                f"Concurrency Sweep (`batch_size={item['config']['concurrency_batch_size']}`)",
+                item["concurrency_sweep"],
+                include_batch=False,
+                include_concurrency=True,
+            )
+        )
+        lines.extend(render_case_table("Batch x Concurrency Matrix", item["matrix"], include_batch=True, include_concurrency=True))
+    return "\n".join(lines)
+
+
+def render_markdown_report(report: Dict[str, Any]) -> str:
+    if report["suite"] == "extended":
+        return render_extended_markdown_report(report)
+    return render_baseline_markdown_report(report)
+
+
 def main() -> None:
     args = parse_args()
     if args.single:
-        result = benchmark_single_scenario(args)
+        if args.suite == "extended":
+            result = benchmark_extended_scenario(args)
+        else:
+            result = benchmark_single_scenario(args)
         print("JSON_RESULT=" + json.dumps(result, ensure_ascii=False))
         return
 
     report = run_all_scenarios(args)
     output_dir = resolve_output_dir(args.output_dir)
     timestamp = datetime.now().strftime("%H%M%S")
-    json_path = output_dir / f"translation_local_models_{timestamp}.json"
-    md_path = output_dir / f"translation_local_models_{timestamp}.md"
+    suffix = "extended" if args.suite == "extended" else "baseline"
+    json_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.json"
+    md_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.md"
     json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
     md_path.write_text(render_markdown_report(report), encoding="utf-8")
 
     print(f"JSON report: {json_path}")
     print(f"Markdown report: {md_path}")
     for item in report["scenarios"]:
-        runtime = item["runtime"]
-        print(
-            f"{item['scenario']['name']}: "
-            f"{runtime['items_per_second']} items/s | "
-            f"avg_item={runtime['avg_item_latency_ms']} ms | "
-            f"p95_batch={runtime['batch_latency_p95_ms']} ms | "
-            f"load={runtime['load_seconds']} s"
-        )
+        if args.suite == "extended":
+            best_batch = max(item["batch_sweep"], key=lambda x: x["items_per_second"])
+            best_concurrency = max(item["concurrency_sweep"], key=lambda x: x["items_per_second"])
+            print(
+                f"{item['scenario']['name']}: "
+                f"best_batch={best_batch['batch_size']} ({best_batch['items_per_second']} items/s) | "
+                f"best_concurrency={best_concurrency['concurrency']} ({best_concurrency['items_per_second']} items/s @ batch={best_concurrency['batch_size']})"
  877 + )
  878 + else:
  879 + runtime = item["runtime"]
  880 + print(
  881 + f"{item['scenario']['name']}: "
  882 + f"{runtime['items_per_second']} items/s | "
  883 + f"avg_item={runtime['avg_item_latency_ms']} ms | "
  884 + f"p95_req={runtime['request_latency_p95_ms']} ms | "
  885 + f"load={runtime['load_seconds']} s"
  886 + )
413 887
414 888
415 if __name__ == "__main__": 889 if __name__ == "__main__":
translation/README.md
@@ -13,7 +13,7 @@
 - Virtual env setup: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh)
 - Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py)
 - Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
-- Performance report: [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md)
+- Performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
 
 ## 1. Design Goals
 
@@ -530,6 +530,44 @@ curl -X POST http://127.0.0.1:6006/translate \
 Dataset:
 - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
 
+Latest report:
+- Summary: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
+- Full Markdown: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
+- Full JSON: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
+
+### 10.1 Which results to read first
+
+The three groups of results are now reported separately instead of mixed into one table:
+
+- `batch_sweep`
+  Fixes `concurrency=1` and compares single-stream batching performance across `batch_size` values only
+- `concurrency_sweep`
+  Fixes `batch_size=1` and shows the latency and throughput of a "single request" at different concurrency levels
+- `batch x concurrency matrix`
+  Shows the interaction between `batch_size` and `concurrency`; this round is capped at `batch_size * concurrency <= 128`
+
+Recommendations:
+
+- For online query-translation latency, read `concurrency_sweep` first
+- For offline bulk-translation throughput, read `batch_sweep` first
+- For the capacity limit of a single serving worker, then look at the `batch x concurrency matrix`
+
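For reference, the matrix cap described above (`batch_size * concurrency <= 128`) amounts to a product-and-filter over the two sweep lists. The function below is an illustrative sketch, not the benchmark script's actual code; all names are made up for the example:

```python
from itertools import product

# Sweep lists from this round (illustrative constants, not the script's internals).
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 64]
LIMIT = 128  # batch_size * concurrency must not exceed this

def matrix_cases(batch_sizes, concurrency_levels, limit):
    """Yield every (batch_size, concurrency) pair kept by the cap."""
    for bs, cc in product(batch_sizes, concurrency_levels):
        if bs * cc <= limit:
            yield bs, cc

cases = list(matrix_cases(BATCH_SIZES, CONCURRENCY_LEVELS, LIMIT))
# (64, 2) survives because 64 * 2 = 128, while (64, 4) is dropped (256 > 128)
```

With these lists the cap keeps 25 of the 36 raw combinations, which is why the matrix tables are noticeably shorter than a full cross product.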
+### 10.2 Parameters for this round
+
+Test date: `2026-03-18`
+
+Environment:
+- GPU: `Tesla T4 16GB`
+- Python env: `.venv-translator`
+- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
+
+Shared parameters:
+- Cache: disabled (`--disable-cache`) so cache hits cannot distort the numbers
+- `batch_sweep`: `256` items per case
+- `concurrency_sweep`: fixed `batch_size=1`, `32` requests per case
+- `batch x concurrency matrix`: `32` requests per case, keeping only combinations with `batch_size * concurrency <= 128`
+- Warmup: `1` batch
+
 Reproduction command:
 
 ```bash
@@ -537,16 +575,36 @@ cd /data/saas-search
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py
 ```
 
-Single-model reproduction example:
+Reproduction command for this round's extended benchmark:
+
+```bash
+cd /data/saas-search
+./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
+  --suite extended \
+  --disable-cache \
+  --serial-items-per-case 256 \
+  --concurrency-requests-per-case 32 \
+  --concurrency-batch-size 1 \
+  --output-dir perf_reports/20260318/translation_local_models
+```
+
+Single-model extended benchmark example:
 
 ```bash
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
   --single \
+  --suite extended \
   --model opus-mt-zh-en \
   --source-lang zh \
   --target-lang en \
   --column title_cn \
-  --scene sku_name
+  --scene sku_name \
+  --disable-cache \
+  --batch-size-list 1,4,8,16,32,64 \
+  --concurrency-list 1,2,4,8,16,64 \
+  --serial-items-per-case 256 \
+  --concurrency-requests-per-case 32 \
+  --concurrency-batch-size 1
 ```
 
 Single-request latency reproduction:
@@ -554,37 +612,143 @@ cd /data/saas-search
 ```bash
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
   --single \
+  --suite extended \
   --model nllb-200-distilled-600m \
   --source-lang zh \
   --target-lang en \
   --column title_cn \
   --scene sku_name \
-  --batch-size 1 \
-  --limit 100
+  --disable-cache \
+  --batch-size-list 1 \
+  --concurrency-list 1,2,4,8,16,64 \
+  --serial-items-per-case 256 \
+  --concurrency-requests-per-case 32 \
+  --concurrency-batch-size 1
 ```
 
-Notes:
-- For the current script and local backend, a "single request" is directly equivalent to `batch_size=1`
-- In that case the script's `batch_latency_*` metrics can be read as single-request latency
-- Online search query translation should focus on this group, not on large-batch throughput
+### 10.3 Single-stream batch results
+
+This group is `concurrency=1` only; do not read its `request p95` as the p95 of concurrent online requests.
+
+`nllb-200-distilled-600m zh -> en`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 2.91 | 343.488 | 616.27 |
+| 4 | 8.44 | 118.545 | 722.95 |
+| 8 | 14.85 | 67.335 | 728.47 |
+| 16 | 27.28 | 36.662 | 769.18 |
+| 32 | 38.6 | 25.908 | 1369.88 |
+| 64 | 58.3 | 17.152 | 1659.9 |
+
+`nllb-200-distilled-600m en -> zh`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 1.91 | 524.917 | 866.33 |
+| 4 | 4.94 | 202.473 | 1599.74 |
+| 8 | 8.25 | 121.188 | 1632.29 |
+| 16 | 13.52 | 73.956 | 1649.65 |
+| 32 | 21.27 | 47.017 | 1827.16 |
+| 64 | 32.64 | 30.641 | 2031.25 |
 
-Current single-request measurements (`Tesla T4`, `limit=100`):
-- `nllb-200-distilled-600m zh->en`: p50 ≈ `292.54 ms`, p95 ≈ `624.12 ms`, mean ≈ `321.91 ms`
-- `nllb-200-distilled-600m en->zh`: p50 ≈ `481.61 ms`, p95 ≈ `1171.71 ms`, mean ≈ `542.47 ms`
-
-Current benchmark environment:
-- GPU: `Tesla T4 16GB`
-- Python env: `.venv-translator`
-- Data size: `18,576` product titles
+`opus-mt-zh-en zh -> en`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 6.15 | 162.536 | 274.74 |
+| 4 | 15.34 | 65.192 | 356.0 |
+| 8 | 25.51 | 39.202 | 379.84 |
+| 16 | 41.44 | 24.129 | 797.93 |
+| 32 | 54.36 | 18.397 | 1693.31 |
+| 64 | 70.15 | 14.255 | 2161.59 |
+
+`opus-mt-en-zh en -> zh`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 4.53 | 220.598 | 411.57 |
+| 4 | 10.12 | 98.844 | 761.49 |
+| 8 | 14.63 | 68.361 | 1930.85 |
+| 16 | 24.33 | 41.1 | 2098.54 |
+| 32 | 33.91 | 29.487 | 2152.28 |
+| 64 | 42.47 | 23.547 | 2371.85 |
+
+Batch conclusions:
+
+- On raw throughput alone, all 4 directions peak at `batch_size=64`
+- If per-batch tail latency also matters, `batch_size=16` is usually the more balanced default
+- `opus-mt-zh-en` is the fastest model for bulk workloads this round; `nllb en->zh` is the slowest
+
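As a reading aid, the three columns in these tables are tied together by simple identities: `items/s` is items over wall-clock seconds, and for a serial sweep `avg item ms` is exactly `1000 / items_per_second`. The helper below is a hedged sketch of that aggregation using a nearest-rank p95; the benchmark script's exact math may differ:

```python
import math

def summarize_batches(batch_latencies_s, batch_size):
    """Collapse per-request (per-batch) wall times into the table's columns.

    batch_latencies_s: wall-clock seconds observed for each batch in one case.
    """
    total_items = batch_size * len(batch_latencies_s)
    total_s = sum(batch_latencies_s)
    ranked = sorted(batch_latencies_s)
    p95_idx = max(0, math.ceil(0.95 * len(ranked)) - 1)  # nearest-rank p95
    return {
        "items_per_second": round(total_items / total_s, 2),
        "avg_item_ms": round(1000.0 * total_s / total_items, 3),
        "req_p95_ms": round(1000.0 * ranked[p95_idx], 2),
    }

# 20 requests of batch_size=16: nineteen at 500 ms plus one slow 1 s outlier
case = summarize_batches([0.5] * 19 + [1.0], 16)
```

Because `avg item ms` is the reciprocal of `items/s`, those two columns always move together; the independent signal in each row is the `Req p95 ms` tail.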
+### 10.4 Single-request concurrency results
+
+This group fixes `batch_size=1`, so it can be read directly as "how a single request behaves at different concurrency levels".
+
+`nllb-200-distilled-600m zh -> en`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 4.17 | 239.99 | 226.34 | 373.27 |
+| 2 | 4.1 | 477.99 | 459.36 | 703.96 |
+| 4 | 4.1 | 910.74 | 884.71 | 1227.01 |
+| 8 | 4.04 | 1697.73 | 1818.48 | 2383.8 |
+| 16 | 4.07 | 2801.91 | 3473.63 | 4145.92 |
+| 64 | 4.04 | 3714.49 | 3610.08 | 7337.3 |
+
+`nllb-200-distilled-600m en -> zh`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 2.16 | 463.18 | 439.54 | 670.78 |
+| 2 | 2.15 | 920.48 | 908.27 | 1213.3 |
+| 4 | 2.16 | 1759.87 | 1771.58 | 2158.04 |
+| 8 | 2.15 | 3284.44 | 3658.45 | 3971.01 |
+| 16 | 2.14 | 5669.15 | 7117.7 | 7522.48 |
+| 64 | 2.14 | 7631.14 | 7510.97 | 14139.03 |
+
+`opus-mt-zh-en zh -> en`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 9.21 | 108.53 | 91.7 | 179.12 |
+| 2 | 8.92 | 219.19 | 212.29 | 305.34 |
+| 4 | 9.09 | 411.76 | 420.08 | 583.97 |
+| 8 | 8.85 | 784.14 | 835.73 | 1043.06 |
+| 16 | 9.01 | 1278.4 | 1483.34 | 1994.56 |
+| 64 | 8.82 | 1687.08 | 1563.48 | 3381.58 |
+
+`opus-mt-en-zh en -> zh`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 3.6 | 277.73 | 145.85 | 1180.37 |
+| 2 | 3.55 | 559.38 | 346.71 | 1916.96 |
+| 4 | 3.53 | 997.71 | 721.04 | 2944.17 |
+| 8 | 3.51 | 1644.28 | 1590.93 | 3632.99 |
+| 16 | 3.5 | 2600.18 | 2586.34 | 5554.04 |
+| 64 | 3.52 | 3366.52 | 2780.0 | 7950.41 |
 
-Final performance results:
+Concurrency conclusions:
+
+- The current local seq2seq backend serializes inference behind a single per-model lock, so raising client concurrency on one worker barely increases throughput; it mainly converts queueing time into higher request latency
+- If online query translation needs stable latency, keep concurrency low; beyond `8` concurrent requests the p95 of all 4 directions degrades sharply
+- For online traffic, `opus-mt-zh-en` has the most stable latency; `nllb en->zh` is the slowest and shows the strongest tail-latency amplification under concurrency
+
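The flat-throughput / rising-latency pattern in these tables can be reproduced with a toy model of the backend. This is a sketch under the assumption that inference is serialized behind one lock; the names and the 20 ms `sleep` stand-in are illustrative, not the real backend:

```python
import threading
import time

_model_lock = threading.Lock()  # stands in for the backend's single-model lock

def translate_once():
    """One batch_size=1 'inference'; returns the caller-observed latency in seconds."""
    start = time.perf_counter()
    with _model_lock:
        time.sleep(0.02)  # stand-in for ~20 ms of GPU work
    return time.perf_counter() - start

def run(concurrency, total_requests=8):
    """Return (items_per_second, avg_request_latency_s) for one concurrency level."""
    latencies = []
    record = threading.Lock()

    def worker(n):
        for _ in range(n):
            lat = translate_once()
            with record:
                latencies.append(lat)

    t0 = time.perf_counter()
    threads = [
        threading.Thread(target=worker, args=(total_requests // concurrency,))
        for _ in range(concurrency)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - t0
    return total_requests / wall, sum(latencies) / len(latencies)
```

Comparing `run(1)` and `run(4)` shows roughly the same items/s but a clearly higher average request latency at concurrency 4, which is exactly the shape of the tables above: the lock turns extra concurrency into queueing, not throughput.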
+### 10.5 How to read batch x concurrency
+
+The full matrix is in:
+- [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
+
+The matrix answers two questions:
 
-| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
-|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
-| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
-| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
-| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
-| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
+- For offline batch jobs: once `batch_size` is already large, does throughput keep growing at different concurrency levels
+- For serving requests from a single worker: at which `batch_size x concurrency` combination does queueing become obvious
+
+Shared traits of this round's matrix:
+
+- Throughput is driven almost entirely by `batch_size`; `concurrency` is not a meaningful source of gain
+- With `batch_size` fixed, raising `concurrency` from `1` to `2/4/8/...` barely changes `items/s`, while `avg req ms / p95` keeps climbing
+- So the current implementation behaves like a single worker with internally serialized GPU inference, not a service whose throughput scales with client concurrency
 
 NLLB performance-optimization notes:
 
@@ -632,7 +796,7 @@ NLLB performance-optimization notes:
 - Run mode: single worker, to avoid loading the model twice
 
 More detailed performance notes:
-- [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md)
+- [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
 
 ## 11. Development Notes
 