Commit 2a6d9d76f52556f65e4c3291fc402660f6a21817
1 parent
cd4ce66d
Updated the benchmark script and docs so that "single request / single-stream batch / concurrency / batch×concurrency matrix" results are presented fully separately.

Changes:
- scripts/benchmark_translation_local_models.py: added --suite extended, which supports batch_size=1,4,8,16,32,64, concurrency=1,2,4,8,16,64, and the combination matrix constrained by batch_size * concurrency <= 128; single-scenario mode now loads only the target model (so load_seconds is cleaner) and also supports --disable-cache.
- translation/README.md: split the performance section into batch_sweep, concurrency_sweep, and batch x concurrency matrix, and added this rerun's parameters, reproduction commands, and summary tables.
- perf_reports/20260318/translation_local_models/README.md: added a summary of this round of follow-up tests.

Full results are in translation_local_models_extended_221846.md and translation_local_models_extended_221846.json.

The core conclusions of this round are clear:
- Online single requests: read concurrency_sweep, i.e. the table with batch_size fixed at 1.
- Offline bulk throughput: read batch_sweep; all 4 directions hit peak raw throughput at batch_size=64, but batch_size=16 still looks like the more balanced default.
- The current local seq2seq backend holds a single per-model lock, so raising client concurrency barely raises throughput; it mainly turns queue time into a higher p95. Concurrency is therefore a latency-budget question, not a throughput-scaling lever.
- The fastest model for online single requests this round is opus-mt-zh-en; the slowest, with the strongest amplification under concurrency, is nllb-200-distilled-600m en->zh.
Showing 17 changed files with 1138 additions and 117 deletions
docs/TODO.txt
| 1 | 1 | |
| 2 | 2 | |
| 3 | -product_enrich : Partial Mode | |
| 3 | + | |
| 4 | + | |
| 5 | +nllb-200-distilled-600M performance optimization | |
| 6 | +Research performance optimization options for seq2seq, transformer-architecture models such as nllb-200-distilled-600M: ways to raise the throughput and cut the latency of the online translation service; also survey online inference serving solutions and find a high-performance way to serve these models | |
| 7 | + | |
| 8 | +cnclip performance optimization | |
| 9 | + | |
| 10 | +rerank performance optimization | |
| 11 | + | |
| 12 | + | |
| 13 | +Timeouts | |
| 14 | +1) Hard timeout while the query analysis stage waits for translation/embedding | |
| 15 | +Config file: config/config.yaml | |
| 16 | +Config key: query_config.async_wait_timeout_ms: 80 | |
| 17 | +Where it takes effect: query/query_parser.py converts the value to seconds and passes it to wait(...) | |
| 18 | +2) Embedding HTTP call timeout (Text/Image) | |
| 19 | +No environment-variable override is used any more (the previously mentioned EMBEDDING_HTTP_TIMEOUT_SEC has been dropped) | |
| 20 | +Config file: config/config.yaml | |
| 21 | +Config key: services.embedding.providers.http.timeout_sec (a sample default of 60 was added to the YAML) | |
| 22 | +Where it takes effect: | |
| 23 | +embeddings/text_encoder.py: requests.post(..., timeout=self.timeout_sec) | |
| 24 | +embeddings/image_encoder.py: requests.post(..., timeout=self.timeout_sec) | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | +product_enrich : Partial Mode : done | |
| 4 | 30 | https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR |
| 5 | 31 | Set the role of the last message in the messages array to assistant, provide the prefix in its content, and set "partial": true on that message. The messages format is as follows: |
| 6 | 32 | [ |
| ... | ... | @@ -15,7 +41,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-men |
| 15 | 41 | } |
| 16 | 42 | ] |
| 17 | 43 | The model starts generating from the prefix content. |
| 18 | - | |
| 19 | 44 | Supports non-thinking mode. |
| 20 | 45 | |
| 21 | 46 | |
| ... | ... | @@ -41,12 +66,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-men |
| 41 | 66 | |
| 42 | 67 | |
| 43 | 68 | |
| 44 | -The translation cache needs refactoring | |
| 45 | - | |
| 46 | - | |
| 47 | - | |
| 48 | - | |
| 49 | - | |
| 50 | 69 | |
| 51 | 70 | suggest index: currently a full-rebuild script, to be handed over to 金伟 |
| 52 | 71 | ... | ... |
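The partial-mode request described in the TODO above can be sketched as a concrete messages payload. The field names ("role", "content", "partial") follow the Aliyun doc quoted there; the content strings themselves are made up for illustration:

```python
# Sketch of an Aliyun Model Studio partial-mode request body.
# The last message has role "assistant", carries the prefix in its
# content, and sets "partial": true, so the model continues generating
# from that prefix instead of starting from scratch.
messages = [
    {"role": "user", "content": "Please write a product description."},
    {
        "role": "assistant",
        "content": "This stainless steel water bottle",  # generation prefix
        "partial": True,
    },
]
```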
perf_reports/20260318/translation_local_models/README.md
0 → 100644
| ... | ... | @@ -0,0 +1,101 @@ |
| 1 | +# Local Translation Model Benchmark Report | |
| 2 | + | |
| 3 | +Benchmark script: | |
| 4 | +- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) | |
| 5 | + | |
| 6 | +Full results: | |
| 7 | +- Markdown:[`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md) | |
| 8 | +- JSON:[`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json) | |
| 9 | + | |
| 10 | +Test date: | |
| 11 | +- `2026-03-18` | |
| 12 | + | |
| 13 | +Environment: | |
| 14 | +- GPU:`Tesla T4 16GB` | |
| 15 | +- Driver / CUDA:`570.158.01 / 12.8` | |
| 16 | +- Python env:`.venv-translator` | |
| 17 | +- Torch / Transformers:`2.10.0+cu128 / 5.3.0` | |
| 18 | +- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv) | |
| 19 | + | |
| 20 | +## Method | |
| 21 | + | |
| 22 | +This round splits the results into 3 categories: | |
| 23 | + | |
| 24 | +- `batch_sweep` | |
| 25 | +  Fixed `concurrency=1`, comparing `batch_size=1/4/8/16/32/64` | |
| 26 | +- `concurrency_sweep` | |
| 27 | +  Fixed `batch_size=1`, comparing `concurrency=1/2/4/8/16/64` | |
| 28 | +- `batch x concurrency matrix` | |
| 29 | +  Combined load test, keeping only cases with `batch_size * concurrency <= 128` | |
| 30 | + | |
| 31 | +Common settings: | |
| 32 | +- Cache disabled: `--disable-cache` | |
| 33 | +- `batch_sweep`: `256` items per case | |
| 34 | +- `concurrency_sweep`: `32` requests per case | |
| 35 | +- `matrix`: `32` requests per case | |
| 36 | +- Warmup: `1` batch | |
| 37 | + | |
| 38 | +Reproduction command: | |
| 39 | + | |
| 40 | +```bash | |
| 41 | +cd /data/saas-search | |
| 42 | +./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ | |
| 43 | + --suite extended \ | |
| 44 | + --disable-cache \ | |
| 45 | + --serial-items-per-case 256 \ | |
| 46 | + --concurrency-requests-per-case 32 \ | |
| 47 | + --concurrency-batch-size 1 \ | |
| 48 | + --output-dir perf_reports/20260318/translation_local_models | |
| 49 | +``` | |
| 50 | + | |
| 51 | +## Key Results | |
| 52 | + | |
| 53 | +### 1. Single-stream batch sweep | |
| 54 | + | |
| 55 | +| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms | | |
| 56 | +|---|---|---:|---:|---:|---:| | |
| 57 | +| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` | | |
| 58 | +| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` | | |
| 59 | +| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` | | |
| 60 | +| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` | | |
| 61 | + | |
| 62 | +Interpretation: | |
| 63 | +- On raw throughput, all 4 directions peak at `batch_size=64` | |
| 64 | +- If more balanced per-batch latency also matters, `batch_size=16` is the better candidate for the default bulk configuration | |
| 65 | +- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh` | |
| 66 | + | |
| 67 | +### 2. Single-request concurrency sweep | |
| 68 | + | |
| 69 | +| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms | | |
| 70 | +|---|---|---:|---:|---:|---:| | |
| 71 | +| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` | | |
| 72 | +| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` | | |
| 73 | +| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` | | |
| 74 | +| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` | | |
| 75 | + | |
| 76 | +Interpretation: | |
| 77 | +- At `batch_size=1`, raising client concurrency barely raises throughput; it mostly converts waiting time into request latency | |
| 78 | +- Online query translation is better served by low concurrency; beyond `8` concurrent clients, p95 degrades clearly in all 4 directions | |
| 79 | +- The steadiest model for the online path is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh` | |
| 80 | + | |
| 81 | +### 3. batch x concurrency matrix | |
| 82 | + | |
| 83 | +Highest-throughput combinations (under the `batch_size * concurrency <= 128` constraint): | |
| 84 | + | |
| 85 | +| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms | | |
| 86 | +|---|---|---:|---:|---:|---:|---:| | |
| 87 | +| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` | | |
| 88 | +| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` | | |
| 89 | +| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` | | |
| 90 | +| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` | | |
| 91 | + | |
| 92 | +Interpretation: | |
| 93 | +- In the current implementation, throughput is driven by `batch_size`, not by `concurrency` | |
| 94 | +- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly inflates request latency | |
| 95 | +- This indicates the local translation service behaves like a "single worker + serial GPU processing" model; capacity planning cannot rely on client concurrency to buy throughput | |
| 96 | + | |
| 97 | +## Recommendation | |
| 98 | + | |
| 99 | +- For online query translation, read `concurrency_sweep` and treat `batch_size=1` as the primary metric | |
| 100 | +- For offline bulk translation, read `batch_sweep`; start from `batch_size=16` by default, then move to `32/64` only if the throughput target requires it | |
| 101 | +- As long as the current single-worker architecture stays, treat the allowed concurrency as a latency budget, not as a throughput-scaling lever | ... | ... |
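The single-model-lock behaviour the report attributes to the backend can be reproduced with a toy worker: when every request serializes on one lock, raising client concurrency leaves throughput roughly flat while inflating per-request latency. This is a simulated sketch only (the lock and the sleep stand in for the real backend; `translate_one` and `run_sweep` are illustrative names):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

GPU_LOCK = threading.Lock()  # stands in for the backend's per-model lock
STEP_SECONDS = 0.005         # simulated per-request GPU time


def translate_one(_text: str) -> float:
    """Simulate one request; return its wall-clock latency in seconds."""
    start = time.perf_counter()
    with GPU_LOCK:            # every request serializes here
        time.sleep(STEP_SECONDS)
    return time.perf_counter() - start


def run_sweep(concurrency: int, requests: int = 32) -> tuple[float, float]:
    """Return (requests_per_second, max_request_latency_seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(translate_one, ["x"] * requests))
    elapsed = time.perf_counter() - start
    return requests / elapsed, max(latencies)


rps_c1, lat_c1 = run_sweep(1)
rps_c8, lat_c8 = run_sweep(8)
# rps_c8 stays close to rps_c1, while lat_c8 grows with queue depth.
```

This mirrors the report's concurrency_sweep tables, where items/s is essentially constant from c=1 to c=64 while p95 grows roughly linearly.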
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs1.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs16.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs32.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs4.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs64.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs8.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs1.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs16.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs32.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs4.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs64.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs8.jsonl
0 → 100644
perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md
0 → 100644
| ... | ... | @@ -0,0 +1,263 @@ |
| 1 | +# Local Translation Model Extended Benchmark | |
| 2 | + | |
| 3 | +- Generated at: `2026-03-18T21:28:09` | |
| 4 | +- Suite: `extended` | |
| 5 | +- Python: `3.12.3` | |
| 6 | +- Torch: `2.10.0+cu128` | |
| 7 | +- Transformers: `5.3.0` | |
| 8 | +- CUDA: `True` | |
| 9 | +- GPU: `Tesla T4` (15.56 GiB) | |
| 10 | + | |
| 11 | +## Reading Guide | |
| 12 | + | |
| 13 | +- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes. | |
| 14 | +- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises. | |
| 15 | +- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured. | |
| 16 | + | |
| 17 | +## nllb-200-distilled-600m zh->en | |
| 18 | + | |
| 19 | +- Direction: `zh -> en` | |
| 20 | +- Column: `title_cn` | |
| 21 | +- Loaded rows: `2048` | |
| 22 | +- Load time: `6.118 s` | |
| 23 | +- Device: `cuda` | |
| 24 | +- DType: `torch.float16` | |
| 25 | +- Cache disabled: `True` | |
| 26 | + | |
| 27 | +### Batch Sweep (`concurrency=1`) | |
| 28 | + | |
| 29 | +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 30 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 31 | +| 1 | 256 | 256 | 2.91 | 2.91 | 343.48 | 315.98 | 616.27 | 2.633 | | |
| 32 | +| 4 | 256 | 64 | 8.44 | 2.11 | 474.17 | 474.75 | 722.95 | 2.659 | | |
| 33 | +| 8 | 256 | 32 | 14.85 | 1.86 | 538.68 | 564.97 | 728.47 | 2.699 | | |
| 34 | +| 16 | 256 | 16 | 27.28 | 1.7 | 586.59 | 633.16 | 769.18 | 2.765 | | |
| 35 | +| 32 | 256 | 8 | 38.6 | 1.21 | 829.04 | 761.74 | 1369.88 | 2.987 | | |
| 36 | +| 64 | 256 | 4 | 58.3 | 0.91 | 1097.73 | 919.35 | 1659.9 | 3.347 | | |
| 37 | + | |
| 38 | +### Concurrency Sweep (`batch_size=1`) | |
| 39 | + | |
| 40 | +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 41 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 42 | +| 1 | 32 | 32 | 4.17 | 4.17 | 239.99 | 226.34 | 373.27 | 3.347 | | |
| 43 | +| 2 | 32 | 32 | 4.1 | 4.1 | 477.99 | 459.36 | 703.96 | 3.347 | | |
| 44 | +| 4 | 32 | 32 | 4.1 | 4.1 | 910.74 | 884.71 | 1227.01 | 3.347 | | |
| 45 | +| 8 | 32 | 32 | 4.04 | 4.04 | 1697.73 | 1818.48 | 2383.8 | 3.347 | | |
| 46 | +| 16 | 32 | 32 | 4.07 | 4.07 | 2801.91 | 3473.63 | 4145.92 | 3.347 | | |
| 47 | +| 64 | 32 | 32 | 4.04 | 4.04 | 3714.49 | 3610.08 | 7337.3 | 3.347 | | |
| 48 | + | |
| 49 | +### Batch x Concurrency Matrix | |
| 50 | + | |
| 51 | +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 52 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 53 | +| 1 | 1 | 32 | 32 | 3.94 | 3.94 | 253.96 | 231.8 | 378.76 | 3.347 | | |
| 54 | +| 1 | 2 | 32 | 32 | 3.92 | 3.92 | 500.35 | 480.18 | 747.3 | 3.347 | | |
| 55 | +| 1 | 4 | 32 | 32 | 4.05 | 4.05 | 922.53 | 894.57 | 1235.7 | 3.347 | | |
| 56 | +| 1 | 8 | 32 | 32 | 4.08 | 4.08 | 1679.71 | 1806.95 | 2346.11 | 3.347 | | |
| 57 | +| 1 | 16 | 32 | 32 | 4.05 | 4.05 | 2816.51 | 3485.68 | 4181.39 | 3.347 | | |
| 58 | +| 1 | 64 | 32 | 32 | 4.06 | 4.06 | 3711.9 | 3614.34 | 7319.52 | 3.347 | | |
| 59 | +| 4 | 1 | 128 | 32 | 9.12 | 2.28 | 438.42 | 386.12 | 724.47 | 3.347 | | |
| 60 | +| 4 | 2 | 128 | 32 | 9.2 | 2.3 | 852.37 | 756.47 | 1366.7 | 3.347 | | |
| 61 | +| 4 | 4 | 128 | 32 | 9.17 | 2.29 | 1642.9 | 1524.56 | 2662.49 | 3.347 | | |
| 62 | +| 4 | 8 | 128 | 32 | 9.18 | 2.29 | 2974.7 | 3168.19 | 4594.99 | 3.347 | | |
| 63 | +| 4 | 16 | 128 | 32 | 9.23 | 2.31 | 4897.43 | 5815.91 | 8210.23 | 3.347 | | |
| 64 | +| 8 | 1 | 256 | 32 | 14.73 | 1.84 | 543.03 | 556.97 | 734.67 | 3.347 | | |
| 65 | +| 8 | 2 | 256 | 32 | 14.73 | 1.84 | 1064.16 | 1132.78 | 1405.53 | 3.347 | | |
| 66 | +| 8 | 4 | 256 | 32 | 14.79 | 1.85 | 2035.46 | 2237.69 | 2595.53 | 3.347 | | |
| 67 | +| 8 | 8 | 256 | 32 | 14.76 | 1.84 | 3763.14 | 4345.6 | 5025.02 | 3.347 | | |
| 68 | +| 8 | 16 | 256 | 32 | 14.77 | 1.85 | 6383.08 | 8053.36 | 9380.87 | 3.347 | | |
| 69 | +| 16 | 1 | 512 | 32 | 24.88 | 1.55 | 643.11 | 661.91 | 814.26 | 3.347 | | |
| 70 | +| 16 | 2 | 512 | 32 | 24.91 | 1.56 | 1265.17 | 1326.85 | 1576.96 | 3.347 | | |
| 71 | +| 16 | 4 | 512 | 32 | 24.94 | 1.56 | 2446.55 | 2594.26 | 3027.37 | 3.347 | | |
| 72 | +| 16 | 8 | 512 | 32 | 24.8 | 1.55 | 4548.92 | 5035.56 | 5852.87 | 3.347 | | |
| 73 | +| 32 | 1 | 1024 | 32 | 34.9 | 1.09 | 916.78 | 823.68 | 1637.59 | 3.347 | | |
| 74 | +| 32 | 2 | 1024 | 32 | 34.98 | 1.09 | 1807.39 | 1717.89 | 2514.27 | 3.347 | | |
| 75 | +| 32 | 4 | 1024 | 32 | 34.85 | 1.09 | 3513.42 | 3779.99 | 4750.34 | 3.347 | | |
| 76 | +| 64 | 1 | 2048 | 32 | 53.24 | 0.83 | 1202.07 | 993.59 | 1838.72 | 3.642 | | |
| 77 | +| 64 | 2 | 2048 | 32 | 53.95 | 0.84 | 2344.92 | 2465.53 | 3562.04 | 3.642 | | |
| 78 | + | |
| 79 | +## nllb-200-distilled-600m en->zh | |
| 80 | + | |
| 81 | +- Direction: `en -> zh` | |
| 82 | +- Column: `title` | |
| 83 | +- Loaded rows: `2048` | |
| 84 | +- Load time: `6.137 s` | |
| 85 | +- Device: `cuda` | |
| 86 | +- DType: `torch.float16` | |
| 87 | +- Cache disabled: `True` | |
| 88 | + | |
| 89 | +### Batch Sweep (`concurrency=1`) | |
| 90 | + | |
| 91 | +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 92 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 93 | +| 1 | 256 | 256 | 1.91 | 1.91 | 524.91 | 497.81 | 866.33 | 2.636 | | |
| 94 | +| 4 | 256 | 64 | 4.94 | 1.23 | 809.89 | 743.87 | 1599.74 | 2.684 | | |
| 95 | +| 8 | 256 | 32 | 8.25 | 1.03 | 969.5 | 796.64 | 1632.29 | 2.723 | | |
| 96 | +| 16 | 256 | 16 | 13.52 | 0.85 | 1183.29 | 1169.28 | 1649.65 | 2.817 | | |
| 97 | +| 32 | 256 | 8 | 21.27 | 0.66 | 1504.54 | 1647.92 | 1827.16 | 3.007 | | |
| 98 | +| 64 | 256 | 4 | 32.64 | 0.51 | 1961.02 | 1985.8 | 2031.25 | 3.408 | | |
| 99 | + | |
| 100 | +### Concurrency Sweep (`batch_size=1`) | |
| 101 | + | |
| 102 | +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 103 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 104 | +| 1 | 32 | 32 | 2.16 | 2.16 | 463.18 | 439.54 | 670.78 | 3.408 | | |
| 105 | +| 2 | 32 | 32 | 2.15 | 2.15 | 920.48 | 908.27 | 1213.3 | 3.408 | | |
| 106 | +| 4 | 32 | 32 | 2.16 | 2.16 | 1759.87 | 1771.58 | 2158.04 | 3.408 | | |
| 107 | +| 8 | 32 | 32 | 2.15 | 2.15 | 3284.44 | 3658.45 | 3971.01 | 3.408 | | |
| 108 | +| 16 | 32 | 32 | 2.14 | 2.14 | 5669.15 | 7117.7 | 7522.48 | 3.408 | | |
| 109 | +| 64 | 32 | 32 | 2.14 | 2.14 | 7631.14 | 7510.97 | 14139.03 | 3.408 | | |
| 110 | + | |
| 111 | +### Batch x Concurrency Matrix | |
| 112 | + | |
| 113 | +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 114 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 115 | +| 1 | 1 | 32 | 32 | 2.17 | 2.17 | 461.18 | 439.28 | 673.59 | 3.408 | | |
| 116 | +| 1 | 2 | 32 | 32 | 2.15 | 2.15 | 919.12 | 904.83 | 1209.83 | 3.408 | | |
| 117 | +| 1 | 4 | 32 | 32 | 2.14 | 2.14 | 1771.34 | 1810.99 | 2171.71 | 3.408 | | |
| 118 | +| 1 | 8 | 32 | 32 | 2.12 | 2.12 | 3324.85 | 3704.57 | 3999.29 | 3.408 | | |
| 119 | +| 1 | 16 | 32 | 32 | 2.15 | 2.15 | 5651.09 | 7121.8 | 7495.67 | 3.408 | | |
| 120 | +| 1 | 64 | 32 | 32 | 2.15 | 2.15 | 7619.16 | 7505.14 | 14111.53 | 3.408 | | |
| 121 | +| 4 | 1 | 128 | 32 | 5.13 | 1.28 | 780.0 | 661.74 | 1586.9 | 3.408 | | |
| 122 | +| 4 | 2 | 128 | 32 | 5.09 | 1.27 | 1551.81 | 1368.81 | 2775.92 | 3.408 | | |
| 123 | +| 4 | 4 | 128 | 32 | 5.13 | 1.28 | 2995.57 | 2836.77 | 4532.67 | 3.408 | | |
| 124 | +| 4 | 8 | 128 | 32 | 5.16 | 1.29 | 5619.47 | 6046.81 | 7767.22 | 3.408 | | |
| 125 | +| 4 | 16 | 128 | 32 | 5.15 | 1.29 | 9546.3 | 12146.68 | 14537.41 | 3.408 | | |
| 126 | +| 8 | 1 | 256 | 32 | 8.36 | 1.04 | 957.45 | 784.85 | 1591.01 | 3.408 | | |
| 127 | +| 8 | 2 | 256 | 32 | 8.29 | 1.04 | 1906.56 | 1847.53 | 2799.1 | 3.408 | | |
| 128 | +| 8 | 4 | 256 | 32 | 8.3 | 1.04 | 3691.59 | 3773.67 | 4849.58 | 3.408 | | |
| 129 | +| 8 | 8 | 256 | 32 | 8.29 | 1.04 | 6954.22 | 7681.7 | 8904.0 | 3.408 | | |
| 130 | +| 8 | 16 | 256 | 32 | 8.31 | 1.04 | 11923.83 | 14903.17 | 17072.19 | 3.408 | | |
| 131 | +| 16 | 1 | 512 | 32 | 13.83 | 0.86 | 1156.84 | 904.4 | 1609.17 | 3.408 | | |
| 132 | +| 16 | 2 | 512 | 32 | 13.91 | 0.87 | 2276.35 | 2356.15 | 3121.45 | 3.408 | | |
| 133 | +| 16 | 4 | 512 | 32 | 13.89 | 0.87 | 4440.77 | 4733.34 | 5488.65 | 3.408 | | |
| 134 | +| 16 | 8 | 512 | 32 | 13.83 | 0.86 | 8428.17 | 9407.5 | 10409.59 | 3.408 | | |
| 135 | +| 32 | 1 | 1024 | 32 | 23.28 | 0.73 | 1374.39 | 1603.35 | 1806.99 | 3.408 | | |
| 136 | +| 32 | 2 | 1024 | 32 | 23.26 | 0.73 | 2721.28 | 2607.91 | 3437.08 | 3.408 | | |
| 137 | +| 32 | 4 | 1024 | 32 | 23.18 | 0.72 | 5297.48 | 5234.53 | 6758.76 | 3.408 | | |
| 138 | +| 64 | 1 | 2048 | 32 | 34.97 | 0.55 | 1829.91 | 1899.54 | 2039.18 | 3.69 | | |
| 139 | +| 64 | 2 | 2048 | 32 | 34.78 | 0.54 | 3615.7 | 3795.36 | 4014.19 | 3.69 | | |
| 140 | + | |
| 141 | +## opus-mt-zh-en zh->en | |
| 142 | + | |
| 143 | +- Direction: `zh -> en` | |
| 144 | +- Column: `title_cn` | |
| 145 | +- Loaded rows: `2048` | |
| 146 | +- Load time: `3.2561 s` | |
| 147 | +- Device: `cuda` | |
| 148 | +- DType: `torch.float16` | |
| 149 | +- Cache disabled: `True` | |
| 150 | + | |
| 151 | +### Batch Sweep (`concurrency=1`) | |
| 152 | + | |
| 153 | +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 154 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 155 | +| 1 | 256 | 256 | 6.15 | 6.15 | 162.53 | 145.96 | 274.74 | 0.285 | | |
| 156 | +| 4 | 256 | 64 | 15.34 | 3.83 | 260.77 | 231.11 | 356.0 | 0.306 | | |
| 157 | +| 8 | 256 | 32 | 25.51 | 3.19 | 313.61 | 271.09 | 379.84 | 0.324 | | |
| 158 | +| 16 | 256 | 16 | 41.44 | 2.59 | 386.06 | 286.4 | 797.93 | 0.366 | | |
| 159 | +| 32 | 256 | 8 | 54.36 | 1.7 | 588.7 | 330.13 | 1693.31 | 0.461 | | |
| 160 | +| 64 | 256 | 4 | 70.15 | 1.1 | 912.33 | 447.93 | 2161.59 | 0.642 | | |
| 161 | + | |
| 162 | +### Concurrency Sweep (`batch_size=1`) | |
| 163 | + | |
| 164 | +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 165 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 166 | +| 1 | 32 | 32 | 9.21 | 9.21 | 108.53 | 91.7 | 179.12 | 0.642 | | |
| 167 | +| 2 | 32 | 32 | 8.92 | 8.92 | 219.19 | 212.29 | 305.34 | 0.642 | | |
| 168 | +| 4 | 32 | 32 | 9.09 | 9.09 | 411.76 | 420.08 | 583.97 | 0.642 | | |
| 169 | +| 8 | 32 | 32 | 8.85 | 8.85 | 784.14 | 835.73 | 1043.06 | 0.642 | | |
| 170 | +| 16 | 32 | 32 | 9.01 | 9.01 | 1278.4 | 1483.34 | 1994.56 | 0.642 | | |
| 171 | +| 64 | 32 | 32 | 8.82 | 8.82 | 1687.08 | 1563.48 | 3381.58 | 0.642 | | |
| 172 | + | |
| 173 | +### Batch x Concurrency Matrix | |
| 174 | + | |
| 175 | +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 176 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 177 | +| 1 | 1 | 32 | 32 | 9.01 | 9.01 | 110.96 | 97.37 | 191.91 | 0.642 | | |
| 178 | +| 1 | 2 | 32 | 32 | 9.2 | 9.2 | 212.43 | 191.27 | 299.65 | 0.642 | | |
| 179 | +| 1 | 4 | 32 | 32 | 8.86 | 8.86 | 423.8 | 423.61 | 596.78 | 0.642 | | |
| 180 | +| 1 | 8 | 32 | 32 | 9.04 | 9.04 | 764.39 | 810.05 | 1035.74 | 0.642 | | |
| 181 | +| 1 | 16 | 32 | 32 | 9.06 | 9.06 | 1272.52 | 1468.35 | 1998.99 | 0.642 | | |
| 182 | +| 1 | 64 | 32 | 32 | 9.04 | 9.04 | 1663.7 | 1556.13 | 3296.1 | 0.642 | | |
| 183 | +| 4 | 1 | 128 | 32 | 15.17 | 3.79 | 263.61 | 201.11 | 369.93 | 0.642 | | |
| 184 | +| 4 | 2 | 128 | 32 | 15.25 | 3.81 | 517.25 | 397.13 | 1319.45 | 0.642 | | |
| 185 | +| 4 | 4 | 128 | 32 | 15.27 | 3.82 | 899.61 | 746.76 | 1914.27 | 0.642 | | |
| 186 | +| 4 | 8 | 128 | 32 | 15.39 | 3.85 | 1539.87 | 1466.41 | 2918.85 | 0.642 | | |
| 187 | +| 4 | 16 | 128 | 32 | 15.48 | 3.87 | 2455.89 | 2728.17 | 4644.99 | 0.642 | | |
| 188 | +| 8 | 1 | 256 | 32 | 25.44 | 3.18 | 314.41 | 272.16 | 386.05 | 0.642 | | |
| 189 | +| 8 | 2 | 256 | 32 | 25.42 | 3.18 | 620.41 | 541.88 | 1418.79 | 0.642 | | |
| 190 | +| 8 | 4 | 256 | 32 | 24.96 | 3.12 | 1226.44 | 1070.41 | 2940.56 | 0.642 | | |
| 191 | +| 8 | 8 | 256 | 32 | 25.44 | 3.18 | 2252.81 | 2148.42 | 4143.66 | 0.642 | | |
| 192 | +| 8 | 16 | 256 | 32 | 25.44 | 3.18 | 3961.06 | 5028.31 | 6340.79 | 0.642 | | |
| 193 | +| 16 | 1 | 512 | 32 | 31.66 | 1.98 | 505.38 | 321.69 | 1963.58 | 0.653 | | |
| 194 | +| 16 | 2 | 512 | 32 | 31.41 | 1.96 | 1009.12 | 677.84 | 2341.01 | 0.653 | | |
| 195 | +| 16 | 4 | 512 | 32 | 31.17 | 1.95 | 2006.54 | 1562.68 | 4299.85 | 0.653 | | |
| 196 | +| 16 | 8 | 512 | 32 | 31.51 | 1.97 | 3483.96 | 4085.09 | 5902.32 | 0.653 | | |
| 197 | +| 32 | 1 | 1024 | 32 | 38.66 | 1.21 | 827.7 | 409.78 | 2162.48 | 0.748 | | |
| 198 | +| 32 | 2 | 1024 | 32 | 38.75 | 1.21 | 1613.28 | 916.1 | 3228.1 | 0.748 | | |
| 199 | +| 32 | 4 | 1024 | 32 | 38.66 | 1.21 | 3162.05 | 3239.36 | 4992.07 | 0.748 | | |
| 200 | +| 64 | 1 | 2048 | 32 | 52.44 | 0.82 | 1220.35 | 533.4 | 2508.12 | 0.939 | | |
| 201 | +| 64 | 2 | 2048 | 32 | 52.04 | 0.81 | 2443.4 | 2800.35 | 4765.45 | 0.939 | | |
| 202 | + | |
| 203 | +## opus-mt-en-zh en->zh | |
| 204 | + | |
| 205 | +- Direction: `en -> zh` | |
| 206 | +- Column: `title` | |
| 207 | +- Loaded rows: `2048` | |
| 208 | +- Load time: `3.1612 s` | |
| 209 | +- Device: `cuda` | |
| 210 | +- DType: `torch.float16` | |
| 211 | +- Cache disabled: `True` | |
| 212 | + | |
| 213 | +### Batch Sweep (`concurrency=1`) | |
| 214 | + | |
| 215 | +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 216 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 217 | +| 1 | 256 | 256 | 4.53 | 4.53 | 220.59 | 177.37 | 411.57 | 0.285 | | |
| 218 | +| 4 | 256 | 64 | 10.12 | 2.53 | 395.37 | 319.23 | 761.49 | 0.307 | | |
| 219 | +| 8 | 256 | 32 | 14.63 | 1.83 | 546.88 | 390.81 | 1930.85 | 0.326 | | |
| 220 | +| 16 | 256 | 16 | 24.33 | 1.52 | 657.6 | 451.22 | 2098.54 | 0.37 | | |
| 221 | +| 32 | 256 | 8 | 33.91 | 1.06 | 943.59 | 603.45 | 2152.28 | 0.462 | | |
| 222 | +| 64 | 256 | 4 | 42.47 | 0.66 | 1507.01 | 1531.43 | 2371.85 | 0.644 | | |
| 223 | + | |
| 224 | +### Concurrency Sweep (`batch_size=1`) | |
| 225 | + | |
| 226 | +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 227 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 228 | +| 1 | 32 | 32 | 3.6 | 3.6 | 277.73 | 145.85 | 1180.37 | 0.644 | | |
| 229 | +| 2 | 32 | 32 | 3.55 | 3.55 | 559.38 | 346.71 | 1916.96 | 0.644 | | |
| 230 | +| 4 | 32 | 32 | 3.53 | 3.53 | 997.71 | 721.04 | 2944.17 | 0.644 | | |
| 231 | +| 8 | 32 | 32 | 3.51 | 3.51 | 1644.28 | 1590.93 | 3632.99 | 0.644 | | |
| 232 | +| 16 | 32 | 32 | 3.5 | 3.5 | 2600.18 | 2586.34 | 5554.04 | 0.644 | | |
| 233 | +| 64 | 32 | 32 | 3.52 | 3.52 | 3366.52 | 2780.0 | 7950.41 | 0.644 | | |
| 234 | + | |
| 235 | +### Batch x Concurrency Matrix | |
| 236 | + | |
| 237 | +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | | |
| 238 | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 239 | +| 1 | 1 | 32 | 32 | 3.54 | 3.54 | 282.22 | 148.26 | 1200.93 | 0.644 | | |
| 240 | +| 1 | 2 | 32 | 32 | 3.52 | 3.52 | 563.85 | 346.8 | 1957.63 | 0.644 | | |
| 241 | +| 1 | 4 | 32 | 32 | 3.52 | 3.52 | 1002.79 | 728.36 | 2949.99 | 0.644 | | |
| 242 | +| 1 | 8 | 32 | 32 | 3.55 | 3.55 | 1611.52 | 1462.84 | 3653.34 | 0.644 | | |
| 243 | +| 1 | 16 | 32 | 32 | 3.49 | 3.49 | 2582.77 | 2594.49 | 5504.07 | 0.644 | | |
| 244 | +| 1 | 64 | 32 | 32 | 3.53 | 3.53 | 3327.65 | 2753.11 | 7907.54 | 0.644 | | |
| 245 | +| 4 | 1 | 128 | 32 | 10.31 | 2.58 | 387.75 | 329.42 | 675.07 | 0.644 | | |
| 246 | +| 4 | 2 | 128 | 32 | 10.12 | 2.53 | 784.49 | 713.61 | 1561.27 | 0.644 | | |
| 247 | +| 4 | 4 | 128 | 32 | 10.08 | 2.52 | 1543.42 | 1438.42 | 2735.85 | 0.644 | | |
| 248 | +| 4 | 8 | 128 | 32 | 10.09 | 2.52 | 2918.06 | 2954.39 | 4327.47 | 0.644 | | |
| 249 | +| 4 | 16 | 128 | 32 | 10.16 | 2.54 | 5087.62 | 5734.85 | 7343.34 | 0.644 | | |
| 250 | +| 8 | 1 | 256 | 32 | 14.49 | 1.81 | 552.16 | 402.18 | 1963.8 | 0.644 | | |
| 251 | +| 8 | 2 | 256 | 32 | 14.54 | 1.82 | 1088.96 | 845.04 | 2565.11 | 0.644 | | |
| 252 | +| 8 | 4 | 256 | 32 | 14.48 | 1.81 | 2143.18 | 1832.69 | 4487.97 | 0.644 | | |
| 253 | +| 8 | 8 | 256 | 32 | 14.51 | 1.81 | 4051.04 | 3734.16 | 5976.18 | 0.644 | | |
| 254 | +| 8 | 16 | 256 | 32 | 14.55 | 1.82 | 6626.6 | 6868.91 | 9395.9 | 0.644 | | |
| 255 | +| 16 | 1 | 512 | 32 | 19.11 | 1.19 | 837.36 | 457.51 | 2030.2 | 0.661 | | |
| 256 | +| 16 | 2 | 512 | 32 | 19.26 | 1.2 | 1599.87 | 1141.06 | 2802.44 | 0.661 | | |
| 257 | +| 16 | 4 | 512 | 32 | 19.1 | 1.19 | 3130.81 | 3166.45 | 4987.49 | 0.661 | | |
| 258 | +| 16 | 8 | 512 | 32 | 19.27 | 1.2 | 5735.13 | 5351.82 | 8417.85 | 0.661 | | |
| 259 | +| 32 | 1 | 1024 | 32 | 26.04 | 0.81 | 1228.66 | 648.06 | 2167.39 | 0.751 | | |
| 260 | +| 32 | 2 | 1024 | 32 | 26.21 | 0.82 | 2372.52 | 2593.74 | 4206.96 | 0.751 | | |
| 261 | +| 32 | 4 | 1024 | 32 | 26.11 | 0.82 | 4499.96 | 3977.65 | 6920.57 | 0.751 | | |
| 262 | +| 64 | 1 | 2048 | 32 | 34.94 | 0.55 | 1831.48 | 2366.24 | 2473.74 | 0.942 | | |
| 263 | +| 64 | 2 | 2048 | 32 | 34.75 | 0.54 | 3603.99 | 3281.6 | 4948.72 | 0.942 | | ... | ... |
scripts/benchmark_translation_local_models.py
| ... | ... | @@ -4,6 +4,7 @@ |
| 4 | 4 | from __future__ import annotations |
| 5 | 5 | |
| 6 | 6 | import argparse |
| 7 | +import concurrent.futures | |
| 7 | 8 | import copy |
| 8 | 9 | import csv |
| 9 | 10 | import json |
| ... | ... | @@ -16,7 +17,7 @@ import sys |
| 16 | 17 | import time |
| 17 | 18 | from datetime import datetime |
| 18 | 19 | from pathlib import Path |
| 19 | -from typing import Any, Dict, Iterable, List | |
| 20 | +from typing import Any, Dict, Iterable, List, Sequence | |
| 20 | 21 | |
| 21 | 22 | import torch |
| 22 | 23 | import transformers |
| ... | ... | @@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402 |
| 30 | 31 | from translation.settings import get_translation_capability # noqa: E402 |
| 31 | 32 | |
| 32 | 33 | |
| 34 | +DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64] | |
| 35 | +DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64] | |
| 36 | + | |
| 33 | 37 | SCENARIOS: List[Dict[str, str]] = [ |
| 34 | 38 | { |
| 35 | 39 | "name": "nllb-200-distilled-600m zh->en", |
| ... | ... | @@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [ |
| 69 | 73 | def parse_args() -> argparse.Namespace: |
| 70 | 74 | parser = argparse.ArgumentParser(description="Benchmark local translation models") |
| 71 | 75 | parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path") |
| 72 | - parser.add_argument("--limit", type=int, default=0, help="Limit rows for faster experiments; 0 means all") | |
| 76 | + parser.add_argument("--limit", type=int, default=0, help="Limit rows for baseline or single-case run; 0 means all") | |
| 73 | 77 | parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports") |
| 74 | 78 | parser.add_argument("--single", action="store_true", help="Run a single scenario in-process") |
| 75 | 79 | parser.add_argument("--model", default="", help="Model name for --single mode") |
| ... | ... | @@ -84,9 +88,67 @@ def parse_args() -> argparse.Namespace: |
| 84 | 88 | parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams") |
| 85 | 89 | parser.add_argument("--attn-implementation", default="", help="Override attention implementation, for example sdpa") |
| 86 | 90 | parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring") |
| 91 | + parser.add_argument("--disable-cache", action="store_true", help="Disable translation cache during benchmarks") | |
| 92 | + parser.add_argument( | |
| 93 | + "--suite", | |
| 94 | + choices=["baseline", "extended"], | |
| 95 | + default="baseline", | |
| 96 | + help="baseline keeps the previous all-scenarios summary; extended adds batch/concurrency/matrix sweeps", | |
| 97 | + ) | |
| 98 | + parser.add_argument( | |
| 99 | + "--batch-size-list", | |
| 100 | + default="", | |
| 101 | + help="Comma-separated batch sizes for extended suite; default 1,4,8,16,32,64", | |
| 102 | + ) | |
| 103 | + parser.add_argument( | |
| 104 | + "--concurrency-list", | |
| 105 | + default="", | |
| 106 | + help="Comma-separated concurrency levels for extended suite; default 1,2,4,8,16,64", | |
| 107 | + ) | |
| 108 | + parser.add_argument( | |
| 109 | + "--serial-items-per-case", | |
| 110 | + type=int, | |
| 111 | + default=512, | |
| 112 | + help="Items per batch-size case in extended suite", | |
| 113 | + ) | |
| 114 | + parser.add_argument( | |
| 115 | + "--concurrency-requests-per-case", | |
| 116 | + type=int, | |
| 117 | + default=128, | |
| 118 | + help="Requests per concurrency or matrix case in extended suite", | |
| 119 | + ) | |
| 120 | + parser.add_argument( | |
| 121 | + "--concurrency-batch-size", | |
| 122 | + type=int, | |
| 123 | + default=1, | |
| 124 | + help="Batch size used by the dedicated concurrency sweep", | |
| 125 | + ) | |
| 126 | + parser.add_argument( | |
| 127 | + "--max-batch-concurrency-product", | |
| 128 | + type=int, | |
| 129 | + default=128, | |
| 130 | + help="Skip matrix cases where batch_size * concurrency exceeds this value; 0 disables the limit", | |
| 131 | + ) | |
| 87 | 132 | return parser.parse_args() |
| 88 | 133 | |
| 89 | 134 | |
| 135 | +def parse_csv_ints(raw: str, fallback: Sequence[int]) -> List[int]: | |
| 136 | + if not raw.strip(): | |
| 137 | + return list(fallback) | |
| 138 | + values: List[int] = [] | |
| 139 | + for item in raw.split(","): | |
| 140 | + stripped = item.strip() | |
| 141 | + if not stripped: | |
| 142 | + continue | |
| 143 | + value = int(stripped) | |
| 144 | + if value <= 0: | |
| 145 | + raise ValueError(f"Expected positive integer, got {value}") | |
| 146 | + values.append(value) | |
| 147 | + if not values: | |
| 148 | + raise ValueError("Parsed empty integer list") | |
| 149 | + return values | |
| 150 | + | |
| 151 | + | |
| 90 | 152 | def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: |
| 91 | 153 | texts: List[str] = [] |
| 92 | 154 | with csv_path.open("r", encoding="utf-8") as handle: |
| ... | ... | @@ -102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: |
| 102 | 164 | return texts |
| 103 | 165 | |
| 104 | 166 | |
| 105 | -def batched(values: List[str], batch_size: int) -> Iterable[List[str]]: | |
| 167 | +def batched(values: Sequence[str], batch_size: int) -> Iterable[List[str]]: | |
| 106 | 168 | for start in range(0, len(values), batch_size): |
| 107 | - yield values[start:start + batch_size] | |
| 169 | + yield list(values[start:start + batch_size]) | |
| 108 | 170 | |
| 109 | 171 | |
| 110 | 172 | def percentile(values: List[float], p: float) -> float: |
| ... | ... | @@ -148,15 +210,34 @@ def build_environment_info() -> Dict[str, Any]: |
| 148 | 210 | } |
| 149 | 211 | |
| 150 | 212 | |
| 151 | -def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: | |
| 152 | - csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path) | |
| 213 | +def scenario_from_args(args: argparse.Namespace) -> Dict[str, str]: | |
| 214 | + return { | |
| 215 | + "name": f"{args.model} {args.source_lang}->{args.target_lang}", | |
| 216 | + "model": args.model, | |
| 217 | + "source_lang": args.source_lang, | |
| 218 | + "target_lang": args.target_lang, | |
| 219 | + "column": args.column, | |
| 220 | + "scene": args.scene, | |
| 221 | + } | |
| 222 | + | |
| 223 | + | |
| 224 | +def build_config_and_capability( | |
| 225 | + args: argparse.Namespace, | |
| 226 | + *, | |
| 227 | + batch_size_override: int | None = None, | |
| 228 | +) -> tuple[Dict[str, Any], Dict[str, Any]]: | |
| 153 | 229 | config = copy.deepcopy(get_translation_config()) |
| 230 | + for name, cfg in config["capabilities"].items(): | |
| 231 | + cfg["enabled"] = name == args.model | |
| 232 | + config["default_model"] = args.model | |
| 154 | 233 | capability = get_translation_capability(config, args.model, require_enabled=False) |
| 155 | 234 | if args.device_override: |
| 156 | 235 | capability["device"] = args.device_override |
| 157 | 236 | if args.torch_dtype_override: |
| 158 | 237 | capability["torch_dtype"] = args.torch_dtype_override |
| 159 | - if args.batch_size: | |
| 238 | + if batch_size_override is not None: | |
| 239 | + capability["batch_size"] = batch_size_override | |
| 240 | + elif args.batch_size: | |
| 160 | 241 | capability["batch_size"] = args.batch_size |
| 161 | 242 | if args.max_new_tokens: |
| 162 | 243 | capability["max_new_tokens"] = args.max_new_tokens |
| ... | ... | @@ -164,28 +245,59 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: |
| 164 | 245 | capability["num_beams"] = args.num_beams |
| 165 | 246 | if args.attn_implementation: |
| 166 | 247 | capability["attn_implementation"] = args.attn_implementation |
| 248 | + if args.disable_cache: | |
| 249 | + capability["use_cache"] = False | |
| 167 | 250 | config["capabilities"][args.model] = capability |
| 168 | - configured_batch_size = int(capability.get("batch_size") or 1) | |
| 169 | - batch_size = configured_batch_size | |
| 170 | - texts = load_texts(csv_path, args.column, args.limit) | |
| 251 | + return config, capability | |
| 171 | 252 | |
| 172 | - service = TranslationService(config) | |
| 253 | + | |
| 254 | +def ensure_cuda_stats_reset() -> None: | |
| 173 | 255 | if torch.cuda.is_available(): |
| 174 | 256 | torch.cuda.empty_cache() |
| 175 | 257 | torch.cuda.reset_peak_memory_stats() |
| 176 | 258 | |
| 177 | - load_start = time.perf_counter() | |
| 178 | - backend = service.get_backend(args.model) | |
| 179 | - load_seconds = time.perf_counter() - load_start | |
| 180 | 259 | |
| 181 | - warmup_batches = min(max(args.warmup_batches, 0), max(1, math.ceil(len(texts) / batch_size))) | |
| 182 | - for batch in list(batched(texts, batch_size))[:warmup_batches]: | |
| 260 | +def build_memory_metrics() -> Dict[str, Any]: | |
| 261 | + peak_gpu_mem_gb = None | |
| 262 | + peak_gpu_reserved_gb = None | |
| 263 | + if torch.cuda.is_available(): | |
| 264 | + peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3) | |
| 265 | + peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3) | |
| 266 | + max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2) | |
| 267 | + return { | |
| 268 | + "max_rss_mb": max_rss_mb, | |
| 269 | + "peak_gpu_memory_gb": peak_gpu_mem_gb, | |
| 270 | + "peak_gpu_reserved_gb": peak_gpu_reserved_gb, | |
| 271 | + } | |
| 272 | + | |
| 273 | + | |
| 274 | +def make_request_payload(batch: Sequence[str]) -> str | List[str]: | |
| 275 | + if len(batch) == 1: | |
| 276 | + return batch[0] | |
| 277 | + return list(batch) | |
| 278 | + | |
| 279 | + | |
| 280 | +def benchmark_serial_case( | |
| 281 | + *, | |
| 282 | + service: TranslationService, | |
| 283 | + backend: Any, | |
| 284 | + scenario: Dict[str, str], | |
| 285 | + capability: Dict[str, Any], | |
| 286 | + texts: List[str], | |
| 287 | + batch_size: int, | |
| 288 | + warmup_batches: int, | |
| 289 | +) -> Dict[str, Any]: | |
| 290 | +    backend.batch_size = batch_size | |
| 291 | +    ensure_cuda_stats_reset()  # reset peak stats so per-case GPU memory is not cumulative across cases | |
| 292 | +    measured_batches = list(batched(texts, batch_size)) | |
| 292 | + warmup_count = min(max(warmup_batches, 0), len(measured_batches)) | |
| 293 | + | |
| 294 | + for batch in measured_batches[:warmup_count]: | |
| 183 | 295 | service.translate( |
| 184 | - text=batch, | |
| 185 | - source_lang=args.source_lang, | |
| 186 | - target_lang=args.target_lang, | |
| 187 | - model=args.model, | |
| 188 | - scene=args.scene, | |
| 296 | + text=make_request_payload(batch), | |
| 297 | + source_lang=scenario["source_lang"], | |
| 298 | + target_lang=scenario["target_lang"], | |
| 299 | + model=scenario["model"], | |
| 300 | + scene=scenario["scene"], | |
| 189 | 301 | ) |
| 190 | 302 | |
| 191 | 303 | batch_latencies_ms: List[float] = [] |
| ... | ... | @@ -193,86 +305,318 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: |
| 193 | 305 | failure_count = 0 |
| 194 | 306 | output_chars = 0 |
| 195 | 307 | total_input_chars = sum(len(text) for text in texts) |
| 196 | - measured_batches = list(batched(texts, batch_size)) | |
| 197 | 308 | |
| 198 | 309 | start = time.perf_counter() |
| 199 | 310 | for batch in measured_batches: |
| 200 | 311 | batch_start = time.perf_counter() |
| 201 | 312 | outputs = service.translate( |
| 202 | - text=batch, | |
| 203 | - source_lang=args.source_lang, | |
| 204 | - target_lang=args.target_lang, | |
| 205 | - model=args.model, | |
| 206 | - scene=args.scene, | |
| 313 | + text=make_request_payload(batch), | |
| 314 | + source_lang=scenario["source_lang"], | |
| 315 | + target_lang=scenario["target_lang"], | |
| 316 | + model=scenario["model"], | |
| 317 | + scene=scenario["scene"], | |
| 207 | 318 | ) |
| 208 | 319 | elapsed_ms = (time.perf_counter() - batch_start) * 1000 |
| 209 | 320 | batch_latencies_ms.append(elapsed_ms) |
| 210 | 321 | |
| 211 | - if not isinstance(outputs, list): | |
| 212 | - raise RuntimeError(f"Expected list output for batch translation, got {type(outputs)!r}") | |
| 213 | - for item in outputs: | |
| 322 | + if isinstance(outputs, list): | |
| 323 | + result_items = outputs | |
| 324 | + else: | |
| 325 | + result_items = [outputs] | |
| 326 | + for item in result_items: | |
| 214 | 327 | if item is None: |
| 215 | 328 | failure_count += 1 |
| 216 | 329 | else: |
| 217 | 330 | success_count += 1 |
| 218 | 331 | output_chars += len(item) |
| 219 | 332 | translate_seconds = time.perf_counter() - start |
| 333 | + total_items = len(texts) | |
| 334 | + memory = build_memory_metrics() | |
| 220 | 335 | |
| 221 | - peak_gpu_mem_gb = None | |
| 222 | - peak_gpu_reserved_gb = None | |
| 223 | - if torch.cuda.is_available(): | |
| 224 | - peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3) | |
| 225 | - peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3) | |
| 336 | + return { | |
| 337 | + "mode": "serial_batch", | |
| 338 | + "batch_size": batch_size, | |
| 339 | + "concurrency": 1, | |
| 340 | + "rows": total_items, | |
| 341 | + "requests": len(measured_batches), | |
| 342 | + "input_chars": total_input_chars, | |
| 343 | + "load_seconds": 0.0, | |
| 344 | + "translate_seconds": round(translate_seconds, 4), | |
| 345 | + "total_seconds": round(translate_seconds, 4), | |
| 346 | + "batch_count": len(batch_latencies_ms), | |
| 347 | + "request_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2), | |
| 348 | + "request_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2), | |
| 349 | + "request_latency_max_ms": round(max(batch_latencies_ms), 2), | |
| 350 | + "avg_request_latency_ms": round(statistics.fmean(batch_latencies_ms), 2), | |
| 351 | + "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3), | |
| 352 | + "requests_per_second": round(len(measured_batches) / translate_seconds, 2), | |
| 353 | + "items_per_second": round(total_items / translate_seconds, 2), | |
| 354 | + "input_chars_per_second": round(total_input_chars / translate_seconds, 2), | |
| 355 | + "output_chars_per_second": round(output_chars / translate_seconds, 2), | |
| 356 | + "success_count": success_count, | |
| 357 | + "failure_count": failure_count, | |
| 358 | + "success_rate": round(success_count / total_items, 6), | |
| 359 | + "device": str(getattr(backend, "device", capability.get("device", "unknown"))), | |
| 360 | + "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))), | |
| 361 | + "configured_batch_size": int(capability.get("batch_size") or batch_size), | |
| 362 | + "used_batch_size": batch_size, | |
| 363 | + "warmup_batches": warmup_count, | |
| 364 | + **memory, | |
| 365 | + } | |
| 226 | 366 | |
| 227 | - max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2) | |
| 228 | - total_items = len(texts) | |
| 367 | + | |
| 368 | +def benchmark_concurrency_case( | |
| 369 | + *, | |
| 370 | + service: TranslationService, | |
| 371 | + backend: Any, | |
| 372 | + scenario: Dict[str, str], | |
| 373 | + capability: Dict[str, Any], | |
| 374 | + texts: List[str], | |
| 375 | + batch_size: int, | |
| 376 | + concurrency: int, | |
| 377 | + requests_per_case: int, | |
| 378 | + warmup_batches: int, | |
| 379 | +) -> Dict[str, Any]: | |
| 380 | +    backend.batch_size = batch_size | |
| 381 | +    ensure_cuda_stats_reset()  # reset peak stats so per-case GPU memory is not cumulative across cases | |
| 382 | +    required_items = batch_size * requests_per_case | |
| 382 | + case_texts = texts[:required_items] | |
| 383 | + request_batches = list(batched(case_texts, batch_size)) | |
| 384 | + if not request_batches: | |
| 385 | + raise ValueError("No request batches prepared for concurrency benchmark") | |
| 386 | + warmup_count = min(max(warmup_batches, 0), len(request_batches)) | |
| 387 | + | |
| 388 | + for batch in request_batches[:warmup_count]: | |
| 389 | + service.translate( | |
| 390 | + text=make_request_payload(batch), | |
| 391 | + source_lang=scenario["source_lang"], | |
| 392 | + target_lang=scenario["target_lang"], | |
| 393 | + model=scenario["model"], | |
| 394 | + scene=scenario["scene"], | |
| 395 | + ) | |
| 396 | + | |
| 397 | + request_latencies_ms: List[float] = [] | |
| 398 | + success_count = 0 | |
| 399 | + failure_count = 0 | |
| 400 | + output_chars = 0 | |
| 401 | + total_input_chars = sum(len(text) for text in case_texts) | |
| 402 | + | |
| 403 | + def worker(batch: List[str]) -> tuple[float, int, int, int]: | |
| 404 | + started = time.perf_counter() | |
| 405 | + outputs = service.translate( | |
| 406 | + text=make_request_payload(batch), | |
| 407 | + source_lang=scenario["source_lang"], | |
| 408 | + target_lang=scenario["target_lang"], | |
| 409 | + model=scenario["model"], | |
| 410 | + scene=scenario["scene"], | |
| 411 | + ) | |
| 412 | + elapsed_ms = (time.perf_counter() - started) * 1000 | |
| 413 | + if isinstance(outputs, list): | |
| 414 | + result_items = outputs | |
| 415 | + else: | |
| 416 | + result_items = [outputs] | |
| 417 | + local_success = 0 | |
| 418 | + local_failure = 0 | |
| 419 | + local_output_chars = 0 | |
| 420 | + for item in result_items: | |
| 421 | + if item is None: | |
| 422 | + local_failure += 1 | |
| 423 | + else: | |
| 424 | + local_success += 1 | |
| 425 | + local_output_chars += len(item) | |
| 426 | + return elapsed_ms, local_success, local_failure, local_output_chars | |
| 427 | + | |
| 428 | + wall_start = time.perf_counter() | |
| 429 | + with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor: | |
| 430 | + futures = [executor.submit(worker, batch) for batch in request_batches] | |
| 431 | + for future in concurrent.futures.as_completed(futures): | |
| 432 | + latency_ms, local_success, local_failure, local_output_chars = future.result() | |
| 433 | + request_latencies_ms.append(latency_ms) | |
| 434 | + success_count += local_success | |
| 435 | + failure_count += local_failure | |
| 436 | + output_chars += local_output_chars | |
| 437 | + wall_seconds = time.perf_counter() - wall_start | |
| 438 | + total_items = len(case_texts) | |
| 439 | + memory = build_memory_metrics() | |
| 229 | 440 | |
| 230 | 441 | return { |
| 231 | - "scenario": { | |
| 232 | - "name": f"{args.model} {args.source_lang}->{args.target_lang}", | |
| 233 | - "model": args.model, | |
| 234 | - "source_lang": args.source_lang, | |
| 235 | - "target_lang": args.target_lang, | |
| 236 | - "column": args.column, | |
| 237 | - "scene": args.scene, | |
| 442 | + "mode": "concurrency", | |
| 443 | + "batch_size": batch_size, | |
| 444 | + "concurrency": concurrency, | |
| 445 | + "rows": total_items, | |
| 446 | + "requests": len(request_batches), | |
| 447 | + "input_chars": total_input_chars, | |
| 448 | + "load_seconds": 0.0, | |
| 449 | + "translate_seconds": round(wall_seconds, 4), | |
| 450 | + "total_seconds": round(wall_seconds, 4), | |
| 451 | + "batch_count": len(request_latencies_ms), | |
| 452 | + "request_latency_p50_ms": round(percentile(request_latencies_ms, 0.50), 2), | |
| 453 | + "request_latency_p95_ms": round(percentile(request_latencies_ms, 0.95), 2), | |
| 454 | + "request_latency_max_ms": round(max(request_latencies_ms), 2), | |
| 455 | + "avg_request_latency_ms": round(statistics.fmean(request_latencies_ms), 2), | |
| 456 | + "avg_item_latency_ms": round((wall_seconds / total_items) * 1000, 3), | |
| 457 | + "requests_per_second": round(len(request_batches) / wall_seconds, 2), | |
| 458 | + "items_per_second": round(total_items / wall_seconds, 2), | |
| 459 | + "input_chars_per_second": round(total_input_chars / wall_seconds, 2), | |
| 460 | + "output_chars_per_second": round(output_chars / wall_seconds, 2), | |
| 461 | + "success_count": success_count, | |
| 462 | + "failure_count": failure_count, | |
| 463 | + "success_rate": round(success_count / total_items, 6), | |
| 464 | + "device": str(getattr(backend, "device", capability.get("device", "unknown"))), | |
| 465 | + "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))), | |
| 466 | + "configured_batch_size": int(capability.get("batch_size") or batch_size), | |
| 467 | + "used_batch_size": batch_size, | |
| 468 | + "warmup_batches": warmup_count, | |
| 469 | + **memory, | |
| 470 | + } | |
| 471 | + | |
| 472 | + | |
| 473 | +def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: | |
| 474 | + csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path) | |
| 475 | + scenario = scenario_from_args(args) | |
| 476 | + config, capability = build_config_and_capability(args) | |
| 477 | + configured_batch_size = int(capability.get("batch_size") or 1) | |
| 478 | + batch_size = configured_batch_size | |
| 479 | + texts = load_texts(csv_path, args.column, args.limit) | |
| 480 | + | |
| 481 | + ensure_cuda_stats_reset() | |
| 482 | + load_start = time.perf_counter() | |
| 483 | + service = TranslationService(config) | |
| 484 | + backend = service.get_backend(args.model) | |
| 485 | + load_seconds = time.perf_counter() - load_start | |
| 486 | + | |
| 487 | + runtime = benchmark_serial_case( | |
| 488 | + service=service, | |
| 489 | + backend=backend, | |
| 490 | + scenario=scenario, | |
| 491 | + capability=capability, | |
| 492 | + texts=texts, | |
| 493 | + batch_size=batch_size, | |
| 494 | + warmup_batches=args.warmup_batches, | |
| 495 | + ) | |
| 496 | + runtime["load_seconds"] = round(load_seconds, 4) | |
| 497 | + runtime["total_seconds"] = round(runtime["load_seconds"] + runtime["translate_seconds"], 4) | |
| 498 | + | |
| 499 | + return { | |
| 500 | + "scenario": scenario, | |
| 501 | + "dataset": { | |
| 502 | + "csv_path": str(csv_path), | |
| 503 | + "rows": len(texts), | |
| 504 | + "input_chars": sum(len(text) for text in texts), | |
| 238 | 505 | }, |
| 506 | + "runtime": runtime, | |
| 507 | + } | |
| 508 | + | |
| 509 | + | |
| 510 | +def benchmark_extended_scenario(args: argparse.Namespace) -> Dict[str, Any]: | |
| 511 | + csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path) | |
| 512 | + scenario = scenario_from_args(args) | |
| 513 | + batch_sizes = parse_csv_ints(args.batch_size_list, DEFAULT_BATCH_SIZES) | |
| 514 | + concurrencies = parse_csv_ints(args.concurrency_list, DEFAULT_CONCURRENCIES) | |
| 515 | + largest_batch = max(batch_sizes + [args.concurrency_batch_size]) | |
| 517 | + max_product = args.max_batch_concurrency_product | |
| 518 | + required_items = max( | |
| 519 | + args.limit or 0, | |
| 520 | + max(args.serial_items_per_case, largest_batch), | |
| 521 | + args.concurrency_requests_per_case * args.concurrency_batch_size, | |
| 522 | + largest_batch * args.concurrency_requests_per_case, | |
| 523 | + ) | |
| 524 | + texts = load_texts(csv_path, args.column, required_items) | |
| 525 | + config, capability = build_config_and_capability(args) | |
| 526 | + | |
| 527 | + ensure_cuda_stats_reset() | |
| 528 | + load_start = time.perf_counter() | |
| 529 | + service = TranslationService(config) | |
| 530 | + backend = service.get_backend(args.model) | |
| 531 | + load_seconds = time.perf_counter() - load_start | |
| 532 | + | |
| 533 | + batch_sweep: List[Dict[str, Any]] = [] | |
| 534 | + concurrency_sweep: List[Dict[str, Any]] = [] | |
| 535 | + matrix_results: List[Dict[str, Any]] = [] | |
| 536 | + | |
| 537 | + for batch_size in batch_sizes: | |
| 538 | + case_texts = texts[: max(batch_size, args.serial_items_per_case)] | |
| 539 | + batch_sweep.append( | |
| 540 | + benchmark_serial_case( | |
| 541 | + service=service, | |
| 542 | + backend=backend, | |
| 543 | + scenario=scenario, | |
| 544 | + capability=capability, | |
| 545 | + texts=case_texts, | |
| 546 | + batch_size=batch_size, | |
| 547 | + warmup_batches=args.warmup_batches, | |
| 548 | + ) | |
| 549 | + ) | |
| 550 | + | |
| 551 | + for concurrency in concurrencies: | |
| 552 | + concurrency_sweep.append( | |
| 553 | + benchmark_concurrency_case( | |
| 554 | + service=service, | |
| 555 | + backend=backend, | |
| 556 | + scenario=scenario, | |
| 557 | + capability=capability, | |
| 558 | + texts=texts, | |
| 559 | + batch_size=args.concurrency_batch_size, | |
| 560 | + concurrency=concurrency, | |
| 561 | + requests_per_case=args.concurrency_requests_per_case, | |
| 562 | + warmup_batches=args.warmup_batches, | |
| 563 | + ) | |
| 564 | + ) | |
| 565 | + | |
| 566 | + for batch_size in batch_sizes: | |
| 567 | + for concurrency in concurrencies: | |
| 568 | + if max_product > 0 and batch_size * concurrency > max_product: | |
| 569 | + continue | |
| 570 | + matrix_results.append( | |
| 571 | + benchmark_concurrency_case( | |
| 572 | + service=service, | |
| 573 | + backend=backend, | |
| 574 | + scenario=scenario, | |
| 575 | + capability=capability, | |
| 576 | + texts=texts, | |
| 577 | + batch_size=batch_size, | |
| 578 | + concurrency=concurrency, | |
| 579 | + requests_per_case=args.concurrency_requests_per_case, | |
| 580 | + warmup_batches=args.warmup_batches, | |
| 581 | + ) | |
| 582 | + ) | |
| 583 | + | |
| 584 | +    # Model load happens once and is reported in runtime_defaults; per-case | |
| 585 | +    # load_seconds stays 0.0 so totals are not inflated by triple-counting. | |
| 586 | +    for collection in (batch_sweep, concurrency_sweep, matrix_results): | |
| 587 | +        for item in collection: | |
| 588 | +            item["total_seconds"] = round(item["load_seconds"] + item["translate_seconds"], 4) | |
| 588 | + | |
| 589 | + return { | |
| 590 | + "scenario": scenario, | |
| 239 | 591 | "dataset": { |
| 240 | 592 | "csv_path": str(csv_path), |
| 241 | - "rows": total_items, | |
| 242 | - "input_chars": total_input_chars, | |
| 593 | + "rows_loaded": len(texts), | |
| 594 | + }, | |
| 595 | + "config": { | |
| 596 | + "batch_sizes": batch_sizes, | |
| 597 | + "concurrencies": concurrencies, | |
| 598 | + "serial_items_per_case": args.serial_items_per_case, | |
| 599 | + "concurrency_requests_per_case": args.concurrency_requests_per_case, | |
| 600 | + "concurrency_batch_size": args.concurrency_batch_size, | |
| 601 | + "max_batch_concurrency_product": max_product, | |
| 602 | + "cache_disabled": bool(args.disable_cache), | |
| 243 | 603 | }, |
| 244 | - "runtime": { | |
| 604 | + "runtime_defaults": { | |
| 245 | 605 | "device": str(getattr(backend, "device", capability.get("device", "unknown"))), |
| 246 | 606 | "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))), |
| 247 | - "configured_batch_size": configured_batch_size, | |
| 248 | - "used_batch_size": batch_size, | |
| 249 | - "warmup_batches": warmup_batches, | |
| 607 | + "configured_batch_size": int(capability.get("batch_size") or 1), | |
| 250 | 608 | "load_seconds": round(load_seconds, 4), |
| 251 | - "translate_seconds": round(translate_seconds, 4), | |
| 252 | - "total_seconds": round(load_seconds + translate_seconds, 4), | |
| 253 | - "batch_count": len(batch_latencies_ms), | |
| 254 | - "first_batch_ms": round(batch_latencies_ms[0], 2), | |
| 255 | - "batch_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2), | |
| 256 | - "batch_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2), | |
| 257 | - "batch_latency_max_ms": round(max(batch_latencies_ms), 2), | |
| 258 | - "avg_batch_latency_ms": round(statistics.fmean(batch_latencies_ms), 2), | |
| 259 | - "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3), | |
| 260 | - "items_per_second": round(total_items / translate_seconds, 2), | |
| 261 | - "input_chars_per_second": round(total_input_chars / translate_seconds, 2), | |
| 262 | - "output_chars_per_second": round(output_chars / translate_seconds, 2), | |
| 263 | - "success_count": success_count, | |
| 264 | - "failure_count": failure_count, | |
| 265 | - "success_rate": round(success_count / total_items, 6), | |
| 266 | - "max_rss_mb": max_rss_mb, | |
| 267 | - "peak_gpu_memory_gb": peak_gpu_mem_gb, | |
| 268 | - "peak_gpu_reserved_gb": peak_gpu_reserved_gb, | |
| 269 | 609 | }, |
| 610 | + "batch_sweep": batch_sweep, | |
| 611 | + "concurrency_sweep": concurrency_sweep, | |
| 612 | + "matrix": matrix_results, | |
| 270 | 613 | } |
| 271 | 614 | |
| 272 | 615 | |
| 273 | 616 | def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: |
| 274 | 617 | report = { |
| 275 | 618 | "generated_at": datetime.now().isoformat(timespec="seconds"), |
| 619 | + "suite": args.suite, | |
| 276 | 620 | "environment": build_environment_info(), |
| 277 | 621 | "scenarios": [], |
| 278 | 622 | } |
| ... | ... | @@ -296,11 +640,25 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: |
| 296 | 640 | scenario["scene"], |
| 297 | 641 | "--warmup-batches", |
| 298 | 642 | str(args.warmup_batches), |
| 643 | + "--suite", | |
| 644 | + args.suite, | |
| 645 | + "--serial-items-per-case", | |
| 646 | + str(args.serial_items_per_case), | |
| 647 | + "--concurrency-requests-per-case", | |
| 648 | + str(args.concurrency_requests_per_case), | |
| 649 | + "--concurrency-batch-size", | |
| 650 | + str(args.concurrency_batch_size), | |
| 651 | + "--max-batch-concurrency-product", | |
| 652 | + str(args.max_batch_concurrency_product), | |
| 299 | 653 | ] |
| 300 | 654 | if args.limit: |
| 301 | 655 | cmd.extend(["--limit", str(args.limit)]) |
| 302 | 656 | if args.batch_size: |
| 303 | 657 | cmd.extend(["--batch-size", str(args.batch_size)]) |
| 658 | + if args.batch_size_list: | |
| 659 | + cmd.extend(["--batch-size-list", args.batch_size_list]) | |
| 660 | + if args.concurrency_list: | |
| 661 | + cmd.extend(["--concurrency-list", args.concurrency_list]) | |
| 304 | 662 | if args.device_override: |
| 305 | 663 | cmd.extend(["--device-override", args.device_override]) |
| 306 | 664 | if args.torch_dtype_override: |
| ... | ... | @@ -311,6 +669,8 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: |
| 311 | 669 | cmd.extend(["--num-beams", str(args.num_beams)]) |
| 312 | 670 | if args.attn_implementation: |
| 313 | 671 | cmd.extend(["--attn-implementation", args.attn_implementation]) |
| 672 | + if args.disable_cache: | |
| 673 | + cmd.append("--disable-cache") | |
| 314 | 674 | |
| 315 | 675 | completed = subprocess.run(cmd, capture_output=True, text=True, check=True) |
| 316 | 676 | result_line = "" |
| ... | ... | @@ -327,11 +687,12 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: |
| 327 | 687 | return report |
| 328 | 688 | |
| 329 | 689 | |
| 330 | -def render_markdown_report(report: Dict[str, Any]) -> str: | |
| 690 | +def render_baseline_markdown_report(report: Dict[str, Any]) -> str: | |
| 331 | 691 | lines = [ |
| 332 | 692 | "# Local Translation Model Benchmark", |
| 333 | 693 | "", |
| 334 | 694 | f"- Generated at: `{report['generated_at']}`", |
| 695 | + f"- Suite: `{report['suite']}`", | |
| 335 | 696 | f"- Python: `{report['environment']['python']}`", |
| 336 | 697 | f"- Torch: `{report['environment']['torch']}`", |
| 337 | 698 | f"- Transformers: `{report['environment']['transformers']}`", |
| ... | ... | @@ -342,19 +703,19 @@ def render_markdown_report(report: Dict[str, Any]) -> str: |
| 342 | 703 | lines.extend( |
| 343 | 704 | [ |
| 344 | 705 | "", |
| 345 | - "| Scenario | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Load s | Peak GPU GiB | Success |", | |
| 706 | + "| Scenario | Items/s | Avg item ms | Req p50 ms | Req p95 ms | Load s | Peak GPU GiB | Success |", | |
| 346 | 707 | "|---|---:|---:|---:|---:|---:|---:|---:|", |
| 347 | 708 | ] |
| 348 | 709 | ) |
| 349 | 710 | for item in report["scenarios"]: |
| 350 | 711 | runtime = item["runtime"] |
| 351 | 712 | lines.append( |
| 352 | - "| {name} | {items_per_second} | {avg_item_latency_ms} | {batch_latency_p50_ms} | {batch_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format( | |
| 713 | + "| {name} | {items_per_second} | {avg_item_latency_ms} | {request_latency_p50_ms} | {request_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format( | |
| 353 | 714 | name=item["scenario"]["name"], |
| 354 | 715 | items_per_second=runtime["items_per_second"], |
| 355 | 716 | avg_item_latency_ms=runtime["avg_item_latency_ms"], |
| 356 | - batch_latency_p50_ms=runtime["batch_latency_p50_ms"], | |
| 357 | - batch_latency_p95_ms=runtime["batch_latency_p95_ms"], | |
| 717 | + request_latency_p50_ms=runtime["request_latency_p50_ms"], | |
| 718 | + request_latency_p95_ms=runtime["request_latency_p95_ms"], | |
| 358 | 719 | load_seconds=runtime["load_seconds"], |
| 359 | 720 | peak_gpu_memory_gb=runtime["peak_gpu_memory_gb"], |
| 360 | 721 | success_rate=runtime["success_rate"], |
| ... | ... | @@ -375,7 +736,7 @@ def render_markdown_report(report: Dict[str, Any]) -> str: |
| 375 | 736 | f"- Load time: `{runtime['load_seconds']} s`", |
| 376 | 737 | f"- Translate time: `{runtime['translate_seconds']} s`", |
| 377 | 738 | f"- Throughput: `{runtime['items_per_second']} items/s`, `{runtime['input_chars_per_second']} input chars/s`", |
| 378 | - f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, batch p50 `{runtime['batch_latency_p50_ms']} ms`, batch p95 `{runtime['batch_latency_p95_ms']} ms`, batch max `{runtime['batch_latency_max_ms']} ms`", | |
| 739 | + f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, req p50 `{runtime['request_latency_p50_ms']} ms`, req p95 `{runtime['request_latency_p95_ms']} ms`, req max `{runtime['request_latency_max_ms']} ms`", | |
| 379 | 740 | f"- Memory: max RSS `{runtime['max_rss_mb']} MB`, peak GPU allocated `{runtime['peak_gpu_memory_gb']} GiB`, peak GPU reserved `{runtime['peak_gpu_reserved_gb']} GiB`", |
| 380 | 741 | f"- Success: `{runtime['success_count']}/{dataset['rows']}`", |
| 381 | 742 | "", |
| ... | ... | @@ -384,32 +745,145 @@ def render_markdown_report(report: Dict[str, Any]) -> str: |
| 384 | 745 | return "\n".join(lines) |
| 385 | 746 | |
| 386 | 747 | |
| 748 | +def render_case_table( | |
| 749 | + title: str, | |
| 750 | + rows: Sequence[Dict[str, Any]], | |
| 751 | + *, | |
| 752 | + include_batch: bool, | |
| 753 | + include_concurrency: bool, | |
| 754 | +) -> List[str]: | |
| 755 | + headers = ["Rows", "Requests", "Items/s", "Req/s", "Avg req ms", "Req p50 ms", "Req p95 ms", "Peak GPU GiB"] | |
| 756 | + prefix_headers: List[str] = [] | |
| 757 | + if include_batch: | |
| 758 | + prefix_headers.append("Batch") | |
| 759 | + if include_concurrency: | |
| 760 | + prefix_headers.append("Concurrency") | |
| 761 | + headers = prefix_headers + headers | |
| 762 | + lines = [f"### {title}", ""] | |
| 763 | + lines.append("| " + " | ".join(headers) + " |") | |
| 764 | + lines.append("|" + "|".join(["---:"] * len(headers)) + "|") | |
| 765 | + for item in rows: | |
| 766 | + values: List[str] = [] | |
| 767 | + if include_batch: | |
| 768 | + values.append(str(item["batch_size"])) | |
| 769 | + if include_concurrency: | |
| 770 | + values.append(str(item["concurrency"])) | |
| 771 | + values.extend( | |
| 772 | + [ | |
| 773 | + str(item["rows"]), | |
| 774 | + str(item["requests"]), | |
| 775 | + str(item["items_per_second"]), | |
| 776 | + str(item["requests_per_second"]), | |
| 777 | + str(item["avg_request_latency_ms"]), | |
| 778 | + str(item["request_latency_p50_ms"]), | |
| 779 | + str(item["request_latency_p95_ms"]), | |
| 780 | + str(item["peak_gpu_memory_gb"]), | |
| 781 | + ] | |
| 782 | + ) | |
| 783 | + lines.append("| " + " | ".join(values) + " |") | |
| 784 | + lines.append("") | |
| 785 | + return lines | |
| 786 | + | |
| 787 | + | |
| 788 | +def render_extended_markdown_report(report: Dict[str, Any]) -> str: | |
| 789 | + lines = [ | |
| 790 | + "# Local Translation Model Extended Benchmark", | |
| 791 | + "", | |
| 792 | + f"- Generated at: `{report['generated_at']}`", | |
| 793 | + f"- Suite: `{report['suite']}`", | |
| 794 | + f"- Python: `{report['environment']['python']}`", | |
| 795 | + f"- Torch: `{report['environment']['torch']}`", | |
| 796 | + f"- Transformers: `{report['environment']['transformers']}`", | |
| 797 | + f"- CUDA: `{report['environment']['cuda_available']}`", | |
| 798 | + ] | |
| 799 | + if report["environment"]["gpu_name"]: | |
| 800 | + lines.append(f"- GPU: `{report['environment']['gpu_name']}` ({report['environment']['gpu_total_mem_gb']} GiB)") | |
| 801 | + | |
| 802 | + lines.extend( | |
| 803 | + [ | |
| 804 | + "", | |
| 805 | + "## Reading Guide", | |
| 806 | + "", | |
| 807 | + "- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.", | |
| 808 | + "- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.", | |
| 809 | +            "- `matrix`: combined `batch_size x concurrency` runs; cases where `batch_size * concurrency` exceeds `--max-batch-concurrency-product` are skipped (0 disables the limit).", | |
| 810 | + "", | |
| 811 | + ] | |
| 812 | + ) | |
| 813 | + | |
| 814 | + for item in report["scenarios"]: | |
| 815 | + lines.extend( | |
| 816 | + [ | |
| 817 | + f"## {item['scenario']['name']}", | |
| 818 | + "", | |
| 819 | + f"- Direction: `{item['scenario']['source_lang']} -> {item['scenario']['target_lang']}`", | |
| 820 | + f"- Column: `{item['scenario']['column']}`", | |
| 821 | + f"- Loaded rows: `{item['dataset']['rows_loaded']}`", | |
| 822 | + f"- Load time: `{item['runtime_defaults']['load_seconds']} s`", | |
| 823 | + f"- Device: `{item['runtime_defaults']['device']}`", | |
| 824 | + f"- DType: `{item['runtime_defaults']['torch_dtype']}`", | |
| 825 | + f"- Cache disabled: `{item['config']['cache_disabled']}`", | |
| 826 | + "", | |
| 827 | + ] | |
| 828 | + ) | |
| 829 | + lines.extend(render_case_table("Batch Sweep (`concurrency=1`)", item["batch_sweep"], include_batch=True, include_concurrency=False)) | |
| 830 | + lines.extend( | |
| 831 | + render_case_table( | |
| 832 | + f"Concurrency Sweep (`batch_size={item['config']['concurrency_batch_size']}`)", | |
| 833 | + item["concurrency_sweep"], | |
| 834 | + include_batch=False, | |
| 835 | + include_concurrency=True, | |
| 836 | + ) | |
| 837 | + ) | |
| 838 | + lines.extend(render_case_table("Batch x Concurrency Matrix", item["matrix"], include_batch=True, include_concurrency=True)) | |
| 839 | + return "\n".join(lines) | |
| 840 | + | |
| 841 | + | |
| 842 | +def render_markdown_report(report: Dict[str, Any]) -> str: | |
| 843 | + if report["suite"] == "extended": | |
| 844 | + return render_extended_markdown_report(report) | |
| 845 | + return render_baseline_markdown_report(report) | |
| 846 | + | |
| 847 | + | |
| 387 | 848 | def main() -> None: |
| 388 | 849 | args = parse_args() |
| 389 | 850 | if args.single: |
| 390 | - result = benchmark_single_scenario(args) | |
| 851 | + if args.suite == "extended": | |
| 852 | + result = benchmark_extended_scenario(args) | |
| 853 | + else: | |
| 854 | + result = benchmark_single_scenario(args) | |
| 391 | 855 | print("JSON_RESULT=" + json.dumps(result, ensure_ascii=False)) |
| 392 | 856 | return |
| 393 | 857 | |
| 394 | 858 | report = run_all_scenarios(args) |
| 395 | 859 | output_dir = resolve_output_dir(args.output_dir) |
| 396 | 860 | timestamp = datetime.now().strftime("%H%M%S") |
| 397 | - json_path = output_dir / f"translation_local_models_{timestamp}.json" | |
| 398 | - md_path = output_dir / f"translation_local_models_{timestamp}.md" | |
| 861 | + suffix = "extended" if args.suite == "extended" else "baseline" | |
| 862 | + json_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.json" | |
| 863 | + md_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.md" | |
| 399 | 864 | json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8") |
| 400 | 865 | md_path.write_text(render_markdown_report(report), encoding="utf-8") |
| 401 | 866 | |
| 402 | 867 | print(f"JSON report: {json_path}") |
| 403 | 868 | print(f"Markdown report: {md_path}") |
| 404 | 869 | for item in report["scenarios"]: |
| 405 | - runtime = item["runtime"] | |
| 406 | - print( | |
| 407 | - f"{item['scenario']['name']}: " | |
| 408 | - f"{runtime['items_per_second']} items/s | " | |
| 409 | - f"avg_item={runtime['avg_item_latency_ms']} ms | " | |
| 410 | - f"p95_batch={runtime['batch_latency_p95_ms']} ms | " | |
| 411 | - f"load={runtime['load_seconds']} s" | |
| 412 | - ) | |
| 870 | + if args.suite == "extended": | |
| 871 | + best_batch = max(item["batch_sweep"], key=lambda x: x["items_per_second"]) | |
| 872 | + best_concurrency = max(item["concurrency_sweep"], key=lambda x: x["items_per_second"]) | |
| 873 | + print( | |
| 874 | + f"{item['scenario']['name']}: " | |
| 875 | + f"best_batch={best_batch['batch_size']} ({best_batch['items_per_second']} items/s) | " | |
| 876 | + f"best_concurrency={best_concurrency['concurrency']} ({best_concurrency['items_per_second']} items/s @ batch={best_concurrency['batch_size']})" | |
| 877 | + ) | |
| 878 | + else: | |
| 879 | + runtime = item["runtime"] | |
| 880 | + print( | |
| 881 | + f"{item['scenario']['name']}: " | |
| 882 | + f"{runtime['items_per_second']} items/s | " | |
| 883 | + f"avg_item={runtime['avg_item_latency_ms']} ms | " | |
| 884 | + f"p95_req={runtime['request_latency_p95_ms']} ms | " | |
| 885 | + f"load={runtime['load_seconds']} s" | |
| 886 | + ) | |
| 413 | 887 | |
| 414 | 888 | |
| 415 | 889 | if __name__ == "__main__": | ... | ... |
translation/README.md
| ... | ... | @@ -13,7 +13,7 @@ |
| 13 | 13 | - Virtualenv: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh) |
| 14 | 14 | - Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py) |
| 15 | 15 | - Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) |
| 16 | -- Performance report: [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md) | |
| 16 | +- Performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md) | |
| 17 | 17 | |
| 18 | 18 | ## 1. Design Goals |
| 19 | 19 | |
| ... | ... | @@ -530,6 +530,44 @@ curl -X POST http://127.0.0.1:6006/translate \ |
| 530 | 530 | Dataset: |
| 531 | 531 | - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv) |
| 532 | 532 | |
| 533 | +Latest reports: | |
| 534 | +- Summary: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md) | |
| 535 | +- Full Markdown: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md) | |
| 536 | +- Full JSON: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json) | |
| 537 | + | |
| 538 | +### 10.1 Which numbers to read first | |
| 539 | + | |
| 540 | +The 3 result groups are shown separately instead of being mixed into one table: | |
| 541 | + | |
| 542 | +- `batch_sweep` | |
| 543 | + Fixed `concurrency=1`; compares single-stream batch-processing performance across `batch_size` values only | |
| 544 | +- `concurrency_sweep` | |
| 545 | + Fixed `batch_size=1`; shows "single request" latency and throughput at different concurrency levels | |
| 546 | +- `batch x concurrency matrix` | |
| 547 | + Looks at the interaction between `batch_size` and `concurrency`; this round is limited to `batch_size * concurrency <= 128` | |
| 548 | + | |
| 549 | +Recommendations: | |
| 550 | + | |
| 551 | +- For online query translation latency: read `concurrency_sweep` first | |
| 552 | +- For offline bulk translation throughput: read `batch_sweep` first | |
| 553 | +- For single-worker service capacity limits: then read the `batch x concurrency matrix` | |
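The `batch_size * concurrency <= 128` filter used by the matrix suite can be sketched as below. This is a minimal illustration of the combination logic; the function name is hypothetical and the real script may build its cases differently:

```python
from itertools import product

def build_matrix_cases(batch_sizes, concurrencies, limit=128):
    """Keep only (batch_size, concurrency) pairs whose product stays
    within the work limit, so the heaviest combos are skipped."""
    return [
        (b, c)
        for b, c in product(batch_sizes, concurrencies)
        if b * c <= limit
    ]

# The sweep values from this benchmark round
cases = build_matrix_cases([1, 4, 8, 16, 32, 64], [1, 2, 4, 8, 16, 64])
```

With these lists, 25 of the 36 possible combinations survive the filter; extreme pairs such as `(64, 64)` are excluded.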
| 554 | + | |
| 555 | +### 10.2 Parameters for this round | |
| 556 | + | |
| 557 | +Test date: `2026-03-18` | |
| 558 | + | |
| 559 | +Environment: | |
| 560 | +- GPU: `Tesla T4 16GB` | |
| 561 | +- Python env: `.venv-translator` | |
| 562 | +- Torch / Transformers: `2.10.0+cu128 / 5.3.0` | |
| 563 | + | |
| 564 | +Common parameters: | |
| 565 | +- Cache: disabled (`--disable-cache`) so cache hits do not skew the results | |
| 566 | +- `batch_sweep`: `256` items per case | |
| 567 | +- `concurrency_sweep`: fixed `batch_size=1`, `32` requests per case | |
| 568 | +- `batch x concurrency matrix`: `32` requests per case, keeping only `batch_size * concurrency <= 128` | |
| 569 | +- Warmup: `1` batch | |
| 570 | + | |
| 533 | 571 | Reproduction command: |
| 534 | 572 | |
| 535 | 573 | ```bash |
| ... | ... | @@ -537,16 +575,36 @@ cd /data/saas-search |
| 537 | 575 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py |
| 538 | 576 | ``` |
| 539 | 577 | |
| 540 | -Single-model reproduction example: | |
| 578 | +Extended benchmark reproduction command for this round: | |
| 579 | + | |
| 580 | +```bash | |
| 581 | +cd /data/saas-search | |
| 582 | +./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ | |
| 583 | + --suite extended \ | |
| 584 | + --disable-cache \ | |
| 585 | + --serial-items-per-case 256 \ | |
| 586 | + --concurrency-requests-per-case 32 \ | |
| 587 | + --concurrency-batch-size 1 \ | |
| 588 | + --output-dir perf_reports/20260318/translation_local_models | |
| 589 | +``` | |
| 590 | + | |
| 591 | +Single-model extended benchmark example: | |
| 541 | 592 | |
| 542 | 593 | ```bash |
| 543 | 594 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ |
| 544 | 595 | --single \ |
| 596 | + --suite extended \ | |
| 545 | 597 | --model opus-mt-zh-en \ |
| 546 | 598 | --source-lang zh \ |
| 547 | 599 | --target-lang en \ |
| 548 | 600 | --column title_cn \ |
| 549 | - --scene sku_name | |
| 601 | + --scene sku_name \ | |
| 602 | + --disable-cache \ | |
| 603 | + --batch-size-list 1,4,8,16,32,64 \ | |
| 604 | + --concurrency-list 1,2,4,8,16,64 \ | |
| 605 | + --serial-items-per-case 256 \ | |
| 606 | + --concurrency-requests-per-case 32 \ | |
| 607 | + --concurrency-batch-size 1 | |
| 550 | 608 | ``` |
| 551 | 609 | |
| 552 | 610 | Single-request latency reproduction: |
| ... | ... | @@ -554,37 +612,143 @@ cd /data/saas-search |
| 554 | 612 | ```bash |
| 555 | 613 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ |
| 556 | 614 | --single \ |
| 615 | + --suite extended \ | |
| 557 | 616 | --model nllb-200-distilled-600m \ |
| 558 | 617 | --source-lang zh \ |
| 559 | 618 | --target-lang en \ |
| 560 | 619 | --column title_cn \ |
| 561 | 620 | --scene sku_name \ |
| 562 | - --batch-size 1 \ | |
| 563 | - --limit 100 | |
| 621 | + --disable-cache \ | |
| 622 | + --batch-size-list 1 \ | |
| 623 | + --concurrency-list 1,2,4,8,16,64 \ | |
| 624 | + --serial-items-per-case 256 \ | |
| 625 | + --concurrency-requests-per-case 32 \ | |
| 626 | + --concurrency-batch-size 1 | |
| 564 | 627 | ``` |
| 565 | 628 | |
| 566 | -Notes: | |
| 567 | -- For the current script and local backend, a "single request" is directly equivalent to `batch_size=1` | |
| 568 | -- In that case the script's `batch_latency_*` metrics can be read as "single-request latency" | |
| 569 | -- Online search query translation should focus on this data, not large-batch throughput | |
| 629 | +### 10.3 Single-stream batch results | |
| 630 | + | |
| 631 | +This group is `concurrency=1` only; do not treat its `request p95` as the p95 of concurrent online requests. | |
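The `Req p95 ms` columns in the tables below are tail-latency percentiles over per-request wall-clock times. A nearest-rank sketch of how such a value can be computed (an assumption for illustration; the actual script may use interpolation instead):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile, a common choice for latency reporting."""
    ordered = sorted(values)
    # Rank of the pct-th percentile, clamped to at least the first sample
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds
latencies_ms = [220.0, 230.0, 240.0, 260.0, 300.0, 340.0, 420.0, 610.0]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
```

Note how a single slow outlier dominates p95 while leaving p50 almost untouched, which is why the tables report both.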
| 632 | + | |
| 633 | +`nllb-200-distilled-600m zh -> en` | |
| 634 | + | |
| 635 | +| Batch | Items/s | Avg item ms | Req p95 ms | | |
| 636 | +|---:|---:|---:|---:| | |
| 637 | +| 1 | 2.91 | 343.488 | 616.27 | | |
| 638 | +| 4 | 8.44 | 118.545 | 722.95 | | |
| 639 | +| 8 | 14.85 | 67.335 | 728.47 | | |
| 640 | +| 16 | 27.28 | 36.662 | 769.18 | | |
| 641 | +| 32 | 38.6 | 25.908 | 1369.88 | | |
| 642 | +| 64 | 58.3 | 17.152 | 1659.9 | | |
| 643 | + | |
| 644 | +`nllb-200-distilled-600m en -> zh` | |
| 645 | + | |
| 646 | +| Batch | Items/s | Avg item ms | Req p95 ms | | |
| 647 | +|---:|---:|---:|---:| | |
| 648 | +| 1 | 1.91 | 524.917 | 866.33 | | |
| 649 | +| 4 | 4.94 | 202.473 | 1599.74 | | |
| 650 | +| 8 | 8.25 | 121.188 | 1632.29 | | |
| 651 | +| 16 | 13.52 | 73.956 | 1649.65 | | |
| 652 | +| 32 | 21.27 | 47.017 | 1827.16 | | |
| 653 | +| 64 | 32.64 | 30.641 | 2031.25 | | |
| 570 | 654 | |
| 571 | -Measured single-request latency (`Tesla T4`, `limit=100`): | |
| 572 | -- `nllb-200-distilled-600m zh->en`: p50 ~`292.54 ms`, p95 ~`624.12 ms`, avg ~`321.91 ms` | |
| 573 | -- `nllb-200-distilled-600m en->zh`: p50 ~`481.61 ms`, p95 ~`1171.71 ms`, avg ~`542.47 ms` | |
| 655 | +`opus-mt-zh-en zh -> en` | |
| 574 | 656 | |
| 575 | -Previous benchmark environment: | |
| 576 | -- GPU: `Tesla T4 16GB` | |
| 577 | -- Python env: `.venv-translator` | |
| 578 | -- Dataset size: `18,576` product titles | |
| 657 | +| Batch | Items/s | Avg item ms | Req p95 ms | | |
| 658 | +|---:|---:|---:|---:| | |
| 659 | +| 1 | 6.15 | 162.536 | 274.74 | | |
| 660 | +| 4 | 15.34 | 65.192 | 356.0 | | |
| 661 | +| 8 | 25.51 | 39.202 | 379.84 | | |
| 662 | +| 16 | 41.44 | 24.129 | 797.93 | | |
| 663 | +| 32 | 54.36 | 18.397 | 1693.31 | | |
| 664 | +| 64 | 70.15 | 14.255 | 2161.59 | | |
| 665 | + | |
| 666 | +`opus-mt-en-zh en -> zh` | |
| 667 | + | |
| 668 | +| Batch | Items/s | Avg item ms | Req p95 ms | | |
| 669 | +|---:|---:|---:|---:| | |
| 670 | +| 1 | 4.53 | 220.598 | 411.57 | | |
| 671 | +| 4 | 10.12 | 98.844 | 761.49 | | |
| 672 | +| 8 | 14.63 | 68.361 | 1930.85 | | |
| 673 | +| 16 | 24.33 | 41.1 | 2098.54 | | |
| 674 | +| 32 | 33.91 | 29.487 | 2152.28 | | |
| 675 | +| 64 | 42.47 | 23.547 | 2371.85 | | |
| 676 | + | |
| 677 | +Batch conclusions: | |
| 678 | + | |
| 679 | +- On raw throughput alone, all 4 directions peak at `batch_size=64` | |
| 680 | +- If per-batch tail latency also matters, `batch_size=16` is usually the better balance | |
| 681 | +- `opus-mt-zh-en` is the fastest model in this round's bulk scenario; `nllb en->zh` is the slowest | |
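In the single-stream tables above, `Items/s` and `Avg item ms` are two views of the same measurement: at `concurrency=1` the average item latency is exactly the reciprocal of throughput. A minimal sketch with hypothetical batch timings:

```python
def batch_metrics(batch_size, batch_latencies_ms):
    """Single-stream (concurrency=1) throughput and per-item latency
    derived from per-batch wall-clock latencies."""
    total_items = batch_size * len(batch_latencies_ms)
    total_seconds = sum(batch_latencies_ms) / 1000.0
    items_per_second = total_items / total_seconds
    avg_item_latency_ms = sum(batch_latencies_ms) / total_items
    return items_per_second, avg_item_latency_ms

# Example: four batches of 16 items, each taking roughly 600 ms
ips, avg_ms = batch_metrics(16, [580.0, 600.0, 610.0, 620.0])
```

This is why larger batches improve `Items/s` and `Avg item ms` together even as the per-request (`Req p95 ms`) latency grows: the whole batch takes longer, but each item's share of it shrinks.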
| 682 | + | |
| 683 | +### 10.4 Single-request concurrency results | |
| 684 | + | |
| 685 | +This group fixes `batch_size=1`, so it can be read directly as "single-request behavior under different concurrency levels". | |
| 686 | + | |
| 687 | +`nllb-200-distilled-600m zh -> en` | |
| 688 | + | |
| 689 | +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms | | |
| 690 | +|---:|---:|---:|---:|---:| | |
| 691 | +| 1 | 4.17 | 239.99 | 226.34 | 373.27 | | |
| 692 | +| 2 | 4.1 | 477.99 | 459.36 | 703.96 | | |
| 693 | +| 4 | 4.1 | 910.74 | 884.71 | 1227.01 | | |
| 694 | +| 8 | 4.04 | 1697.73 | 1818.48 | 2383.8 | | |
| 695 | +| 16 | 4.07 | 2801.91 | 3473.63 | 4145.92 | | |
| 696 | +| 64 | 4.04 | 3714.49 | 3610.08 | 7337.3 | | |
| 697 | + | |
| 698 | +`nllb-200-distilled-600m en -> zh` | |
| 699 | + | |
| 700 | +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms | | |
| 701 | +|---:|---:|---:|---:|---:| | |
| 702 | +| 1 | 2.16 | 463.18 | 439.54 | 670.78 | | |
| 703 | +| 2 | 2.15 | 920.48 | 908.27 | 1213.3 | | |
| 704 | +| 4 | 2.16 | 1759.87 | 1771.58 | 2158.04 | | |
| 705 | +| 8 | 2.15 | 3284.44 | 3658.45 | 3971.01 | | |
| 706 | +| 16 | 2.14 | 5669.15 | 7117.7 | 7522.48 | | |
| 707 | +| 64 | 2.14 | 7631.14 | 7510.97 | 14139.03 | | |
| 708 | + | |
| 709 | +`opus-mt-zh-en zh -> en` | |
| 710 | + | |
| 711 | +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms | | |
| 712 | +|---:|---:|---:|---:|---:| | |
| 713 | +| 1 | 9.21 | 108.53 | 91.7 | 179.12 | | |
| 714 | +| 2 | 8.92 | 219.19 | 212.29 | 305.34 | | |
| 715 | +| 4 | 9.09 | 411.76 | 420.08 | 583.97 | | |
| 716 | +| 8 | 8.85 | 784.14 | 835.73 | 1043.06 | | |
| 717 | +| 16 | 9.01 | 1278.4 | 1483.34 | 1994.56 | | |
| 718 | +| 64 | 8.82 | 1687.08 | 1563.48 | 3381.58 | | |
| 719 | + | |
| 720 | +`opus-mt-en-zh en -> zh` | |
| 721 | + | |
| 722 | +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms | | |
| 723 | +|---:|---:|---:|---:|---:| | |
| 724 | +| 1 | 3.6 | 277.73 | 145.85 | 1180.37 | | |
| 725 | +| 2 | 3.55 | 559.38 | 346.71 | 1916.96 | | |
| 726 | +| 4 | 3.53 | 997.71 | 721.04 | 2944.17 | | |
| 727 | +| 8 | 3.51 | 1644.28 | 1590.93 | 3632.99 | | |
| 728 | +| 16 | 3.5 | 2600.18 | 2586.34 | 5554.04 | | |
| 729 | +| 64 | 3.52 | 3366.52 | 2780.0 | 7950.41 | | |
| 579 | 730 | |
| 580 | -Final performance results: | |
| 731 | +Concurrency conclusions: | |
| 732 | + | |
| 733 | +- The current local seq2seq backend holds a per-model lock, so on a single worker raising client concurrency barely increases throughput; the waiting time just piles onto request latency | |
| 734 | +- If online query translation needs stable latency, keep concurrency low; past `8+` concurrent clients, p95 degrades clearly in all 4 directions | |
| 735 | +- In the online scenario, `opus-mt-zh-en` has the most stable latency; `nllb en->zh` is the slowest, and its tail latency amplifies the most under concurrency | |
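The flat-throughput, growing-latency pattern in the tables above is what a single model lock predicts. A back-of-the-envelope model (not the backend's actual scheduler, and the 240 ms service time is taken loosely from the `nllb zh->en` `concurrency=1` row):

```python
def serialized_latency_estimate(service_ms, concurrency):
    """Rough steady-state estimate for a backend that serializes all
    requests behind one model lock: throughput is pinned at ~1/service
    time, while each request also waits for the other in-flight ones."""
    throughput_rps = 1000.0 / service_ms       # unchanged by concurrency
    avg_request_ms = service_ms * concurrency  # service time + queue wait
    return throughput_rps, avg_request_ms

# nllb zh->en: ~240 ms per single request at concurrency=1
for c in (1, 2, 4, 8):
    rps, lat = serialized_latency_estimate(240.0, c)
    print(f"concurrency={c}: ~{rps:.2f} req/s, ~{lat:.0f} ms avg")
```

The estimate (240 / 480 / 960 / 1920 ms at concurrency 1/2/4/8) tracks the measured averages (240 / 478 / 911 / 1698 ms) closely, supporting the reading that extra concurrency here buys queueing, not throughput.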
| 736 | + | |
| 737 | +### 10.5 How to read the batch x concurrency matrix | |
| 738 | + | |
| 739 | +Full matrix: | |
| 740 | +- [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md) | |
| 741 | + | |
| 742 | +This table mainly answers two questions: | |
| 581 | 743 | |
| 582 | -| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | | |
| 583 | -|---|---|---:|---:|---:|---:|---:|---:|---:|---:| | |
| 584 | -| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 | | |
| 585 | -| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 | | |
| 586 | -| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 | | |
| 587 | -| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 | | |
| 744 | +- For offline batch jobs: once `batch_size` is raised, does throughput keep growing at different concurrency levels | |
| 745 | +- For serving requests from a single worker: at which `batch_size x concurrency` combination does obvious queueing set in | |
| 746 | + | |
| 747 | +Common pattern across this round's matrix: | |
| 748 | + | |
| 749 | +- Throughput is driven mainly by `batch_size`; `concurrency` is not a major source of gain | |
| 750 | +- With `batch_size` fixed, raising `concurrency` from `1` to `2/4/8/...` barely changes `items/s`, while `avg req ms / p95` keeps climbing | |
| 751 | +- So the current implementation behaves like a "single worker + internally serialized GPU inference" service, not one that scales throughput with client concurrency | |
| 588 | 752 | |
| 589 | 753 | NLLB performance-tuning notes: |
| 590 | 754 | |
| ... | ... | @@ -632,7 +796,7 @@ NLLB 性能优化经验: |
| 632 | 796 | - 运行方式:单 worker,避免重复加载 |
| 633 | 797 | |
| 634 | 798 | More detailed performance notes: |
| 635 | -- [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md) | |
| 799 | +- [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md) | |
| 636 | 800 | |
| 637 | 801 | ## 11. Development Notes |
| 638 | 802 | ... | ... |