Commit 2a6d9d76f52556f65e4c3291fc402660f6a21817
Parent: cd4ce66d
Updated the benchmark script and docs so that single-request, single-stream batch, concurrency, and batch x concurrency matrix results are reported separately.

Changes:
- scripts/benchmark_translation_local_models.py: added --suite extended, which supports batch_size=1,4,8,16,32,64, concurrency=1,2,4,8,16,64, and a combination matrix limited to batch_size * concurrency <= 128. Single-scenario mode now loads only the target model (so load_seconds is cleaner) and supports --disable-cache.
- translation/README.md: split the performance section into batch_sweep, concurrency_sweep, and batch x concurrency matrix, and added this round's parameters, reproduction commands, and summary tables.
- perf_reports/20260318/translation_local_models/README.md: added a summary of this follow-up round. Full results live in translation_local_models_extended_221846.md and translation_local_models_extended_221846.json.

The core conclusions of this round are clear:
- For online single requests, read concurrency_sweep, i.e. the table with batch_size=1 fixed.
- For offline bulk throughput, read batch_sweep: the highest raw throughput in all four directions appears at batch_size=64, but batch_size=16 still looks like the more balanced default.
- The current local seq2seq backend holds a single-model lock, so raising client concurrency barely increases throughput; it mostly turns queueing time into a higher p95. Concurrency is therefore a latency-budget question, not a throughput-scaling lever.
- The fastest model for online single requests this round is opus-mt-zh-en; the slowest, and the one whose latency amplifies most under concurrency, is nllb-200-distilled-600m en->zh.
17 changed files with 1138 additions and 117 deletions
docs/TODO.txt
@@ -1,6 +1,30 @@
 
 
-product_enrich : Partial Mode
+
+
+nllb-200-distilled-600M performance optimization
+Research performance-optimization options for seq2seq, transformer-architecture models such as nllb-200-distilled-600M: ways to raise the online translation service's throughput and reduce latency. Survey online inference-serving solutions and find a high-performance way to serve the model.
+
+cnclip performance optimization
+
+rerank performance optimization
+
+
+Timeouts
+Hard timeout while the query-analysis stage waits for translation/embedding
+Config file: config/config.yaml
+Config key: query_config.async_wait_timeout_ms: 80
+Where it takes effect: query/query_parser.py converts the value to seconds and passes it to wait(...)
+2) Embedding HTTP call timeout (Text/Image)
+No environment-variable override is used any more (the previously mentioned EMBEDDING_HTTP_TIMEOUT_SEC has been dropped)
+Config file: config/config.yaml
+Config key: services.embedding.providers.http.timeout_sec (a sample default of 60 was added to the YAML)
+Where it takes effect:
+embeddings/text_encoder.py: requests.post(..., timeout=self.timeout_sec)
+embeddings/image_encoder.py: requests.post(..., timeout=self.timeout_sec)
+
+
+
+
+product_enrich : Partial Mode : done
 https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR
 Set the role of the last message in the messages array to assistant, provide the prefix in its content, and set the parameter "partial": true on that message. The messages format is as follows:
 [
@@ -15,7 +41,6 @@
 }
 ]
 The model starts generating from the prefix content.
-
 Non-thinking mode is supported.
 
 
@@ -41,12 +66,6 @@
 
 
 
-The translation cache needs a refactor
-
-
-
-
-
 
 suggest index: currently a full-reindex script; hand it over to Jin Wei
 
perf_reports/20260318/translation_local_models/README.md
new file mode 100644
@@ -0,0 +1,101 @@
# Local Translation Model Benchmark Report

Benchmark script:
- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)

Full results:
- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)

Test date:
- `2026-03-18`

Environment:
- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)

## Method

This round splits the results into three categories:

- `batch_sweep`
  fixed `concurrency=1`, comparing `batch_size=1/4/8/16/32/64`
- `concurrency_sweep`
  fixed `batch_size=1`, comparing `concurrency=1/2/4/8/16/64`
- `batch x concurrency matrix`
  combined sweep, keeping only `batch_size * concurrency <= 128`

Shared settings:
- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per case
- `concurrency_sweep`: `32` requests per case
- `matrix`: `32` requests per case
- Warmup: `1` batch

Reproduction command:

```bash
cd /data/saas-search
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --suite extended \
  --disable-cache \
  --serial-items-per-case 256 \
  --concurrency-requests-per-case 32 \
  --concurrency-batch-size 1 \
  --output-dir perf_reports/20260318/translation_local_models
```

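The matrix in this suite keeps only combinations with `batch_size * concurrency <= 128`. A minimal sketch of that enumeration (the function and constant names here are illustrative, not the script's internals):

```python
from itertools import product

BATCH_SIZES = [1, 4, 8, 16, 32, 64]      # mirrors the suite's batch sizes
CONCURRENCIES = [1, 2, 4, 8, 16, 64]     # mirrors the suite's concurrency levels
MAX_PRODUCT = 128                        # the batch_size * concurrency cap

def matrix_cases(batch_sizes, concurrencies, max_product):
    """Yield (batch_size, concurrency) pairs that pass the product cap."""
    for bs, conc in product(batch_sizes, concurrencies):
        if max_product and bs * conc > max_product:
            continue  # oversized combinations are skipped, as in the report
        yield bs, conc

cases = list(matrix_cases(BATCH_SIZES, CONCURRENCIES, MAX_PRODUCT))
# 25 combinations survive the cap; (64, 4) is excluded because 256 > 128.
```

This matches the matrix tables below: 6 rows at batch 1, then 5, 5, 4, 3, 2 rows as the batch size grows.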
## Key Results

### 1. Single-stream batch sweep

| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |

Reading:
- On raw throughput, all four directions peak at `batch_size=64`
- If balanced per-batch latency also matters, `batch_size=16` is the better default bulk candidate
- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`

### 2. Single-request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |

Reading:
- At `batch_size=1`, raising client concurrency barely raises throughput; it mostly converts waiting time into request latency
- Online query translation is best served with low concurrency; beyond `8` concurrent clients, p95 degrades clearly in all four directions
- The most stable model online is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh`

### 3. batch x concurrency matrix

Highest-throughput combinations (under the `batch_size * concurrency <= 128` constraint):

| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
|---|---|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |

Reading:
- In the current implementation, throughput is determined mainly by `batch_size`, not by `concurrency`
- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little while request latency rises clearly
- This suggests the local translation service behaves like a single worker processing the GPU serially; capacity planning cannot trade client concurrency for throughput

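Why a single-model lock converts concurrency into queueing rather than throughput can be shown with a toy model (an illustrative sketch, not the actual service code; the lock and sleep stand in for the backend's per-model lock and `generate()` time):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

model_lock = threading.Lock()  # hypothetical stand-in for the backend's single-model lock

def translate(batch, per_batch_seconds=0.01):
    """Pretend-translate one request batch; the lock serializes GPU access."""
    with model_lock:                   # only one request runs at a time
        time.sleep(per_batch_seconds)  # stands in for model inference
    return len(batch)

def run(concurrency, requests=8):
    """Return wall time for `requests` single-item requests at a concurrency level."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(translate, [["hello"]] * requests))
    return time.perf_counter() - start
```

Because every request still passes through the lock one at a time, total wall time (and hence throughput) stays roughly flat as `concurrency` rises; each request just spends longer waiting, which is exactly the p95 inflation visible in the sweep above.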
## Recommendation

- For online query translation, read `concurrency_sweep` and treat the `batch_size=1` table as the primary metric
- For offline bulk translation, read `batch_sweep`; start from `batch_size=16` by default, then move to `32/64` as throughput targets require
- If the current single-worker architecture stays, treat the allowed concurrency as a latency-budget question, not a throughput-scaling lever
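The p50/p95 columns in these reports are standard latency percentiles. For reference, a nearest-rank percentile can be computed as below (a minimal sketch; the benchmark script's exact estimator may differ):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile over latency samples in milliseconds."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

latencies = [95.0, 110.0, 120.0, 180.0, 400.0]
p50 = percentile(latencies, 50)  # -> 120.0 (3rd of 5 ordered samples)
p95 = percentile(latencies, 95)  # -> 400.0 (5th of 5 ordered samples)
```

With small sample counts like the 32-request cases here, p95 is effectively the worst or second-worst request, so a single slow request moves it a lot.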
12 new raw result files (mode 100644) under perf_reports/20260318/translation_local_models/raw/:
nllb_en_zh_bs1.jsonl, nllb_en_zh_bs4.jsonl, nllb_en_zh_bs8.jsonl, nllb_en_zh_bs16.jsonl, nllb_en_zh_bs32.jsonl, nllb_en_zh_bs64.jsonl,
nllb_zh_en_bs1.jsonl, nllb_zh_en_bs4.jsonl, nllb_zh_en_bs8.jsonl, nllb_zh_en_bs16.jsonl, nllb_zh_en_bs32.jsonl, nllb_zh_en_bs64.jsonl
perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md
new file mode 100644
@@ -0,0 +1,263 @@
# Local Translation Model Extended Benchmark

- Generated at: `2026-03-18T21:28:09`
- Suite: `extended`
- Python: `3.12.3`
- Torch: `2.10.0+cu128`
- Transformers: `5.3.0`
- CUDA: `True`
- GPU: `Tesla T4` (15.56 GiB)

## Reading Guide

- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.
- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.
- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.

## nllb-200-distilled-600m zh->en

- Direction: `zh -> en`
- Column: `title_cn`
- Loaded rows: `2048`
- Load time: `6.118 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 2.91 | 2.91 | 343.48 | 315.98 | 616.27 | 2.633 |
| 4 | 256 | 64 | 8.44 | 2.11 | 474.17 | 474.75 | 722.95 | 2.659 |
| 8 | 256 | 32 | 14.85 | 1.86 | 538.68 | 564.97 | 728.47 | 2.699 |
| 16 | 256 | 16 | 27.28 | 1.7 | 586.59 | 633.16 | 769.18 | 2.765 |
| 32 | 256 | 8 | 38.6 | 1.21 | 829.04 | 761.74 | 1369.88 | 2.987 |
| 64 | 256 | 4 | 58.3 | 0.91 | 1097.73 | 919.35 | 1659.9 | 3.347 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 4.17 | 4.17 | 239.99 | 226.34 | 373.27 | 3.347 |
| 2 | 32 | 32 | 4.1 | 4.1 | 477.99 | 459.36 | 703.96 | 3.347 |
| 4 | 32 | 32 | 4.1 | 4.1 | 910.74 | 884.71 | 1227.01 | 3.347 |
| 8 | 32 | 32 | 4.04 | 4.04 | 1697.73 | 1818.48 | 2383.8 | 3.347 |
| 16 | 32 | 32 | 4.07 | 4.07 | 2801.91 | 3473.63 | 4145.92 | 3.347 |
| 64 | 32 | 32 | 4.04 | 4.04 | 3714.49 | 3610.08 | 7337.3 | 3.347 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 3.94 | 3.94 | 253.96 | 231.8 | 378.76 | 3.347 |
| 1 | 2 | 32 | 32 | 3.92 | 3.92 | 500.35 | 480.18 | 747.3 | 3.347 |
| 1 | 4 | 32 | 32 | 4.05 | 4.05 | 922.53 | 894.57 | 1235.7 | 3.347 |
| 1 | 8 | 32 | 32 | 4.08 | 4.08 | 1679.71 | 1806.95 | 2346.11 | 3.347 |
| 1 | 16 | 32 | 32 | 4.05 | 4.05 | 2816.51 | 3485.68 | 4181.39 | 3.347 |
| 1 | 64 | 32 | 32 | 4.06 | 4.06 | 3711.9 | 3614.34 | 7319.52 | 3.347 |
| 4 | 1 | 128 | 32 | 9.12 | 2.28 | 438.42 | 386.12 | 724.47 | 3.347 |
| 4 | 2 | 128 | 32 | 9.2 | 2.3 | 852.37 | 756.47 | 1366.7 | 3.347 |
| 4 | 4 | 128 | 32 | 9.17 | 2.29 | 1642.9 | 1524.56 | 2662.49 | 3.347 |
| 4 | 8 | 128 | 32 | 9.18 | 2.29 | 2974.7 | 3168.19 | 4594.99 | 3.347 |
| 4 | 16 | 128 | 32 | 9.23 | 2.31 | 4897.43 | 5815.91 | 8210.23 | 3.347 |
| 8 | 1 | 256 | 32 | 14.73 | 1.84 | 543.03 | 556.97 | 734.67 | 3.347 |
| 8 | 2 | 256 | 32 | 14.73 | 1.84 | 1064.16 | 1132.78 | 1405.53 | 3.347 |
| 8 | 4 | 256 | 32 | 14.79 | 1.85 | 2035.46 | 2237.69 | 2595.53 | 3.347 |
| 8 | 8 | 256 | 32 | 14.76 | 1.84 | 3763.14 | 4345.6 | 5025.02 | 3.347 |
| 8 | 16 | 256 | 32 | 14.77 | 1.85 | 6383.08 | 8053.36 | 9380.87 | 3.347 |
| 16 | 1 | 512 | 32 | 24.88 | 1.55 | 643.11 | 661.91 | 814.26 | 3.347 |
| 16 | 2 | 512 | 32 | 24.91 | 1.56 | 1265.17 | 1326.85 | 1576.96 | 3.347 |
| 16 | 4 | 512 | 32 | 24.94 | 1.56 | 2446.55 | 2594.26 | 3027.37 | 3.347 |
| 16 | 8 | 512 | 32 | 24.8 | 1.55 | 4548.92 | 5035.56 | 5852.87 | 3.347 |
| 32 | 1 | 1024 | 32 | 34.9 | 1.09 | 916.78 | 823.68 | 1637.59 | 3.347 |
| 32 | 2 | 1024 | 32 | 34.98 | 1.09 | 1807.39 | 1717.89 | 2514.27 | 3.347 |
| 32 | 4 | 1024 | 32 | 34.85 | 1.09 | 3513.42 | 3779.99 | 4750.34 | 3.347 |
| 64 | 1 | 2048 | 32 | 53.24 | 0.83 | 1202.07 | 993.59 | 1838.72 | 3.642 |
| 64 | 2 | 2048 | 32 | 53.95 | 0.84 | 2344.92 | 2465.53 | 3562.04 | 3.642 |

## nllb-200-distilled-600m en->zh

- Direction: `en -> zh`
- Column: `title`
- Loaded rows: `2048`
- Load time: `6.137 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 1.91 | 1.91 | 524.91 | 497.81 | 866.33 | 2.636 |
| 4 | 256 | 64 | 4.94 | 1.23 | 809.89 | 743.87 | 1599.74 | 2.684 |
| 8 | 256 | 32 | 8.25 | 1.03 | 969.5 | 796.64 | 1632.29 | 2.723 |
| 16 | 256 | 16 | 13.52 | 0.85 | 1183.29 | 1169.28 | 1649.65 | 2.817 |
| 32 | 256 | 8 | 21.27 | 0.66 | 1504.54 | 1647.92 | 1827.16 | 3.007 |
| 64 | 256 | 4 | 32.64 | 0.51 | 1961.02 | 1985.8 | 2031.25 | 3.408 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 2.16 | 2.16 | 463.18 | 439.54 | 670.78 | 3.408 |
| 2 | 32 | 32 | 2.15 | 2.15 | 920.48 | 908.27 | 1213.3 | 3.408 |
| 4 | 32 | 32 | 2.16 | 2.16 | 1759.87 | 1771.58 | 2158.04 | 3.408 |
| 8 | 32 | 32 | 2.15 | 2.15 | 3284.44 | 3658.45 | 3971.01 | 3.408 |
| 16 | 32 | 32 | 2.14 | 2.14 | 5669.15 | 7117.7 | 7522.48 | 3.408 |
| 64 | 32 | 32 | 2.14 | 2.14 | 7631.14 | 7510.97 | 14139.03 | 3.408 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 2.17 | 2.17 | 461.18 | 439.28 | 673.59 | 3.408 |
| 1 | 2 | 32 | 32 | 2.15 | 2.15 | 919.12 | 904.83 | 1209.83 | 3.408 |
| 1 | 4 | 32 | 32 | 2.14 | 2.14 | 1771.34 | 1810.99 | 2171.71 | 3.408 |
| 1 | 8 | 32 | 32 | 2.12 | 2.12 | 3324.85 | 3704.57 | 3999.29 | 3.408 |
| 1 | 16 | 32 | 32 | 2.15 | 2.15 | 5651.09 | 7121.8 | 7495.67 | 3.408 |
| 1 | 64 | 32 | 32 | 2.15 | 2.15 | 7619.16 | 7505.14 | 14111.53 | 3.408 |
| 4 | 1 | 128 | 32 | 5.13 | 1.28 | 780.0 | 661.74 | 1586.9 | 3.408 |
| 4 | 2 | 128 | 32 | 5.09 | 1.27 | 1551.81 | 1368.81 | 2775.92 | 3.408 |
| 4 | 4 | 128 | 32 | 5.13 | 1.28 | 2995.57 | 2836.77 | 4532.67 | 3.408 |
| 4 | 8 | 128 | 32 | 5.16 | 1.29 | 5619.47 | 6046.81 | 7767.22 | 3.408 |
| 4 | 16 | 128 | 32 | 5.15 | 1.29 | 9546.3 | 12146.68 | 14537.41 | 3.408 |
| 8 | 1 | 256 | 32 | 8.36 | 1.04 | 957.45 | 784.85 | 1591.01 | 3.408 |
| 8 | 2 | 256 | 32 | 8.29 | 1.04 | 1906.56 | 1847.53 | 2799.1 | 3.408 |
| 8 | 4 | 256 | 32 | 8.3 | 1.04 | 3691.59 | 3773.67 | 4849.58 | 3.408 |
| 8 | 8 | 256 | 32 | 8.29 | 1.04 | 6954.22 | 7681.7 | 8904.0 | 3.408 |
| 8 | 16 | 256 | 32 | 8.31 | 1.04 | 11923.83 | 14903.17 | 17072.19 | 3.408 |
| 16 | 1 | 512 | 32 | 13.83 | 0.86 | 1156.84 | 904.4 | 1609.17 | 3.408 |
| 16 | 2 | 512 | 32 | 13.91 | 0.87 | 2276.35 | 2356.15 | 3121.45 | 3.408 |
| 16 | 4 | 512 | 32 | 13.89 | 0.87 | 4440.77 | 4733.34 | 5488.65 | 3.408 |
| 16 | 8 | 512 | 32 | 13.83 | 0.86 | 8428.17 | 9407.5 | 10409.59 | 3.408 |
| 32 | 1 | 1024 | 32 | 23.28 | 0.73 | 1374.39 | 1603.35 | 1806.99 | 3.408 |
| 32 | 2 | 1024 | 32 | 23.26 | 0.73 | 2721.28 | 2607.91 | 3437.08 | 3.408 |
| 32 | 4 | 1024 | 32 | 23.18 | 0.72 | 5297.48 | 5234.53 | 6758.76 | 3.408 |
| 64 | 1 | 2048 | 32 | 34.97 | 0.55 | 1829.91 | 1899.54 | 2039.18 | 3.69 |
| 64 | 2 | 2048 | 32 | 34.78 | 0.54 | 3615.7 | 3795.36 | 4014.19 | 3.69 |

## opus-mt-zh-en zh->en

- Direction: `zh -> en`
- Column: `title_cn`
- Loaded rows: `2048`
- Load time: `3.2561 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 6.15 | 6.15 | 162.53 | 145.96 | 274.74 | 0.285 |
| 4 | 256 | 64 | 15.34 | 3.83 | 260.77 | 231.11 | 356.0 | 0.306 |
| 8 | 256 | 32 | 25.51 | 3.19 | 313.61 | 271.09 | 379.84 | 0.324 |
| 16 | 256 | 16 | 41.44 | 2.59 | 386.06 | 286.4 | 797.93 | 0.366 |
| 32 | 256 | 8 | 54.36 | 1.7 | 588.7 | 330.13 | 1693.31 | 0.461 |
| 64 | 256 | 4 | 70.15 | 1.1 | 912.33 | 447.93 | 2161.59 | 0.642 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 9.21 | 9.21 | 108.53 | 91.7 | 179.12 | 0.642 |
| 2 | 32 | 32 | 8.92 | 8.92 | 219.19 | 212.29 | 305.34 | 0.642 |
| 4 | 32 | 32 | 9.09 | 9.09 | 411.76 | 420.08 | 583.97 | 0.642 |
| 8 | 32 | 32 | 8.85 | 8.85 | 784.14 | 835.73 | 1043.06 | 0.642 |
| 16 | 32 | 32 | 9.01 | 9.01 | 1278.4 | 1483.34 | 1994.56 | 0.642 |
| 64 | 32 | 32 | 8.82 | 8.82 | 1687.08 | 1563.48 | 3381.58 | 0.642 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 9.01 | 9.01 | 110.96 | 97.37 | 191.91 | 0.642 |
| 1 | 2 | 32 | 32 | 9.2 | 9.2 | 212.43 | 191.27 | 299.65 | 0.642 |
| 1 | 4 | 32 | 32 | 8.86 | 8.86 | 423.8 | 423.61 | 596.78 | 0.642 |
| 1 | 8 | 32 | 32 | 9.04 | 9.04 | 764.39 | 810.05 | 1035.74 | 0.642 |
| 1 | 16 | 32 | 32 | 9.06 | 9.06 | 1272.52 | 1468.35 | 1998.99 | 0.642 |
| 1 | 64 | 32 | 32 | 9.04 | 9.04 | 1663.7 | 1556.13 | 3296.1 | 0.642 |
| 4 | 1 | 128 | 32 | 15.17 | 3.79 | 263.61 | 201.11 | 369.93 | 0.642 |
| 4 | 2 | 128 | 32 | 15.25 | 3.81 | 517.25 | 397.13 | 1319.45 | 0.642 |
| 4 | 4 | 128 | 32 | 15.27 | 3.82 | 899.61 | 746.76 | 1914.27 | 0.642 |
| 4 | 8 | 128 | 32 | 15.39 | 3.85 | 1539.87 | 1466.41 | 2918.85 | 0.642 |
| 4 | 16 | 128 | 32 | 15.48 | 3.87 | 2455.89 | 2728.17 | 4644.99 | 0.642 |
| 8 | 1 | 256 | 32 | 25.44 | 3.18 | 314.41 | 272.16 | 386.05 | 0.642 |
| 8 | 2 | 256 | 32 | 25.42 | 3.18 | 620.41 | 541.88 | 1418.79 | 0.642 |
| 8 | 4 | 256 | 32 | 24.96 | 3.12 | 1226.44 | 1070.41 | 2940.56 | 0.642 |
| 8 | 8 | 256 | 32 | 25.44 | 3.18 | 2252.81 | 2148.42 | 4143.66 | 0.642 |
| 8 | 16 | 256 | 32 | 25.44 | 3.18 | 3961.06 | 5028.31 | 6340.79 | 0.642 |
| 16 | 1 | 512 | 32 | 31.66 | 1.98 | 505.38 | 321.69 | 1963.58 | 0.653 |
| 16 | 2 | 512 | 32 | 31.41 | 1.96 | 1009.12 | 677.84 | 2341.01 | 0.653 |
| 16 | 4 | 512 | 32 | 31.17 | 1.95 | 2006.54 | 1562.68 | 4299.85 | 0.653 |
| 16 | 8 | 512 | 32 | 31.51 | 1.97 | 3483.96 | 4085.09 | 5902.32 | 0.653 |
| 32 | 1 | 1024 | 32 | 38.66 | 1.21 | 827.7 | 409.78 | 2162.48 | 0.748 |
| 32 | 2 | 1024 | 32 | 38.75 | 1.21 | 1613.28 | 916.1 | 3228.1 | 0.748 |
| 32 | 4 | 1024 | 32 | 38.66 | 1.21 | 3162.05 | 3239.36 | 4992.07 | 0.748 |
| 64 | 1 | 2048 | 32 | 52.44 | 0.82 | 1220.35 | 533.4 | 2508.12 | 0.939 |
| 64 | 2 | 2048 | 32 | 52.04 | 0.81 | 2443.4 | 2800.35 | 4765.45 | 0.939 |

## opus-mt-en-zh en->zh

- Direction: `en -> zh`
- Column: `title`
- Loaded rows: `2048`
- Load time: `3.1612 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 4.53 | 4.53 | 220.59 | 177.37 | 411.57 | 0.285 |
| 4 | 256 | 64 | 10.12 | 2.53 | 395.37 | 319.23 | 761.49 | 0.307 |
| 8 | 256 | 32 | 14.63 | 1.83 | 546.88 | 390.81 | 1930.85 | 0.326 |
| 16 | 256 | 16 | 24.33 | 1.52 | 657.6 | 451.22 | 2098.54 | 0.37 |
| 32 | 256 | 8 | 33.91 | 1.06 | 943.59 | 603.45 | 2152.28 | 0.462 |
| 64 | 256 | 4 | 42.47 | 0.66 | 1507.01 | 1531.43 | 2371.85 | 0.644 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 3.6 | 3.6 | 277.73 | 145.85 | 1180.37 | 0.644 |
| 2 | 32 | 32 | 3.55 | 3.55 | 559.38 | 346.71 | 1916.96 | 0.644 |
| 4 | 32 | 32 | 3.53 | 3.53 | 997.71 | 721.04 | 2944.17 | 0.644 |
| 8 | 32 | 32 | 3.51 | 3.51 | 1644.28 | 1590.93 | 3632.99 | 0.644 |
| 16 | 32 | 32 | 3.5 | 3.5 | 2600.18 | 2586.34 | 5554.04 | 0.644 |
| 64 | 32 | 32 | 3.52 | 3.52 | 3366.52 | 2780.0 | 7950.41 | 0.644 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 3.54 | 3.54 | 282.22 | 148.26 | 1200.93 | 0.644 |
| 1 | 2 | 32 | 32 | 3.52 | 3.52 | 563.85 | 346.8 | 1957.63 | 0.644 |
| 1 | 4 | 32 | 32 | 3.52 | 3.52 | 1002.79 | 728.36 | 2949.99 | 0.644 |
| 1 | 8 | 32 | 32 | 3.55 | 3.55 | 1611.52 | 1462.84 | 3653.34 | 0.644 |
| 1 | 16 | 32 | 32 | 3.49 | 3.49 | 2582.77 | 2594.49 | 5504.07 | 0.644 |
| 1 | 64 | 32 | 32 | 3.53 | 3.53 | 3327.65 | 2753.11 | 7907.54 | 0.644 |
| 4 | 1 | 128 | 32 | 10.31 | 2.58 | 387.75 | 329.42 | 675.07 | 0.644 |
| 4 | 2 | 128 | 32 | 10.12 | 2.53 | 784.49 | 713.61 | 1561.27 | 0.644 |
| 4 | 4 | 128 | 32 | 10.08 | 2.52 | 1543.42 | 1438.42 | 2735.85 | 0.644 |
| 4 | 8 | 128 | 32 | 10.09 | 2.52 | 2918.06 | 2954.39 | 4327.47 | 0.644 |
| 4 | 16 | 128 | 32 | 10.16 | 2.54 | 5087.62 | 5734.85 | 7343.34 | 0.644 |
| 8 | 1 | 256 | 32 | 14.49 | 1.81 | 552.16 | 402.18 | 1963.8 | 0.644 |
| 8 | 2 | 256 | 32 | 14.54 | 1.82 | 1088.96 | 845.04 | 2565.11 | 0.644 |
| 8 | 4 | 256 | 32 | 14.48 | 1.81 | 2143.18 | 1832.69 | 4487.97 | 0.644 |
| 8 | 8 | 256 | 32 | 14.51 | 1.81 | 4051.04 | 3734.16 | 5976.18 | 0.644 |
| 8 | 16 | 256 | 32 | 14.55 | 1.82 | 6626.6 | 6868.91 | 9395.9 | 0.644 |
| 16 | 1 | 512 | 32 | 19.11 | 1.19 | 837.36 | 457.51 | 2030.2 | 0.661 |
| 16 | 2 | 512 | 32 | 19.26 | 1.2 | 1599.87 | 1141.06 | 2802.44 | 0.661 |
| 16 | 4 | 512 | 32 | 19.1 | 1.19 | 3130.81 | 3166.45 | 4987.49 | 0.661 |
| 16 | 8 | 512 | 32 | 19.27 | 1.2 | 5735.13 | 5351.82 | 8417.85 | 0.661 |
| 32 | 1 | 1024 | 32 | 26.04 | 0.81 | 1228.66 | 648.06 | 2167.39 | 0.751 |
| 32 | 2 | 1024 | 32 | 26.21 | 0.82 | 2372.52 | 2593.74 | 4206.96 | 0.751 |
| 32 | 4 | 1024 | 32 | 26.11 | 0.82 | 4499.96 | 3977.65 | 6920.57 | 0.751 |
| 64 | 1 | 2048 | 32 | 34.94 | 0.55 | 1831.48 | 2366.24 | 2473.74 | 0.942 |
| 64 | 2 | 2048 | 32 | 34.75 | 0.54 | 3603.99 | 3281.6 | 4948.72 | 0.942 |
scripts/benchmark_translation_local_models.py
| @@ -4,6 +4,7 @@ | @@ -4,6 +4,7 @@ | ||
| 4 | from __future__ import annotations | 4 | from __future__ import annotations |
| 5 | 5 | ||
| 6 | import argparse | 6 | import argparse |
| 7 | +import concurrent.futures | ||
| 7 | import copy | 8 | import copy |
| 8 | import csv | 9 | import csv |
| 9 | import json | 10 | import json |
| @@ -16,7 +17,7 @@ import sys | @@ -16,7 +17,7 @@ import sys | ||
| 16 | import time | 17 | import time |
| 17 | from datetime import datetime | 18 | from datetime import datetime |
| 18 | from pathlib import Path | 19 | from pathlib import Path |
| 19 | -from typing import Any, Dict, Iterable, List | 20 | +from typing import Any, Dict, Iterable, List, Sequence |
| 20 | 21 | ||
| 21 | import torch | 22 | import torch |
| 22 | import transformers | 23 | import transformers |
| @@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402 | @@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402 | ||
| 30 | from translation.settings import get_translation_capability # noqa: E402 | 31 | from translation.settings import get_translation_capability # noqa: E402 |
| 31 | 32 | ||
| 32 | 33 | ||
| 34 | +DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64] | ||
| 35 | +DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64] | ||
| 36 | + | ||
| 33 | SCENARIOS: List[Dict[str, str]] = [ | 37 | SCENARIOS: List[Dict[str, str]] = [ |
| 34 | { | 38 | { |
| 35 | "name": "nllb-200-distilled-600m zh->en", | 39 | "name": "nllb-200-distilled-600m zh->en", |
| @@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [ | @@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [ | ||
| 69 | def parse_args() -> argparse.Namespace: | 73 | def parse_args() -> argparse.Namespace: |
| 70 | parser = argparse.ArgumentParser(description="Benchmark local translation models") | 74 | parser = argparse.ArgumentParser(description="Benchmark local translation models") |
| 71 | parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path") | 75 | parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path") |
| 72 | - parser.add_argument("--limit", type=int, default=0, help="Limit rows for faster experiments; 0 means all") | 76 | + parser.add_argument("--limit", type=int, default=0, help="Limit rows for baseline or single-case run; 0 means all") |
| 73 | parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports") | 77 | parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports") |
| 74 | parser.add_argument("--single", action="store_true", help="Run a single scenario in-process") | 78 | parser.add_argument("--single", action="store_true", help="Run a single scenario in-process") |
| 75 | parser.add_argument("--model", default="", help="Model name for --single mode") | 79 | parser.add_argument("--model", default="", help="Model name for --single mode") |
| @@ -84,9 +88,67 @@ def parse_args() -> argparse.Namespace: | @@ -84,9 +88,67 @@ def parse_args() -> argparse.Namespace: | ||
| 84 | parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams") | 88 | parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams") |
| 85 | parser.add_argument("--attn-implementation", default="", help="Override attention implementation, for example sdpa") | 89 | parser.add_argument("--attn-implementation", default="", help="Override attention implementation, for example sdpa") |
| 86 | parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring") | 90 | parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring") |
| 91 | + parser.add_argument("--disable-cache", action="store_true", help="Disable translation cache during benchmarks") | ||
| 92 | + parser.add_argument( | ||
| 93 | + "--suite", | ||
| 94 | + choices=["baseline", "extended"], | ||
| 95 | + default="baseline", | ||
| 96 | + help="baseline keeps the previous all-scenarios summary; extended adds batch/concurrency/matrix sweeps", | ||
| 97 | + ) | ||
| 98 | + parser.add_argument( | ||
| 99 | + "--batch-size-list", | ||
| 100 | + default="", | ||
| 101 | + help="Comma-separated batch sizes for extended suite; default 1,4,8,16,32,64", | ||
| 102 | + ) | ||
| 103 | + parser.add_argument( | ||
| 104 | + "--concurrency-list", | ||
| 105 | + default="", | ||
| 106 | + help="Comma-separated concurrency levels for extended suite; default 1,2,4,8,16,64", | ||
| 107 | + ) | ||
| 108 | + parser.add_argument( | ||
| 109 | + "--serial-items-per-case", | ||
| 110 | + type=int, | ||
| 111 | + default=512, | ||
| 112 | + help="Items per batch-size case in extended suite", | ||
| 113 | + ) | ||
| 114 | + parser.add_argument( | ||
| 115 | + "--concurrency-requests-per-case", | ||
| 116 | + type=int, | ||
| 117 | + default=128, | ||
| 118 | + help="Requests per concurrency or matrix case in extended suite", | ||
| 119 | + ) | ||
| 120 | + parser.add_argument( | ||
| 121 | + "--concurrency-batch-size", | ||
| 122 | + type=int, | ||
| 123 | + default=1, | ||
| 124 | + help="Batch size used by the dedicated concurrency sweep", | ||
| 125 | + ) | ||
| 126 | + parser.add_argument( | ||
| 127 | + "--max-batch-concurrency-product", | ||
| 128 | + type=int, | ||
| 129 | + default=128, | ||
| 130 | + help="Skip matrix cases where batch_size * concurrency exceeds this value; 0 disables the limit", | ||
| 131 | + ) | ||
| 87 | return parser.parse_args() | 132 | return parser.parse_args() |
| 88 | 133 | ||
| 89 | 134 | ||
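To sanity-check the new extended-suite flags in isolation, the snippet below rebuilds only the arguments added in this diff on a throwaway parser. It is a standalone sketch, not the script's actual `parse_args()` (which carries many more options):

```python
import argparse

# Subset parser containing only the flags introduced by this change.
parser = argparse.ArgumentParser()
parser.add_argument("--suite", choices=["baseline", "extended"], default="baseline")
parser.add_argument("--batch-size-list", default="")
parser.add_argument("--concurrency-list", default="")
parser.add_argument("--serial-items-per-case", type=int, default=512)
parser.add_argument("--concurrency-requests-per-case", type=int, default=128)
parser.add_argument("--concurrency-batch-size", type=int, default=1)
parser.add_argument("--max-batch-concurrency-product", type=int, default=128)
parser.add_argument("--disable-cache", action="store_true")

args = parser.parse_args(["--suite", "extended", "--batch-size-list", "1,4,8"])
print(args.suite, args.serial_items_per_case, args.disable_cache)
```

Unsupplied flags keep the defaults documented in the help strings, so an `extended` run without extra options sweeps the built-in batch and concurrency lists.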
| 135 | +def parse_csv_ints(raw: str, fallback: Sequence[int]) -> List[int]: | ||
| 136 | + if not raw.strip(): | ||
| 137 | + return list(fallback) | ||
| 138 | + values: List[int] = [] | ||
| 139 | + for item in raw.split(","): | ||
| 140 | + stripped = item.strip() | ||
| 141 | + if not stripped: | ||
| 142 | + continue | ||
| 143 | + value = int(stripped) | ||
| 144 | + if value <= 0: | ||
| 145 | + raise ValueError(f"Expected positive integer, got {value}") | ||
| 146 | + values.append(value) | ||
| 147 | + if not values: | ||
| 148 | + raise ValueError("Parsed empty integer list") | ||
| 149 | + return values | ||
| 150 | + | ||
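The contract of `parse_csv_ints` can be exercised on its own; the body below is copied verbatim from the diff so the examples run standalone:

```python
from typing import List, Sequence

def parse_csv_ints(raw: str, fallback: Sequence[int]) -> List[int]:
    # Empty/whitespace input falls back to the defaults, blank items
    # between commas are skipped, and non-positive values are rejected.
    if not raw.strip():
        return list(fallback)
    values: List[int] = []
    for item in raw.split(","):
        stripped = item.strip()
        if not stripped:
            continue
        value = int(stripped)
        if value <= 0:
            raise ValueError(f"Expected positive integer, got {value}")
        values.append(value)
    if not values:
        raise ValueError("Parsed empty integer list")
    return values

print(parse_csv_ints("", [1, 4, 8]))      # falls back to the default list
print(parse_csv_ints(" 4, 8,,16 ", [1]))  # tolerates spaces and blank items
```

Note that an input of only commas (e.g. `","`) is non-empty after `strip()` but yields no values, so it raises rather than silently falling back.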
| 151 | + | ||
| 90 | def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: | 152 | def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: |
| 91 | texts: List[str] = [] | 153 | texts: List[str] = [] |
| 92 | with csv_path.open("r", encoding="utf-8") as handle: | 154 | with csv_path.open("r", encoding="utf-8") as handle: |
| @@ -102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: | @@ -102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: | ||
| 102 | return texts | 164 | return texts |
| 103 | 165 | ||
| 104 | 166 | ||
| 105 | -def batched(values: List[str], batch_size: int) -> Iterable[List[str]]: | 167 | +def batched(values: Sequence[str], batch_size: int) -> Iterable[List[str]]: |
| 106 | for start in range(0, len(values), batch_size): | 168 | for start in range(0, len(values), batch_size): |
| 107 | - yield values[start:start + batch_size] | 169 | + yield list(values[start:start + batch_size]) |
| 108 | 170 | ||
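The signature change on `batched` (from `List[str]` to `Sequence[str]`, with the slice wrapped in `list()`) means tuples and other sequences now work, and every yielded chunk is an independent list rather than a slice view of the caller's type. A standalone copy illustrates:

```python
from typing import Iterable, List, Sequence

def batched(values: Sequence[str], batch_size: int) -> Iterable[List[str]]:
    # Same as the updated helper: works for any Sequence, always yields lists.
    for start in range(0, len(values), batch_size):
        yield list(values[start:start + batch_size])

# A tuple input would previously have produced tuple chunks; now each chunk
# is a plain list, including the short trailing remainder.
chunks = list(batched(("a", "b", "c", "d", "e"), 2))
print(chunks)  # [['a', 'b'], ['c', 'd'], ['e']]
```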
| 109 | 171 | ||
| 110 | def percentile(values: List[float], p: float) -> float: | 172 | def percentile(values: List[float], p: float) -> float: |
| @@ -148,15 +210,34 @@ def build_environment_info() -> Dict[str, Any]: | @@ -148,15 +210,34 @@ def build_environment_info() -> Dict[str, Any]: | ||
| 148 | } | 210 | } |
| 149 | 211 | ||
| 150 | 212 | ||
| 151 | -def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 152 | - csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path) | 213 | +def scenario_from_args(args: argparse.Namespace) -> Dict[str, str]: |
| 214 | + return { | ||
| 215 | + "name": f"{args.model} {args.source_lang}->{args.target_lang}", | ||
| 216 | + "model": args.model, | ||
| 217 | + "source_lang": args.source_lang, | ||
| 218 | + "target_lang": args.target_lang, | ||
| 219 | + "column": args.column, | ||
| 220 | + "scene": args.scene, | ||
| 221 | + } | ||
| 222 | + | ||
| 223 | + | ||
| 224 | +def build_config_and_capability( | ||
| 225 | + args: argparse.Namespace, | ||
| 226 | + *, | ||
| 227 | + batch_size_override: int | None = None, | ||
| 228 | +) -> tuple[Dict[str, Any], Dict[str, Any]]: | ||
| 153 | config = copy.deepcopy(get_translation_config()) | 229 | config = copy.deepcopy(get_translation_config()) |
| 230 | + for name, cfg in config["capabilities"].items(): | ||
| 231 | + cfg["enabled"] = name == args.model | ||
| 232 | + config["default_model"] = args.model | ||
| 154 | capability = get_translation_capability(config, args.model, require_enabled=False) | 233 | capability = get_translation_capability(config, args.model, require_enabled=False) |
| 155 | if args.device_override: | 234 | if args.device_override: |
| 156 | capability["device"] = args.device_override | 235 | capability["device"] = args.device_override |
| 157 | if args.torch_dtype_override: | 236 | if args.torch_dtype_override: |
| 158 | capability["torch_dtype"] = args.torch_dtype_override | 237 | capability["torch_dtype"] = args.torch_dtype_override |
| 159 | - if args.batch_size: | 238 | + if batch_size_override is not None: |
| 239 | + capability["batch_size"] = batch_size_override | ||
| 240 | + elif args.batch_size: | ||
| 160 | capability["batch_size"] = args.batch_size | 241 | capability["batch_size"] = args.batch_size |
| 161 | if args.max_new_tokens: | 242 | if args.max_new_tokens: |
| 162 | capability["max_new_tokens"] = args.max_new_tokens | 243 | capability["max_new_tokens"] = args.max_new_tokens |
| @@ -164,28 +245,59 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: | @@ -164,28 +245,59 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 164 | capability["num_beams"] = args.num_beams | 245 | capability["num_beams"] = args.num_beams |
| 165 | if args.attn_implementation: | 246 | if args.attn_implementation: |
| 166 | capability["attn_implementation"] = args.attn_implementation | 247 | capability["attn_implementation"] = args.attn_implementation |
| 248 | + if args.disable_cache: | ||
| 249 | + capability["use_cache"] = False | ||
| 167 | config["capabilities"][args.model] = capability | 250 | config["capabilities"][args.model] = capability |
| 168 | - configured_batch_size = int(capability.get("batch_size") or 1) | ||
| 169 | - batch_size = configured_batch_size | ||
| 170 | - texts = load_texts(csv_path, args.column, args.limit) | 251 | + return config, capability |
| 171 | 252 | ||
| 172 | - service = TranslationService(config) | 253 | + |
| 254 | +def ensure_cuda_stats_reset() -> None: | ||
| 173 | if torch.cuda.is_available(): | 255 | if torch.cuda.is_available(): |
| 174 | torch.cuda.empty_cache() | 256 | torch.cuda.empty_cache() |
| 175 | torch.cuda.reset_peak_memory_stats() | 257 | torch.cuda.reset_peak_memory_stats() |
| 176 | 258 | ||
| 177 | - load_start = time.perf_counter() | ||
| 178 | - backend = service.get_backend(args.model) | ||
| 179 | - load_seconds = time.perf_counter() - load_start | ||
| 180 | 259 | ||
| 181 | - warmup_batches = min(max(args.warmup_batches, 0), max(1, math.ceil(len(texts) / batch_size))) | ||
| 182 | - for batch in list(batched(texts, batch_size))[:warmup_batches]: | 260 | +def build_memory_metrics() -> Dict[str, Any]: |
| 261 | + peak_gpu_mem_gb = None | ||
| 262 | + peak_gpu_reserved_gb = None | ||
| 263 | + if torch.cuda.is_available(): | ||
| 264 | + peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3) | ||
| 265 | + peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3) | ||
| 266 | + max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2) | ||
| 267 | + return { | ||
| 268 | + "max_rss_mb": max_rss_mb, | ||
| 269 | + "peak_gpu_memory_gb": peak_gpu_mem_gb, | ||
| 270 | + "peak_gpu_reserved_gb": peak_gpu_reserved_gb, | ||
| 271 | + } | ||
| 272 | + | ||
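One portability caveat on the `max_rss_mb` calculation above: `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS, so the fixed `/ 1024` conversion is Linux-specific. A small platform-aware sketch (the `divisor` choice here is an assumption about where the benchmark runs, not part of the diff):

```python
import resource
import sys

# ru_maxrss units differ by platform: kilobytes on Linux, bytes on macOS.
raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
max_rss_mb = round(raw / divisor, 2)
print(max_rss_mb)
```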
| 273 | + | ||
| 274 | +def make_request_payload(batch: Sequence[str]) -> str | List[str]: | ||
| 275 | + if len(batch) == 1: | ||
| 276 | + return batch[0] | ||
| 277 | + return list(batch) | ||
| 278 | + | ||
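`make_request_payload` exists so that `batch_size=1` cases send a bare string, exercising the translation service's scalar request path (the one online single-request traffic hits), while larger batches stay lists. A standalone copy:

```python
from typing import List, Sequence, Union

def make_request_payload(batch: Sequence[str]) -> Union[str, List[str]]:
    # Single-item batches are unwrapped to a plain string; larger
    # batches are materialized as lists.
    if len(batch) == 1:
        return batch[0]
    return list(batch)

print(make_request_payload(["hello"]))           # 'hello'
print(make_request_payload(["hello", "world"]))  # ['hello', 'world']
```

This is also why `benchmark_serial_case` has to normalize the service's return value back into a list: a scalar request yields a scalar response.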
| 279 | + | ||
| 280 | +def benchmark_serial_case( | ||
| 281 | + *, | ||
| 282 | + service: TranslationService, | ||
| 283 | + backend: Any, | ||
| 284 | + scenario: Dict[str, str], | ||
| 285 | + capability: Dict[str, Any], | ||
| 286 | + texts: List[str], | ||
| 287 | + batch_size: int, | ||
| 288 | + warmup_batches: int, | ||
| 289 | +) -> Dict[str, Any]: | ||
| 290 | + backend.batch_size = batch_size | ||
| 291 | + measured_batches = list(batched(texts, batch_size)) | ||
| 292 | + warmup_count = min(max(warmup_batches, 0), len(measured_batches)) | ||
| 293 | + | ||
| 294 | + for batch in measured_batches[:warmup_count]: | ||
| 183 | service.translate( | 295 | service.translate( |
| 184 | - text=batch, | ||
| 185 | - source_lang=args.source_lang, | ||
| 186 | - target_lang=args.target_lang, | ||
| 187 | - model=args.model, | ||
| 188 | - scene=args.scene, | 296 | + text=make_request_payload(batch), |
| 297 | + source_lang=scenario["source_lang"], | ||
| 298 | + target_lang=scenario["target_lang"], | ||
| 299 | + model=scenario["model"], | ||
| 300 | + scene=scenario["scene"], | ||
| 189 | ) | 301 | ) |
| 190 | 302 | ||
| 191 | batch_latencies_ms: List[float] = [] | 303 | batch_latencies_ms: List[float] = [] |
| @@ -193,86 +305,318 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: | @@ -193,86 +305,318 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 193 | failure_count = 0 | 305 | failure_count = 0 |
| 194 | output_chars = 0 | 306 | output_chars = 0 |
| 195 | total_input_chars = sum(len(text) for text in texts) | 307 | total_input_chars = sum(len(text) for text in texts) |
| 196 | - measured_batches = list(batched(texts, batch_size)) | ||
| 197 | 308 | ||
| 198 | start = time.perf_counter() | 309 | start = time.perf_counter() |
| 199 | for batch in measured_batches: | 310 | for batch in measured_batches: |
| 200 | batch_start = time.perf_counter() | 311 | batch_start = time.perf_counter() |
| 201 | outputs = service.translate( | 312 | outputs = service.translate( |
| 202 | - text=batch, | ||
| 203 | - source_lang=args.source_lang, | ||
| 204 | - target_lang=args.target_lang, | ||
| 205 | - model=args.model, | ||
| 206 | - scene=args.scene, | 313 | + text=make_request_payload(batch), |
| 314 | + source_lang=scenario["source_lang"], | ||
| 315 | + target_lang=scenario["target_lang"], | ||
| 316 | + model=scenario["model"], | ||
| 317 | + scene=scenario["scene"], | ||
| 207 | ) | 318 | ) |
| 208 | elapsed_ms = (time.perf_counter() - batch_start) * 1000 | 319 | elapsed_ms = (time.perf_counter() - batch_start) * 1000 |
| 209 | batch_latencies_ms.append(elapsed_ms) | 320 | batch_latencies_ms.append(elapsed_ms) |
| 210 | 321 | ||
| 211 | - if not isinstance(outputs, list): | ||
| 212 | - raise RuntimeError(f"Expected list output for batch translation, got {type(outputs)!r}") | ||
| 213 | - for item in outputs: | 322 | + if isinstance(outputs, list): |
| 323 | + result_items = outputs | ||
| 324 | + else: | ||
| 325 | + result_items = [outputs] | ||
| 326 | + for item in result_items: | ||
| 214 | if item is None: | 327 | if item is None: |
| 215 | failure_count += 1 | 328 | failure_count += 1 |
| 216 | else: | 329 | else: |
| 217 | success_count += 1 | 330 | success_count += 1 |
| 218 | output_chars += len(item) | 331 | output_chars += len(item) |
| 219 | translate_seconds = time.perf_counter() - start | 332 | translate_seconds = time.perf_counter() - start |
| 333 | + total_items = len(texts) | ||
| 334 | + memory = build_memory_metrics() | ||
| 220 | 335 | ||
| 221 | - peak_gpu_mem_gb = None | ||
| 222 | - peak_gpu_reserved_gb = None | ||
| 223 | - if torch.cuda.is_available(): | ||
| 224 | - peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3) | ||
| 225 | - peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3) | 336 | + return { |
| 337 | + "mode": "serial_batch", | ||
| 338 | + "batch_size": batch_size, | ||
| 339 | + "concurrency": 1, | ||
| 340 | + "rows": total_items, | ||
| 341 | + "requests": len(measured_batches), | ||
| 342 | + "input_chars": total_input_chars, | ||
| 343 | + "load_seconds": 0.0, | ||
| 344 | + "translate_seconds": round(translate_seconds, 4), | ||
| 345 | + "total_seconds": round(translate_seconds, 4), | ||
| 346 | + "batch_count": len(batch_latencies_ms), | ||
| 347 | + "request_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2), | ||
| 348 | + "request_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2), | ||
| 349 | + "request_latency_max_ms": round(max(batch_latencies_ms), 2), | ||
| 350 | + "avg_request_latency_ms": round(statistics.fmean(batch_latencies_ms), 2), | ||
| 351 | + "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3), | ||
| 352 | + "requests_per_second": round(len(measured_batches) / translate_seconds, 2), | ||
| 353 | + "items_per_second": round(total_items / translate_seconds, 2), | ||
| 354 | + "input_chars_per_second": round(total_input_chars / translate_seconds, 2), | ||
| 355 | + "output_chars_per_second": round(output_chars / translate_seconds, 2), | ||
| 356 | + "success_count": success_count, | ||
| 357 | + "failure_count": failure_count, | ||
| 358 | + "success_rate": round(success_count / total_items, 6), | ||
| 359 | + "device": str(getattr(backend, "device", capability.get("device", "unknown"))), | ||
| 360 | + "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))), | ||
| 361 | + "configured_batch_size": int(capability.get("batch_size") or batch_size), | ||
| 362 | + "used_batch_size": batch_size, | ||
| 363 | + "warmup_batches": warmup_count, | ||
| 364 | + **memory, | ||
| 365 | + } | ||
| 226 | 366 | ||
| 227 | - max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2) | ||
| 228 | - total_items = len(texts) | 367 | + |
| 368 | +def benchmark_concurrency_case( | ||
| 369 | + *, | ||
| 370 | + service: TranslationService, | ||
| 371 | + backend: Any, | ||
| 372 | + scenario: Dict[str, str], | ||
| 373 | + capability: Dict[str, Any], | ||
| 374 | + texts: List[str], | ||
| 375 | + batch_size: int, | ||
| 376 | + concurrency: int, | ||
| 377 | + requests_per_case: int, | ||
| 378 | + warmup_batches: int, | ||
| 379 | +) -> Dict[str, Any]: | ||
| 380 | + backend.batch_size = batch_size | ||
| 381 | + required_items = batch_size * requests_per_case | ||
| 382 | + case_texts = texts[:required_items] | ||
| 383 | + request_batches = list(batched(case_texts, batch_size)) | ||
| 384 | + if not request_batches: | ||
| 385 | + raise ValueError("No request batches prepared for concurrency benchmark") | ||
| 386 | + warmup_count = min(max(warmup_batches, 0), len(request_batches)) | ||
| 387 | + | ||
| 388 | + for batch in request_batches[:warmup_count]: | ||
| 389 | + service.translate( | ||
| 390 | + text=make_request_payload(batch), | ||
| 391 | + source_lang=scenario["source_lang"], | ||
| 392 | + target_lang=scenario["target_lang"], | ||
| 393 | + model=scenario["model"], | ||
| 394 | + scene=scenario["scene"], | ||
| 395 | + ) | ||
| 396 | + | ||
| 397 | + request_latencies_ms: List[float] = [] | ||
| 398 | + success_count = 0 | ||
| 399 | + failure_count = 0 | ||
| 400 | + output_chars = 0 | ||
| 401 | + total_input_chars = sum(len(text) for text in case_texts) | ||
| 402 | + | ||
| 403 | + def worker(batch: List[str]) -> tuple[float, int, int, int]: | ||
| 404 | + started = time.perf_counter() | ||
| 405 | + outputs = service.translate( | ||
| 406 | + text=make_request_payload(batch), | ||
| 407 | + source_lang=scenario["source_lang"], | ||
| 408 | + target_lang=scenario["target_lang"], | ||
| 409 | + model=scenario["model"], | ||
| 410 | + scene=scenario["scene"], | ||
| 411 | + ) | ||
| 412 | + elapsed_ms = (time.perf_counter() - started) * 1000 | ||
| 413 | + if isinstance(outputs, list): | ||
| 414 | + result_items = outputs | ||
| 415 | + else: | ||
| 416 | + result_items = [outputs] | ||
| 417 | + local_success = 0 | ||
| 418 | + local_failure = 0 | ||
| 419 | + local_output_chars = 0 | ||
| 420 | + for item in result_items: | ||
| 421 | + if item is None: | ||
| 422 | + local_failure += 1 | ||
| 423 | + else: | ||
| 424 | + local_success += 1 | ||
| 425 | + local_output_chars += len(item) | ||
| 426 | + return elapsed_ms, local_success, local_failure, local_output_chars | ||
| 427 | + | ||
| 428 | + wall_start = time.perf_counter() | ||
| 429 | + with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor: | ||
| 430 | + futures = [executor.submit(worker, batch) for batch in request_batches] | ||
| 431 | + for future in concurrent.futures.as_completed(futures): | ||
| 432 | + latency_ms, local_success, local_failure, local_output_chars = future.result() | ||
| 433 | + request_latencies_ms.append(latency_ms) | ||
| 434 | + success_count += local_success | ||
| 435 | + failure_count += local_failure | ||
| 436 | + output_chars += local_output_chars | ||
| 437 | + wall_seconds = time.perf_counter() - wall_start | ||
| 438 | + total_items = len(case_texts) | ||
| 439 | + memory = build_memory_metrics() | ||
| 229 | 440 | ||
| 230 | return { | 441 | return { |
| 231 | - "scenario": { | ||
| 232 | - "name": f"{args.model} {args.source_lang}->{args.target_lang}", | ||
| 233 | - "model": args.model, | ||
| 234 | - "source_lang": args.source_lang, | ||
| 235 | - "target_lang": args.target_lang, | ||
| 236 | - "column": args.column, | ||
| 237 | - "scene": args.scene, | 442 | + "mode": "concurrency", |
| 443 | + "batch_size": batch_size, | ||
| 444 | + "concurrency": concurrency, | ||
| 445 | + "rows": total_items, | ||
| 446 | + "requests": len(request_batches), | ||
| 447 | + "input_chars": total_input_chars, | ||
| 448 | + "load_seconds": 0.0, | ||
| 449 | + "translate_seconds": round(wall_seconds, 4), | ||
| 450 | + "total_seconds": round(wall_seconds, 4), | ||
| 451 | + "batch_count": len(request_latencies_ms), | ||
| 452 | + "request_latency_p50_ms": round(percentile(request_latencies_ms, 0.50), 2), | ||
| 453 | + "request_latency_p95_ms": round(percentile(request_latencies_ms, 0.95), 2), | ||
| 454 | + "request_latency_max_ms": round(max(request_latencies_ms), 2), | ||
| 455 | + "avg_request_latency_ms": round(statistics.fmean(request_latencies_ms), 2), | ||
| 456 | + "avg_item_latency_ms": round((wall_seconds / total_items) * 1000, 3), | ||
| 457 | + "requests_per_second": round(len(request_batches) / wall_seconds, 2), | ||
| 458 | + "items_per_second": round(total_items / wall_seconds, 2), | ||
| 459 | + "input_chars_per_second": round(total_input_chars / wall_seconds, 2), | ||
| 460 | + "output_chars_per_second": round(output_chars / wall_seconds, 2), | ||
| 461 | + "success_count": success_count, | ||
| 462 | + "failure_count": failure_count, | ||
| 463 | + "success_rate": round(success_count / total_items, 6), | ||
| 464 | + "device": str(getattr(backend, "device", capability.get("device", "unknown"))), | ||
| 465 | + "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))), | ||
| 466 | + "configured_batch_size": int(capability.get("batch_size") or batch_size), | ||
| 467 | + "used_batch_size": batch_size, | ||
| 468 | + "warmup_batches": warmup_count, | ||
| 469 | + **memory, | ||
| 470 | + } | ||
| 471 | + | ||
| 472 | + | ||
| 473 | +def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 474 | + csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path) | ||
| 475 | + scenario = scenario_from_args(args) | ||
| 476 | + config, capability = build_config_and_capability(args) | ||
| 477 | + configured_batch_size = int(capability.get("batch_size") or 1) | ||
| 478 | + batch_size = configured_batch_size | ||
| 479 | + texts = load_texts(csv_path, args.column, args.limit) | ||
| 480 | + | ||
| 481 | + ensure_cuda_stats_reset() | ||
| 482 | + load_start = time.perf_counter() | ||
| 483 | + service = TranslationService(config) | ||
| 484 | + backend = service.get_backend(args.model) | ||
| 485 | + load_seconds = time.perf_counter() - load_start | ||
| 486 | + | ||
| 487 | + runtime = benchmark_serial_case( | ||
| 488 | + service=service, | ||
| 489 | + backend=backend, | ||
| 490 | + scenario=scenario, | ||
| 491 | + capability=capability, | ||
| 492 | + texts=texts, | ||
| 493 | + batch_size=batch_size, | ||
| 494 | + warmup_batches=args.warmup_batches, | ||
| 495 | + ) | ||
| 496 | + runtime["load_seconds"] = round(load_seconds, 4) | ||
| 497 | + runtime["total_seconds"] = round(runtime["load_seconds"] + runtime["translate_seconds"], 4) | ||
| 498 | + | ||
| 499 | + return { | ||
| 500 | + "scenario": scenario, | ||
| 501 | + "dataset": { | ||
| 502 | + "csv_path": str(csv_path), | ||
| 503 | + "rows": len(texts), | ||
| 504 | + "input_chars": sum(len(text) for text in texts), | ||
| 238 | }, | 505 | }, |
| 506 | + "runtime": runtime, | ||
| 507 | + } | ||
| 508 | + | ||
| 509 | + | ||
| 510 | +def benchmark_extended_scenario(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 511 | + csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path) | ||
| 512 | + scenario = scenario_from_args(args) | ||
| 513 | + batch_sizes = parse_csv_ints(args.batch_size_list, DEFAULT_BATCH_SIZES) | ||
| 514 | + concurrencies = parse_csv_ints(args.concurrency_list, DEFAULT_CONCURRENCIES) | ||
| 515 | + largest_batch = max(batch_sizes + [args.concurrency_batch_size]) | ||
| 517 | + max_product = args.max_batch_concurrency_product | ||
| 518 | + required_items = max( | ||
| 519 | + args.limit or 0, | ||
| 520 | + max(args.serial_items_per_case, largest_batch), | ||
| 521 | + args.concurrency_requests_per_case * args.concurrency_batch_size, | ||
| 522 | + largest_batch * args.concurrency_requests_per_case, | ||
| 523 | + ) | ||
| 524 | + texts = load_texts(csv_path, args.column, required_items) | ||
| 525 | + config, capability = build_config_and_capability(args) | ||
| 526 | + | ||
| 527 | + ensure_cuda_stats_reset() | ||
| 528 | + load_start = time.perf_counter() | ||
| 529 | + service = TranslationService(config) | ||
| 530 | + backend = service.get_backend(args.model) | ||
| 531 | + load_seconds = time.perf_counter() - load_start | ||
| 532 | + | ||
| 533 | + batch_sweep: List[Dict[str, Any]] = [] | ||
| 534 | + concurrency_sweep: List[Dict[str, Any]] = [] | ||
| 535 | + matrix_results: List[Dict[str, Any]] = [] | ||
| 536 | + | ||
| 537 | + for batch_size in batch_sizes: | ||
| 538 | + case_texts = texts[: max(batch_size, args.serial_items_per_case)] | ||
| 539 | + batch_sweep.append( | ||
| 540 | + benchmark_serial_case( | ||
| 541 | + service=service, | ||
| 542 | + backend=backend, | ||
| 543 | + scenario=scenario, | ||
| 544 | + capability=capability, | ||
| 545 | + texts=case_texts, | ||
| 546 | + batch_size=batch_size, | ||
| 547 | + warmup_batches=args.warmup_batches, | ||
| 548 | + ) | ||
| 549 | + ) | ||
| 550 | + | ||
| 551 | + for concurrency in concurrencies: | ||
| 552 | + concurrency_sweep.append( | ||
| 553 | + benchmark_concurrency_case( | ||
| 554 | + service=service, | ||
| 555 | + backend=backend, | ||
| 556 | + scenario=scenario, | ||
| 557 | + capability=capability, | ||
| 558 | + texts=texts, | ||
| 559 | + batch_size=args.concurrency_batch_size, | ||
| 560 | + concurrency=concurrency, | ||
| 561 | + requests_per_case=args.concurrency_requests_per_case, | ||
| 562 | + warmup_batches=args.warmup_batches, | ||
| 563 | + ) | ||
| 564 | + ) | ||
| 565 | + | ||
| 566 | + for batch_size in batch_sizes: | ||
| 567 | + for concurrency in concurrencies: | ||
| 568 | + if max_product > 0 and batch_size * concurrency > max_product: | ||
| 569 | + continue | ||
| 570 | + matrix_results.append( | ||
| 571 | + benchmark_concurrency_case( | ||
| 572 | + service=service, | ||
| 573 | + backend=backend, | ||
| 574 | + scenario=scenario, | ||
| 575 | + capability=capability, | ||
| 576 | + texts=texts, | ||
| 577 | + batch_size=batch_size, | ||
| 578 | + concurrency=concurrency, | ||
| 579 | + requests_per_case=args.concurrency_requests_per_case, | ||
| 580 | + warmup_batches=args.warmup_batches, | ||
| 581 | + ) | ||
| 582 | + ) | ||
| 583 | + | ||
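With the documented defaults (batch sizes `1,4,8,16,32,64`, concurrency levels `1,2,4,8,16,64`, product cap `128`), the skip rule in the matrix loop keeps 25 of the 36 combinations. The filter is equivalent to:

```python
batch_sizes = [1, 4, 8, 16, 32, 64]    # default --batch-size-list
concurrencies = [1, 2, 4, 8, 16, 64]   # default --concurrency-list
cap = 128                              # default --max-batch-concurrency-product

# Mirror of the matrix skip rule; a cap of 0 disables the limit entirely.
cases = [(b, c) for b in batch_sizes for c in concurrencies
         if cap <= 0 or b * c <= cap]
print(len(cases))  # 25
```

So the heaviest cases that still run are the ones on the `b * c == 128` diagonal, e.g. `(64, 2)`, `(32, 4)`, and `(16, 8)`.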
| 584 | + # Model load time is reported once under runtime_defaults; per-case | ||
| 585 | + # load_seconds stays 0.0 so per-table totals are not triple-counted. | ||
| 588 | + | ||
| 589 | + return { | ||
| 590 | + "scenario": scenario, | ||
| 239 | "dataset": { | 591 | "dataset": { |
| 240 | "csv_path": str(csv_path), | 592 | "csv_path": str(csv_path), |
| 241 | - "rows": total_items, | ||
| 242 | - "input_chars": total_input_chars, | 593 | + "rows_loaded": len(texts), |
| 594 | + }, | ||
| 595 | + "config": { | ||
| 596 | + "batch_sizes": batch_sizes, | ||
| 597 | + "concurrencies": concurrencies, | ||
| 598 | + "serial_items_per_case": args.serial_items_per_case, | ||
| 599 | + "concurrency_requests_per_case": args.concurrency_requests_per_case, | ||
| 600 | + "concurrency_batch_size": args.concurrency_batch_size, | ||
| 601 | + "max_batch_concurrency_product": max_product, | ||
| 602 | + "cache_disabled": bool(args.disable_cache), | ||
| 243 | }, | 603 | }, |
| 244 | - "runtime": { | 604 | + "runtime_defaults": { |
| 245 | "device": str(getattr(backend, "device", capability.get("device", "unknown"))), | 605 | "device": str(getattr(backend, "device", capability.get("device", "unknown"))), |
| 246 | "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))), | 606 | "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))), |
| 247 | - "configured_batch_size": configured_batch_size, | ||
| 248 | - "used_batch_size": batch_size, | ||
| 249 | - "warmup_batches": warmup_batches, | 607 | + "configured_batch_size": int(capability.get("batch_size") or 1), |
| 250 | "load_seconds": round(load_seconds, 4), | 608 | "load_seconds": round(load_seconds, 4), |
| 251 | - "translate_seconds": round(translate_seconds, 4), | ||
| 252 | - "total_seconds": round(load_seconds + translate_seconds, 4), | ||
| 253 | - "batch_count": len(batch_latencies_ms), | ||
| 254 | - "first_batch_ms": round(batch_latencies_ms[0], 2), | ||
| 255 | - "batch_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2), | ||
| 256 | - "batch_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2), | ||
| 257 | - "batch_latency_max_ms": round(max(batch_latencies_ms), 2), | ||
| 258 | - "avg_batch_latency_ms": round(statistics.fmean(batch_latencies_ms), 2), | ||
| 259 | - "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3), | ||
| 260 | - "items_per_second": round(total_items / translate_seconds, 2), | ||
| 261 | - "input_chars_per_second": round(total_input_chars / translate_seconds, 2), | ||
| 262 | - "output_chars_per_second": round(output_chars / translate_seconds, 2), | ||
| 263 | - "success_count": success_count, | ||
| 264 | - "failure_count": failure_count, | ||
| 265 | - "success_rate": round(success_count / total_items, 6), | ||
| 266 | - "max_rss_mb": max_rss_mb, | ||
| 267 | - "peak_gpu_memory_gb": peak_gpu_mem_gb, | ||
| 268 | - "peak_gpu_reserved_gb": peak_gpu_reserved_gb, | ||
| 269 | }, | 609 | }, |
| 610 | + "batch_sweep": batch_sweep, | ||
| 611 | + "concurrency_sweep": concurrency_sweep, | ||
| 612 | + "matrix": matrix_results, | ||
| 270 | } | 613 | } |
| 271 | 614 | ||
| 272 | 615 | ||
| 273 | def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: | 616 | def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: |
| 274 | report = { | 617 | report = { |
| 275 | "generated_at": datetime.now().isoformat(timespec="seconds"), | 618 | "generated_at": datetime.now().isoformat(timespec="seconds"), |
| 619 | + "suite": args.suite, | ||
| 276 | "environment": build_environment_info(), | 620 | "environment": build_environment_info(), |
| 277 | "scenarios": [], | 621 | "scenarios": [], |
| 278 | } | 622 | } |
| @@ -296,11 +640,25 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: | @@ -296,11 +640,25 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 296 | scenario["scene"], | 640 | scenario["scene"], |
| 297 | "--warmup-batches", | 641 | "--warmup-batches", |
| 298 | str(args.warmup_batches), | 642 | str(args.warmup_batches), |
| 643 | + "--suite", | ||
| 644 | + args.suite, | ||
| 645 | + "--serial-items-per-case", | ||
| 646 | + str(args.serial_items_per_case), | ||
| 647 | + "--concurrency-requests-per-case", | ||
| 648 | + str(args.concurrency_requests_per_case), | ||
| 649 | + "--concurrency-batch-size", | ||
| 650 | + str(args.concurrency_batch_size), | ||
| 651 | + "--max-batch-concurrency-product", | ||
| 652 | + str(args.max_batch_concurrency_product), | ||
| 299 | ] | 653 | ] |
| 300 | if args.limit: | 654 | if args.limit: |
| 301 | cmd.extend(["--limit", str(args.limit)]) | 655 | cmd.extend(["--limit", str(args.limit)]) |
| 302 | if args.batch_size: | 656 | if args.batch_size: |
| 303 | cmd.extend(["--batch-size", str(args.batch_size)]) | 657 | cmd.extend(["--batch-size", str(args.batch_size)]) |
| 658 | + if args.batch_size_list: | ||
| 659 | + cmd.extend(["--batch-size-list", args.batch_size_list]) | ||
| 660 | + if args.concurrency_list: | ||
| 661 | + cmd.extend(["--concurrency-list", args.concurrency_list]) | ||
| 304 | if args.device_override: | 662 | if args.device_override: |
| 305 | cmd.extend(["--device-override", args.device_override]) | 663 | cmd.extend(["--device-override", args.device_override]) |
| 306 | if args.torch_dtype_override: | 664 | if args.torch_dtype_override: |
| @@ -311,6 +669,8 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: | @@ -311,6 +669,8 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 311 | cmd.extend(["--num-beams", str(args.num_beams)]) | 669 | cmd.extend(["--num-beams", str(args.num_beams)]) |
| 312 | if args.attn_implementation: | 670 | if args.attn_implementation: |
| 313 | cmd.extend(["--attn-implementation", args.attn_implementation]) | 671 | cmd.extend(["--attn-implementation", args.attn_implementation]) |
| 672 | + if args.disable_cache: | ||
| 673 | + cmd.append("--disable-cache") | ||
| 314 | 674 | ||
| 315 | completed = subprocess.run(cmd, capture_output=True, text=True, check=True) | 675 | completed = subprocess.run(cmd, capture_output=True, text=True, check=True) |
| 316 | result_line = "" | 676 | result_line = "" |
| @@ -327,11 +687,12 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: | @@ -327,11 +687,12 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 327 | return report | 687 | return report |
| 328 | 688 | ||
| 329 | 689 | ||
| 330 | -def render_markdown_report(report: Dict[str, Any]) -> str: | 690 | +def render_baseline_markdown_report(report: Dict[str, Any]) -> str: |
| 331 | lines = [ | 691 | lines = [ |
| 332 | "# Local Translation Model Benchmark", | 692 | "# Local Translation Model Benchmark", |
| 333 | "", | 693 | "", |
| 334 | f"- Generated at: `{report['generated_at']}`", | 694 | f"- Generated at: `{report['generated_at']}`", |
| 695 | + f"- Suite: `{report['suite']}`", | ||
| 335 | f"- Python: `{report['environment']['python']}`", | 696 | f"- Python: `{report['environment']['python']}`", |
| 336 | f"- Torch: `{report['environment']['torch']}`", | 697 | f"- Torch: `{report['environment']['torch']}`", |
| 337 | f"- Transformers: `{report['environment']['transformers']}`", | 698 | f"- Transformers: `{report['environment']['transformers']}`", |
| @@ -342,19 +703,19 @@ def render_markdown_report(report: Dict[str, Any]) -> str: | @@ -342,19 +703,19 @@ def render_markdown_report(report: Dict[str, Any]) -> str: | ||
| 342 | lines.extend( | 703 | lines.extend( |
| 343 | [ | 704 | [ |
| 344 | "", | 705 | "", |
| 345 | - "| Scenario | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Load s | Peak GPU GiB | Success |", | 706 | + "| Scenario | Items/s | Avg item ms | Req p50 ms | Req p95 ms | Load s | Peak GPU GiB | Success |", |
| 346 | "|---|---:|---:|---:|---:|---:|---:|---:|", | 707 | "|---|---:|---:|---:|---:|---:|---:|---:|", |
| 347 | ] | 708 | ] |
| 348 | ) | 709 | ) |
| 349 | for item in report["scenarios"]: | 710 | for item in report["scenarios"]: |
| 350 | runtime = item["runtime"] | 711 | runtime = item["runtime"] |
| 351 | lines.append( | 712 | lines.append( |
| 352 | - "| {name} | {items_per_second} | {avg_item_latency_ms} | {batch_latency_p50_ms} | {batch_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format( | 713 | + "| {name} | {items_per_second} | {avg_item_latency_ms} | {request_latency_p50_ms} | {request_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format( |
| 353 | name=item["scenario"]["name"], | 714 | name=item["scenario"]["name"], |
| 354 | items_per_second=runtime["items_per_second"], | 715 | items_per_second=runtime["items_per_second"], |
| 355 | avg_item_latency_ms=runtime["avg_item_latency_ms"], | 716 | avg_item_latency_ms=runtime["avg_item_latency_ms"], |
| 356 | - batch_latency_p50_ms=runtime["batch_latency_p50_ms"], | ||
| 357 | - batch_latency_p95_ms=runtime["batch_latency_p95_ms"], | 717 | + request_latency_p50_ms=runtime["request_latency_p50_ms"], |
| 718 | + request_latency_p95_ms=runtime["request_latency_p95_ms"], | ||
| 358 | load_seconds=runtime["load_seconds"], | 719 | load_seconds=runtime["load_seconds"], |
| 359 | peak_gpu_memory_gb=runtime["peak_gpu_memory_gb"], | 720 | peak_gpu_memory_gb=runtime["peak_gpu_memory_gb"], |
| 360 | success_rate=runtime["success_rate"], | 721 | success_rate=runtime["success_rate"], |
| @@ -375,7 +736,7 @@ def render_markdown_report(report: Dict[str, Any]) -> str: | @@ -375,7 +736,7 @@ def render_markdown_report(report: Dict[str, Any]) -> str: | ||
| 375 | f"- Load time: `{runtime['load_seconds']} s`", | 736 | f"- Load time: `{runtime['load_seconds']} s`", |
| 376 | f"- Translate time: `{runtime['translate_seconds']} s`", | 737 | f"- Translate time: `{runtime['translate_seconds']} s`", |
| 377 | f"- Throughput: `{runtime['items_per_second']} items/s`, `{runtime['input_chars_per_second']} input chars/s`", | 738 | f"- Throughput: `{runtime['items_per_second']} items/s`, `{runtime['input_chars_per_second']} input chars/s`", |
| 378 | - f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, batch p50 `{runtime['batch_latency_p50_ms']} ms`, batch p95 `{runtime['batch_latency_p95_ms']} ms`, batch max `{runtime['batch_latency_max_ms']} ms`", | 739 | + f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, req p50 `{runtime['request_latency_p50_ms']} ms`, req p95 `{runtime['request_latency_p95_ms']} ms`, req max `{runtime['request_latency_max_ms']} ms`", |
| 379 | f"- Memory: max RSS `{runtime['max_rss_mb']} MB`, peak GPU allocated `{runtime['peak_gpu_memory_gb']} GiB`, peak GPU reserved `{runtime['peak_gpu_reserved_gb']} GiB`", | 740 | f"- Memory: max RSS `{runtime['max_rss_mb']} MB`, peak GPU allocated `{runtime['peak_gpu_memory_gb']} GiB`, peak GPU reserved `{runtime['peak_gpu_reserved_gb']} GiB`", |
| 380 | f"- Success: `{runtime['success_count']}/{dataset['rows']}`", | 741 | f"- Success: `{runtime['success_count']}/{dataset['rows']}`", |
| 381 | "", | 742 | "", |
| @@ -384,32 +745,145 @@ def render_markdown_report(report: Dict[str, Any]) -> str: | @@ -384,32 +745,145 @@ def render_markdown_report(report: Dict[str, Any]) -> str: | ||
| 384 | return "\n".join(lines) | 745 | return "\n".join(lines) |
| 385 | 746 | ||
| 386 | 747 | ||
| 748 | +def render_case_table( | ||
| 749 | + title: str, | ||
| 750 | + rows: Sequence[Dict[str, Any]], | ||
| 751 | + *, | ||
| 752 | + include_batch: bool, | ||
| 753 | + include_concurrency: bool, | ||
| 754 | +) -> List[str]: | ||
| 755 | + headers = ["Rows", "Requests", "Items/s", "Req/s", "Avg req ms", "Req p50 ms", "Req p95 ms", "Peak GPU GiB"] | ||
| 756 | + prefix_headers: List[str] = [] | ||
| 757 | + if include_batch: | ||
| 758 | + prefix_headers.append("Batch") | ||
| 759 | + if include_concurrency: | ||
| 760 | + prefix_headers.append("Concurrency") | ||
| 761 | + headers = prefix_headers + headers | ||
| 762 | + lines = [f"### {title}", ""] | ||
| 763 | + lines.append("| " + " | ".join(headers) + " |") | ||
| 764 | + lines.append("|" + "|".join(["---:"] * len(headers)) + "|") | ||
| 765 | + for item in rows: | ||
| 766 | + values: List[str] = [] | ||
| 767 | + if include_batch: | ||
| 768 | + values.append(str(item["batch_size"])) | ||
| 769 | + if include_concurrency: | ||
| 770 | + values.append(str(item["concurrency"])) | ||
| 771 | + values.extend( | ||
| 772 | + [ | ||
| 773 | + str(item["rows"]), | ||
| 774 | + str(item["requests"]), | ||
| 775 | + str(item["items_per_second"]), | ||
| 776 | + str(item["requests_per_second"]), | ||
| 777 | + str(item["avg_request_latency_ms"]), | ||
| 778 | + str(item["request_latency_p50_ms"]), | ||
| 779 | + str(item["request_latency_p95_ms"]), | ||
| 780 | + str(item["peak_gpu_memory_gb"]), | ||
| 781 | + ] | ||
| 782 | + ) | ||
| 783 | + lines.append("| " + " | ".join(values) + " |") | ||
| 784 | + lines.append("") | ||
| 785 | + return lines | ||
| 786 | + | ||
| 787 | + | ||
| 788 | +def render_extended_markdown_report(report: Dict[str, Any]) -> str: | ||
| 789 | + lines = [ | ||
| 790 | + "# Local Translation Model Extended Benchmark", | ||
| 791 | + "", | ||
| 792 | + f"- Generated at: `{report['generated_at']}`", | ||
| 793 | + f"- Suite: `{report['suite']}`", | ||
| 794 | + f"- Python: `{report['environment']['python']}`", | ||
| 795 | + f"- Torch: `{report['environment']['torch']}`", | ||
| 796 | + f"- Transformers: `{report['environment']['transformers']}`", | ||
| 797 | + f"- CUDA: `{report['environment']['cuda_available']}`", | ||
| 798 | + ] | ||
| 799 | + if report["environment"]["gpu_name"]: | ||
| 800 | + lines.append(f"- GPU: `{report['environment']['gpu_name']}` ({report['environment']['gpu_total_mem_gb']} GiB)") | ||
| 801 | + | ||
| 802 | + lines.extend( | ||
| 803 | + [ | ||
| 804 | + "", | ||
| 805 | + "## Reading Guide", | ||
| 806 | + "", | ||
| 807 | + "- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.", | ||
| 808 | + "- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.", | ||
| 809 | + "- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.", | ||
| 810 | + "", | ||
| 811 | + ] | ||
| 812 | + ) | ||
| 813 | + | ||
| 814 | + for item in report["scenarios"]: | ||
| 815 | + lines.extend( | ||
| 816 | + [ | ||
| 817 | + f"## {item['scenario']['name']}", | ||
| 818 | + "", | ||
| 819 | + f"- Direction: `{item['scenario']['source_lang']} -> {item['scenario']['target_lang']}`", | ||
| 820 | + f"- Column: `{item['scenario']['column']}`", | ||
| 821 | + f"- Loaded rows: `{item['dataset']['rows_loaded']}`", | ||
| 822 | + f"- Load time: `{item['runtime_defaults']['load_seconds']} s`", | ||
| 823 | + f"- Device: `{item['runtime_defaults']['device']}`", | ||
| 824 | + f"- DType: `{item['runtime_defaults']['torch_dtype']}`", | ||
| 825 | + f"- Cache disabled: `{item['config']['cache_disabled']}`", | ||
| 826 | + "", | ||
| 827 | + ] | ||
| 828 | + ) | ||
| 829 | + lines.extend(render_case_table("Batch Sweep (`concurrency=1`)", item["batch_sweep"], include_batch=True, include_concurrency=False)) | ||
| 830 | + lines.extend( | ||
| 831 | + render_case_table( | ||
| 832 | + f"Concurrency Sweep (`batch_size={item['config']['concurrency_batch_size']}`)", | ||
| 833 | + item["concurrency_sweep"], | ||
| 834 | + include_batch=False, | ||
| 835 | + include_concurrency=True, | ||
| 836 | + ) | ||
| 837 | + ) | ||
| 838 | + lines.extend(render_case_table("Batch x Concurrency Matrix", item["matrix"], include_batch=True, include_concurrency=True)) | ||
| 839 | + return "\n".join(lines) | ||
| 840 | + | ||
| 841 | + | ||
| 842 | +def render_markdown_report(report: Dict[str, Any]) -> str: | ||
| 843 | + if report["suite"] == "extended": | ||
| 844 | + return render_extended_markdown_report(report) | ||
| 845 | + return render_baseline_markdown_report(report) | ||
| 846 | + | ||
| 847 | + | ||
| 387 | def main() -> None: | 848 | def main() -> None: |
| 388 | args = parse_args() | 849 | args = parse_args() |
| 389 | if args.single: | 850 | if args.single: |
| 390 | - result = benchmark_single_scenario(args) | 851 | + if args.suite == "extended": |
| 852 | + result = benchmark_extended_scenario(args) | ||
| 853 | + else: | ||
| 854 | + result = benchmark_single_scenario(args) | ||
| 391 | print("JSON_RESULT=" + json.dumps(result, ensure_ascii=False)) | 855 | print("JSON_RESULT=" + json.dumps(result, ensure_ascii=False)) |
| 392 | return | 856 | return |
| 393 | 857 | ||
| 394 | report = run_all_scenarios(args) | 858 | report = run_all_scenarios(args) |
| 395 | output_dir = resolve_output_dir(args.output_dir) | 859 | output_dir = resolve_output_dir(args.output_dir) |
| 396 | timestamp = datetime.now().strftime("%H%M%S") | 860 | timestamp = datetime.now().strftime("%H%M%S") |
| 397 | - json_path = output_dir / f"translation_local_models_{timestamp}.json" | ||
| 398 | - md_path = output_dir / f"translation_local_models_{timestamp}.md" | 861 | + suffix = "extended" if args.suite == "extended" else "baseline" |
| 862 | + json_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.json" | ||
| 863 | + md_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.md" | ||
| 399 | json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8") | 864 | json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8") |
| 400 | md_path.write_text(render_markdown_report(report), encoding="utf-8") | 865 | md_path.write_text(render_markdown_report(report), encoding="utf-8") |
| 401 | 866 | ||
| 402 | print(f"JSON report: {json_path}") | 867 | print(f"JSON report: {json_path}") |
| 403 | print(f"Markdown report: {md_path}") | 868 | print(f"Markdown report: {md_path}") |
| 404 | for item in report["scenarios"]: | 869 | for item in report["scenarios"]: |
| 405 | - runtime = item["runtime"] | ||
| 406 | - print( | ||
| 407 | - f"{item['scenario']['name']}: " | ||
| 408 | - f"{runtime['items_per_second']} items/s | " | ||
| 409 | - f"avg_item={runtime['avg_item_latency_ms']} ms | " | ||
| 410 | - f"p95_batch={runtime['batch_latency_p95_ms']} ms | " | ||
| 411 | - f"load={runtime['load_seconds']} s" | ||
| 412 | - ) | 870 | + if args.suite == "extended": |
| 871 | + best_batch = max(item["batch_sweep"], key=lambda x: x["items_per_second"]) | ||
| 872 | + best_concurrency = max(item["concurrency_sweep"], key=lambda x: x["items_per_second"]) | ||
| 873 | + print( | ||
| 874 | + f"{item['scenario']['name']}: " | ||
| 875 | + f"best_batch={best_batch['batch_size']} ({best_batch['items_per_second']} items/s) | " | ||
| 876 | + f"best_concurrency={best_concurrency['concurrency']} ({best_concurrency['items_per_second']} items/s @ batch={best_concurrency['batch_size']})" | ||
| 877 | + ) | ||
| 878 | + else: | ||
| 879 | + runtime = item["runtime"] | ||
| 880 | + print( | ||
| 881 | + f"{item['scenario']['name']}: " | ||
| 882 | + f"{runtime['items_per_second']} items/s | " | ||
| 883 | + f"avg_item={runtime['avg_item_latency_ms']} ms | " | ||
| 884 | + f"p95_req={runtime['request_latency_p95_ms']} ms | " | ||
| 885 | + f"load={runtime['load_seconds']} s" | ||
| 886 | + ) | ||
| 413 | 887 | ||
| 414 | 888 | ||
| 415 | if __name__ == "__main__": | 889 | if __name__ == "__main__": |
translation/README.md
| @@ -13,7 +13,7 @@ | @@ -13,7 +13,7 @@ | ||
| 13 | - Virtual environment: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh) | 13 | - Virtual environment: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh) |
| 14 | - Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py) | 14 | - Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py) |
| 15 | - Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) | 15 | - Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) |
| 16 | -- Performance report: [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md) | 16 | +- Performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md) |
| 17 | 17 | ||
| 18 | ## 1. Design Goals | 18 | ## 1. Design Goals |
| 19 | 19 | ||
| @@ -530,6 +530,44 @@ curl -X POST http://127.0.0.1:6006/translate \ | @@ -530,6 +530,44 @@ curl -X POST http://127.0.0.1:6006/translate \ | ||
| 530 | Dataset: | 530 | Dataset: |
| 531 | - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv) | 531 | - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv) |
| 532 | 532 | ||
| 533 | +Latest reports: | ||
| 534 | +- Summary: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md) | ||
| 535 | +- Full Markdown: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md) | ||
| 536 | +- Full JSON: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json) | ||
| 537 | + | ||
| 538 | +### 10.1 Which Results to Read First | ||
| 539 | + | ||
| 540 | +The three kinds of results are now shown separately instead of mixed into one table: | ||
| 541 | + | ||
| 542 | +- `batch_sweep` | ||
| 543 | +  fixes `concurrency=1` and compares single-stream batch performance across `batch_size` values | ||
| 544 | +- `concurrency_sweep` | ||
| 545 | +  fixes `batch_size=1` and shows single-request latency and throughput under different concurrency levels | ||
| 546 | +- `batch x concurrency matrix` | ||
| 547 | +  looks at the interaction of `batch_size` and `concurrency`; this round is capped at `batch_size * concurrency <= 128` | ||
| 548 | + | ||
| 549 | +Recommendations: | ||
| 550 | + | ||
| 551 | +- For online query-translation latency, look at `concurrency_sweep` first | ||
| 552 | +- For offline bulk-translation throughput, look at `batch_sweep` first | ||
| 553 | +- For single-worker capacity limits, then consult the `batch x concurrency matrix` | ||
| 554 | + | ||
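As a minimal sketch, the three case groups above can be enumerated like this. The variable names are illustrative (not the script's actual internals); only the batch/concurrency lists and the `128` cap mirror the documented parameters.

```python
from itertools import product

# Values documented for this round; names are assumptions for illustration.
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCY = [1, 2, 4, 8, 16, 64]
MATRIX_LIMIT = 128

# batch_sweep: single stream, vary batch size only
batch_sweep = [(b, 1) for b in BATCH_SIZES]

# concurrency_sweep: fixed batch_size=1, vary client concurrency
concurrency_sweep = [(1, c) for c in CONCURRENCY]

# matrix: cross product, filtered by total in-flight items
matrix = [(b, c) for b, c in product(BATCH_SIZES, CONCURRENCY) if b * c <= MATRIX_LIMIT]

print(len(batch_sweep), len(concurrency_sweep), len(matrix))  # 6 6 25
```

The filter is why the matrix has 25 cases rather than the full 36 combinations.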
| 555 | +### 10.2 Parameters for This Round | ||
| 556 | + | ||
| 557 | +Test date: `2026-03-18` | ||
| 558 | + | ||
| 559 | +Environment: | ||
| 560 | +- GPU: `Tesla T4 16GB` | ||
| 561 | +- Python env: `.venv-translator` | ||
| 562 | +- Torch / Transformers: `2.10.0+cu128 / 5.3.0` | ||
| 563 | + | ||
| 564 | +Common parameters: | ||
| 565 | +- Cache: disabled (`--disable-cache`) so cache hits do not skew the numbers | ||
| 566 | +- `batch_sweep`: `256` items per case | ||
| 567 | +- `concurrency_sweep`: fixed `batch_size=1`, `32` requests per case | ||
| 568 | +- `batch x concurrency matrix`: `32` requests per case, keeping only `batch_size * concurrency <= 128` | ||
| 569 | +- Warmup: `1` batch | ||
| 570 | + | ||
| 533 | Reproduction command: | 571 | ||
| 534 | 572 | ||
| 535 | ```bash | 573 | ```bash |
| @@ -537,16 +575,36 @@ cd /data/saas-search | @@ -537,16 +575,36 @@ cd /data/saas-search | ||
| 537 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py | 575 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py |
| 538 | ``` | 576 | ``` |
| 539 | 577 | ||
| 540 | -Single-model reproduction example: | 578 | +Reproduction command for this round's extended benchmark: |
| 579 | + | ||
| 580 | +```bash | ||
| 581 | +cd /data/saas-search | ||
| 582 | +./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ | ||
| 583 | + --suite extended \ | ||
| 584 | + --disable-cache \ | ||
| 585 | + --serial-items-per-case 256 \ | ||
| 586 | + --concurrency-requests-per-case 32 \ | ||
| 587 | + --concurrency-batch-size 1 \ | ||
| 588 | + --output-dir perf_reports/20260318/translation_local_models | ||
| 589 | +``` | ||
| 590 | + | ||
| 591 | +Single-model extended benchmark example: | ||
| 541 | 592 | ||
| 542 | ```bash | 593 | ```bash |
| 543 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ | 594 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ |
| 544 | --single \ | 595 | --single \ |
| 596 | + --suite extended \ | ||
| 545 | --model opus-mt-zh-en \ | 597 | --model opus-mt-zh-en \ |
| 546 | --source-lang zh \ | 598 | --source-lang zh \ |
| 547 | --target-lang en \ | 599 | --target-lang en \ |
| 548 | --column title_cn \ | 600 | --column title_cn \ |
| 549 | - --scene sku_name | 601 | + --scene sku_name \ |
| 602 | + --disable-cache \ | ||
| 603 | + --batch-size-list 1,4,8,16,32,64 \ | ||
| 604 | + --concurrency-list 1,2,4,8,16,64 \ | ||
| 605 | + --serial-items-per-case 256 \ | ||
| 606 | + --concurrency-requests-per-case 32 \ | ||
| 607 | + --concurrency-batch-size 1 | ||
| 550 | ``` | 608 | ``` |
| 551 | 609 | ||
| 552 | Single-request latency reproduction: | 610 | ||
| @@ -554,37 +612,143 @@ cd /data/saas-search | @@ -554,37 +612,143 @@ cd /data/saas-search | ||
| 554 | ```bash | 612 | ```bash |
| 555 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ | 613 | ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ |
| 556 | --single \ | 614 | --single \ |
| 615 | + --suite extended \ | ||
| 557 | --model nllb-200-distilled-600m \ | 616 | --model nllb-200-distilled-600m \ |
| 558 | --source-lang zh \ | 617 | --source-lang zh \ |
| 559 | --target-lang en \ | 618 | --target-lang en \ |
| 560 | --column title_cn \ | 619 | --column title_cn \ |
| 561 | --scene sku_name \ | 620 | --scene sku_name \ |
| 562 | - --batch-size 1 \ | ||
| 563 | - --limit 100 | 621 | + --disable-cache \ |
| 622 | + --batch-size-list 1 \ | ||
| 623 | + --concurrency-list 1,2,4,8,16,64 \ | ||
| 624 | + --serial-items-per-case 256 \ | ||
| 625 | + --concurrency-requests-per-case 32 \ | ||
| 626 | + --concurrency-batch-size 1 | ||
| 564 | ``` | 627 | ``` |
| 565 | 628 | ||
| 566 | -Notes: | ||
| 567 | -- For the current script and local backend, a "single request" is directly equivalent to `batch_size=1` | ||
| 568 | -- In that case the script's `batch_latency_*` metrics can be read as single-request latency | ||
| 569 | -Online search query translation should focus on these numbers rather than large-batch throughput | 629 | +### 10.3 Single-Stream Batch Results |
| 630 | + | ||
| 631 | +This group is `concurrency=1` only; do not read its `request p95` as the p95 of concurrent online requests. | ||
| 632 | + | ||
| 633 | +`nllb-200-distilled-600m zh -> en` | ||
| 634 | + | ||
| 635 | +| Batch | Items/s | Avg item ms | Req p95 ms | | ||
| 636 | +|---:|---:|---:|---:| | ||
| 637 | +| 1 | 2.91 | 343.488 | 616.27 | | ||
| 638 | +| 4 | 8.44 | 118.545 | 722.95 | | ||
| 639 | +| 8 | 14.85 | 67.335 | 728.47 | | ||
| 640 | +| 16 | 27.28 | 36.662 | 769.18 | | ||
| 641 | +| 32 | 38.6 | 25.908 | 1369.88 | | ||
| 642 | +| 64 | 58.3 | 17.152 | 1659.9 | | ||
| 643 | + | ||
| 644 | +`nllb-200-distilled-600m en -> zh` | ||
| 645 | + | ||
| 646 | +| Batch | Items/s | Avg item ms | Req p95 ms | | ||
| 647 | +|---:|---:|---:|---:| | ||
| 648 | +| 1 | 1.91 | 524.917 | 866.33 | | ||
| 649 | +| 4 | 4.94 | 202.473 | 1599.74 | | ||
| 650 | +| 8 | 8.25 | 121.188 | 1632.29 | | ||
| 651 | +| 16 | 13.52 | 73.956 | 1649.65 | | ||
| 652 | +| 32 | 21.27 | 47.017 | 1827.16 | | ||
| 653 | +| 64 | 32.64 | 30.641 | 2031.25 | | ||
| 570 | 654 | ||
| 571 | -Current single-request measurements (`Tesla T4`, `limit=100`): | ||
| 572 | -- `nllb-200-distilled-600m zh->en`: p50 ~`292.54 ms`, p95 ~`624.12 ms`, avg ~`321.91 ms` | ||
| 573 | -- `nllb-200-distilled-600m en->zh`: p50 ~`481.61 ms`, p95 ~`1171.71 ms`, avg ~`542.47 ms` | 655 | +`opus-mt-zh-en zh -> en` |
| 574 | 656 | ||
| 575 | -Current benchmark environment: | ||
| 576 | -- GPU: `Tesla T4 16GB` | ||
| 577 | -- Python env: `.venv-translator` | ||
| 578 | -- Data size: `18,576` product titles | 657 | | Batch | Items/s | Avg item ms | Req p95 ms | |
| 658 | +|---:|---:|---:|---:| | ||
| 659 | +| 1 | 6.15 | 162.536 | 274.74 | | ||
| 660 | +| 4 | 15.34 | 65.192 | 356.0 | | ||
| 661 | +| 8 | 25.51 | 39.202 | 379.84 | | ||
| 662 | +| 16 | 41.44 | 24.129 | 797.93 | | ||
| 663 | +| 32 | 54.36 | 18.397 | 1693.31 | | ||
| 664 | +| 64 | 70.15 | 14.255 | 2161.59 | | ||
| 665 | + | ||
| 666 | +`opus-mt-en-zh en -> zh` | ||
| 667 | + | ||
| 668 | +| Batch | Items/s | Avg item ms | Req p95 ms | | ||
| 669 | +|---:|---:|---:|---:| | ||
| 670 | +| 1 | 4.53 | 220.598 | 411.57 | | ||
| 671 | +| 4 | 10.12 | 98.844 | 761.49 | | ||
| 672 | +| 8 | 14.63 | 68.361 | 1930.85 | | ||
| 673 | +| 16 | 24.33 | 41.1 | 2098.54 | | ||
| 674 | +| 32 | 33.91 | 29.487 | 2152.28 | | ||
| 675 | +| 64 | 42.47 | 23.547 | 2371.85 | | ||
| 676 | + | ||
| 677 | +Batch conclusions: | ||
| 678 | + | ||
| 679 | +- On raw throughput alone, all four directions peak at `batch_size=64` | ||
| 680 | +- If per-batch tail latency also matters, `batch_size=16` is usually the more balanced choice | ||
| 681 | +- `opus-mt-zh-en` was the fastest model in this round's bulk scenario; `nllb en->zh` was the slowest | ||
| 682 | + | ||
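One way to make the "`64` for raw throughput, `16` for balance" tradeoff concrete: pick the highest-throughput batch size whose per-request p95 stays within a latency budget. The rows below are copied from the `nllb-200-distilled-600m zh -> en` table above; the selection rule and budget values are illustrative, not part of the benchmark script.

```python
# (batch_size, items_per_second, request_p95_ms), from the zh -> en sweep above
ROWS = [
    (1, 2.91, 616.27),
    (4, 8.44, 722.95),
    (8, 14.85, 728.47),
    (16, 27.28, 769.18),
    (32, 38.60, 1369.88),
    (64, 58.30, 1659.90),
]

def pick_batch(rows, p95_budget_ms):
    """Highest-throughput batch size whose request p95 stays within budget."""
    ok = [r for r in rows if r[2] <= p95_budget_ms]
    return max(ok, key=lambda r: r[1])[0] if ok else None

print(pick_batch(ROWS, 1000.0))  # 16 -- the "balanced" default
print(pick_batch(ROWS, 5000.0))  # 64 -- the raw-throughput winner
```

With a ~1 s p95 budget the rule lands on `batch_size=16`; with a loose budget it picks `64`, matching the conclusions above.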
| 683 | +### 10.4 Single-Request Concurrency Results | ||
| 684 | + | ||
| 685 | +This group fixes `batch_size=1`, so it reads directly as "single-request behavior under different concurrency levels". | ||
| 686 | + | ||
| 687 | +`nllb-200-distilled-600m zh -> en` | ||
| 688 | + | ||
| 689 | +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms | | ||
| 690 | +|---:|---:|---:|---:|---:| | ||
| 691 | +| 1 | 4.17 | 239.99 | 226.34 | 373.27 | | ||
| 692 | +| 2 | 4.1 | 477.99 | 459.36 | 703.96 | | ||
| 693 | +| 4 | 4.1 | 910.74 | 884.71 | 1227.01 | | ||
| 694 | +| 8 | 4.04 | 1697.73 | 1818.48 | 2383.8 | | ||
| 695 | +| 16 | 4.07 | 2801.91 | 3473.63 | 4145.92 | | ||
| 696 | +| 64 | 4.04 | 3714.49 | 3610.08 | 7337.3 | | ||
| 697 | + | ||
| 698 | +`nllb-200-distilled-600m en -> zh` | ||
| 699 | + | ||
| 700 | +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms | | ||
| 701 | +|---:|---:|---:|---:|---:| | ||
| 702 | +| 1 | 2.16 | 463.18 | 439.54 | 670.78 | | ||
| 703 | +| 2 | 2.15 | 920.48 | 908.27 | 1213.3 | | ||
| 704 | +| 4 | 2.16 | 1759.87 | 1771.58 | 2158.04 | | ||
| 705 | +| 8 | 2.15 | 3284.44 | 3658.45 | 3971.01 | | ||
| 706 | +| 16 | 2.14 | 5669.15 | 7117.7 | 7522.48 | | ||
| 707 | +| 64 | 2.14 | 7631.14 | 7510.97 | 14139.03 | | ||
| 708 | + | ||
| 709 | +`opus-mt-zh-en zh -> en` | ||
| 710 | + | ||
| 711 | +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms | | ||
| 712 | +|---:|---:|---:|---:|---:| | ||
| 713 | +| 1 | 9.21 | 108.53 | 91.7 | 179.12 | | ||
| 714 | +| 2 | 8.92 | 219.19 | 212.29 | 305.34 | | ||
| 715 | +| 4 | 9.09 | 411.76 | 420.08 | 583.97 | | ||
| 716 | +| 8 | 8.85 | 784.14 | 835.73 | 1043.06 | | ||
| 717 | +| 16 | 9.01 | 1278.4 | 1483.34 | 1994.56 | | ||
| 718 | +| 64 | 8.82 | 1687.08 | 1563.48 | 3381.58 | | ||
| 719 | + | ||
| 720 | +`opus-mt-en-zh en -> zh` | ||
| 721 | + | ||
| 722 | +| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms | | ||
| 723 | +|---:|---:|---:|---:|---:| | ||
| 724 | +| 1 | 3.6 | 277.73 | 145.85 | 1180.37 | | ||
| 725 | +| 2 | 3.55 | 559.38 | 346.71 | 1916.96 | | ||
| 726 | +| 4 | 3.53 | 997.71 | 721.04 | 2944.17 | | ||
| 727 | +| 8 | 3.51 | 1644.28 | 1590.93 | 3632.99 | | ||
| 728 | +| 16 | 3.5 | 2600.18 | 2586.34 | 5554.04 | | ||
| 729 | +| 64 | 3.52 | 3366.52 | 2780.0 | 7950.41 | | ||
| 579 | 730 | ||
| 580 | -Final performance results: | 731 | +Concurrency conclusions: |
| 732 | + | ||
| 733 | +- The current local seq2seq backend holds a single model lock, so raising client concurrency on one worker barely improves throughput; it mostly piles queueing time onto request latency | ||
| 734 | +- For stable online query-translation latency, keep concurrency low; beyond `8` concurrent clients, p95 degrades clearly in all four directions | ||
| 735 | +- In the online scenario, `opus-mt-zh-en` has the most stable latency; `nllb en->zh` is the slowest and shows the strongest tail-latency amplification under concurrency | ||
| 736 | + | ||
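The lock-serialized behavior can be approximated with a simple closed-loop model: with `c` clients always in flight against one serialized server, throughput stays pinned near `1/service_time` while per-request latency grows roughly linearly with `c`. This is a rough sketch, not the backend's actual scheduler; the `240 ms` figure is taken from the `nllb zh -> en` `concurrency=1` row above.

```python
def serialized_backend(service_ms: float, concurrency: int) -> tuple:
    """Closed-loop approximation of a single-lock backend: throughput is
    capped at ~1000/service_ms items/s regardless of concurrency, while
    each request's latency grows to roughly concurrency * service_ms."""
    throughput = 1000.0 / service_ms       # items/s, independent of concurrency
    latency_ms = concurrency * service_ms  # queueing behind c-1 other requests
    return throughput, latency_ms

# ~240 ms/request is the nllb zh -> en concurrency=1 average above
for c in (1, 2, 4, 8):
    tput, lat = serialized_backend(240.0, c)
    print(f"c={c}: ~{tput:.2f} items/s, ~{lat:.0f} ms/request")
```

The model tracks the measured table reasonably well up to moderate concurrency (~478 ms at `c=2`, ~911 ms at `c=4`), which is consistent with "concurrency is a latency budget, not a throughput lever".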
| 737 | +### 10.5 How to Read the batch x concurrency Matrix | ||
| 738 | + | ||
| 739 | +The full matrix is in: | ||
| 740 | +- [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md) | ||
| 741 | + | ||
| 742 | +This table mainly answers two questions: | ||
| 581 | 743 | ||
| 582 | -| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | | ||
| 583 | -|---|---|---:|---:|---:|---:|---:|---:|---:|---:| | ||
| 584 | -| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 | | ||
| 585 | -| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 | | ||
| 586 | -| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 | | ||
| 587 | -| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 | | 744 | +- If you already know you are running offline batch jobs: once `batch_size` is raised, does throughput keep rising at different concurrency levels |
| 745 | +- If you are serving requests from a single worker: at which `batch_size x concurrency` combination does queueing become obvious | ||
| 746 | + | ||
| 747 | +Common pattern across this round's matrix: | ||
| 748 | + | ||
| 749 | +- Throughput is determined mainly by `batch_size`; `concurrency` is not a meaningful source of gains | ||
| 750 | +- At a fixed `batch_size`, raising `concurrency` from `1` to `2/4/8/...` barely moves `items/s`, while `avg req ms / p95` keeps climbing | ||
| 751 | +- So the current implementation behaves like a single worker with internally serialized GPU inference, not a service whose throughput scales with client concurrency | ||
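Since one worker serializes on the model lock, a single worker's capacity is roughly `1000 / service_ms` requests per second, and capacity planning reduces to adding workers rather than raising client concurrency. A hypothetical sketch (the 30 QPS target is purely illustrative; `108.53 ms` is the `opus-mt-zh-en` `concurrency=1` average request latency above):

```python
import math

def workers_needed(target_qps: float, service_ms: float) -> int:
    """One worker serializes on the model lock, so its capacity is about
    1000/service_ms requests/s; capacity grows by adding workers, not by
    raising client concurrency on a single worker."""
    per_worker_qps = 1000.0 / service_ms
    return math.ceil(target_qps / per_worker_qps)

print(workers_needed(30.0, 108.53))  # 4
```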
| 588 | 752 | ||
| 589 | NLLB performance tuning notes: | 753 | NLLB performance tuning notes: |
| 590 | 754 | ||
| @@ -632,7 +796,7 @@ NLLB 性能优化经验: | @@ -632,7 +796,7 @@ NLLB 性能优化经验: | ||
| 632 | - Run mode: single worker, to avoid loading the model repeatedly | 796 | - Run mode: single worker, to avoid loading the model repeatedly |
| 633 | 797 | ||
| 634 | For more detailed performance notes, see: | 798 | For more detailed performance notes, see: |
| 635 | -- [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md) | 799 | +- [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md) |
| 636 | 800 | ||
| 637 | ## 11. Development Notes | 801 | ## 11. Development Notes |
| 638 | 802 |