diff --git a/docs/TODO.txt b/docs/TODO.txt index 1a9090d..67751e1 100644 --- a/docs/TODO.txt +++ b/docs/TODO.txt @@ -1,6 +1,32 @@ -product_enrich : Partial Mode + + +nllb-200-distilled-600M performance optimization +Research performance-optimization options for seq2seq, transformer-architecture models such as nllb-200-distilled-600M that raise the online translation service's throughput and cut latency; also survey online inference-serving solutions and find a high-performance way to productionize the model + +cnclip performance optimization + +rerank performance optimization + + +Timeouts +1) Hard timeout while the query-analysis stage waits for translation/embedding +Config file: config/config.yaml +Config key: query_config.async_wait_timeout_ms: 80 +Where it takes effect: query/query_parser.py converts the value to seconds and passes it to wait(...) +2) Embedding HTTP call timeout (Text/Image) +No longer overridden by any environment variable (the previously mentioned EMBEDDING_HTTP_TIMEOUT_SEC is no longer used) +Config file: config/config.yaml +Config key: services.embedding.providers.http.timeout_sec (a sample default of 60 has been added to the YAML) +Where it takes effect: +embeddings/text_encoder.py: requests.post(..., timeout=self.timeout_sec) +embeddings/image_encoder.py: requests.post(..., timeout=self.timeout_sec) + + + + +product_enrich : Partial Mode : done https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR Set the role of the last message in the messages array to assistant, provide the prefix in its content, and set the parameter "partial": true on that message. The messages format is as follows: [ @@ -15,7 +41,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR } ] The model starts generating from the prefix content. - Supports non-thinking mode. @@ -41,12 +66,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR -The translation cache needs to be refactored - - - - - suggest index: currently a full-rebuild script; to be handed over to 金伟 diff --git a/perf_reports/20260318/translation_local_models/README.md b/perf_reports/20260318/translation_local_models/README.md new file mode 100644 index 0000000..cdee75b --- /dev/null +++ b/perf_reports/20260318/translation_local_models/README.md @@ -0,0 +1,101 @@ +# Local Translation Model Benchmark Report + +Benchmark script: +- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) + +Full results: +- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md) +- 
JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json) + +Test date: +- `2026-03-18` + +Environment: +- GPU: `Tesla T4 16GB` +- Driver / CUDA: `570.158.01 / 12.8` +- Python env: `.venv-translator` +- Torch / Transformers: `2.10.0+cu128 / 5.3.0` +- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv) + +## Method + +This round splits the results into 3 categories: + +- `batch_sweep` + fixed `concurrency=1`, comparing `batch_size=1/4/8/16/32/64` +- `concurrency_sweep` + fixed `batch_size=1`, comparing `concurrency=1/2/4/8/16/64` +- `batch x concurrency matrix` + combined load tests, keeping only `batch_size * concurrency <= 128` + +Common settings: +- cache disabled: `--disable-cache` +- `batch_sweep`: `256` items per case +- `concurrency_sweep`: `32` requests per case +- `matrix`: `32` requests per case +- warmup: `1` batch + +Command to reproduce: + +```bash cd /data/saas-search ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ --suite extended \ --disable-cache \ --serial-items-per-case 256 \ --concurrency-requests-per-case 32 \ --concurrency-batch-size 1 \ --output-dir perf_reports/20260318/translation_local_models ``` + +## Key Results + +### 1. Single-stream batch sweep + +| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms | |---|---|---:|---:|---:|---:| | `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` | | `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` | | `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` | | `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` | + +Takeaways: +- On raw throughput, all 4 directions peak at `batch_size=64` +- If more balanced single-batch latency also matters, `batch_size=16` is the stronger default candidate for bulk jobs +- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh` + +### 2. 
Per-request concurrency sweep + +| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms | |---|---|---:|---:|---:|---:| | `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` | | `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` | | `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` | | `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` | + +Takeaways: +- At `batch_size=1`, raising client concurrency barely raises throughput; it mostly converts waiting time into request latency +- Online query translation is best run at low concurrency; beyond `8` concurrent requests, p95 degrades sharply in all 4 directions +- The steadiest option online is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh` + +### 3. batch x concurrency matrix + +Highest-throughput combinations (under the `batch_size * concurrency <= 128` constraint): + +| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms | |---|---|---:|---:|---:|---:|---:| | `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` | | `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` | | `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` | | `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` | + +Takeaways: +- In the current implementation, throughput is determined by `batch_size`, not by `concurrency` +- At the same `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little while clearly inflating request latency +- This suggests the current local translation service behaves like a "single worker + serial GPU processing" model; capacity planning cannot count on client concurrency to buy extra throughput + +## Recommendation + +- For online query translation, read `concurrency_sweep` first and treat `batch_size=1` as the primary metric basis +- For offline bulk translation, read `batch_sweep` first; start from `batch_size=16` by default and move up to `32/64` only if throughput targets require it +- If the current single-worker architecture stays, treat the "allowed concurrency" as a latency-budget question, not as a throughput-scaling lever diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs1.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs1.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs1.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs16.jsonl 
b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs16.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs16.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs32.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs32.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs32.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs4.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs4.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs4.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs64.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs64.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs64.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs8.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs8.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs8.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs1.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs1.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs1.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs16.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs16.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs16.jsonl diff 
--git a/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs32.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs32.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs32.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs4.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs4.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs4.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs64.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs64.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs64.jsonl diff --git a/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs8.jsonl b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs8.jsonl new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs8.jsonl diff --git a/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md b/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md new file mode 100644 index 0000000..ec0a927 --- /dev/null +++ b/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md @@ -0,0 +1,263 @@ +# Local Translation Model Extended Benchmark + +- Generated at: `2026-03-18T21:28:09` +- Suite: `extended` +- Python: `3.12.3` +- Torch: `2.10.0+cu128` +- Transformers: `5.3.0` +- CUDA: `True` +- GPU: `Tesla T4` (15.56 GiB) + +## Reading Guide + +- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes. 
+- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises. +- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured. + +## nllb-200-distilled-600m zh->en + +- Direction: `zh -> en` +- Column: `title_cn` +- Loaded rows: `2048` +- Load time: `6.118 s` +- Device: `cuda` +- DType: `torch.float16` +- Cache disabled: `True` + +### Batch Sweep (`concurrency=1`) + +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 256 | 256 | 2.91 | 2.91 | 343.48 | 315.98 | 616.27 | 2.633 | +| 4 | 256 | 64 | 8.44 | 2.11 | 474.17 | 474.75 | 722.95 | 2.659 | +| 8 | 256 | 32 | 14.85 | 1.86 | 538.68 | 564.97 | 728.47 | 2.699 | +| 16 | 256 | 16 | 27.28 | 1.7 | 586.59 | 633.16 | 769.18 | 2.765 | +| 32 | 256 | 8 | 38.6 | 1.21 | 829.04 | 761.74 | 1369.88 | 2.987 | +| 64 | 256 | 4 | 58.3 | 0.91 | 1097.73 | 919.35 | 1659.9 | 3.347 | + +### Concurrency Sweep (`batch_size=1`) + +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 32 | 32 | 4.17 | 4.17 | 239.99 | 226.34 | 373.27 | 3.347 | +| 2 | 32 | 32 | 4.1 | 4.1 | 477.99 | 459.36 | 703.96 | 3.347 | +| 4 | 32 | 32 | 4.1 | 4.1 | 910.74 | 884.71 | 1227.01 | 3.347 | +| 8 | 32 | 32 | 4.04 | 4.04 | 1697.73 | 1818.48 | 2383.8 | 3.347 | +| 16 | 32 | 32 | 4.07 | 4.07 | 2801.91 | 3473.63 | 4145.92 | 3.347 | +| 64 | 32 | 32 | 4.04 | 4.04 | 3714.49 | 3610.08 | 7337.3 | 3.347 | + +### Batch x Concurrency Matrix + +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 1 | 32 | 32 | 3.94 | 3.94 | 253.96 | 231.8 | 378.76 | 3.347 | +| 1 | 2 | 32 | 32 | 3.92 | 3.92 | 500.35 | 480.18 | 747.3 | 3.347 | +| 1 | 4 | 
32 | 32 | 4.05 | 4.05 | 922.53 | 894.57 | 1235.7 | 3.347 | +| 1 | 8 | 32 | 32 | 4.08 | 4.08 | 1679.71 | 1806.95 | 2346.11 | 3.347 | +| 1 | 16 | 32 | 32 | 4.05 | 4.05 | 2816.51 | 3485.68 | 4181.39 | 3.347 | +| 1 | 64 | 32 | 32 | 4.06 | 4.06 | 3711.9 | 3614.34 | 7319.52 | 3.347 | +| 4 | 1 | 128 | 32 | 9.12 | 2.28 | 438.42 | 386.12 | 724.47 | 3.347 | +| 4 | 2 | 128 | 32 | 9.2 | 2.3 | 852.37 | 756.47 | 1366.7 | 3.347 | +| 4 | 4 | 128 | 32 | 9.17 | 2.29 | 1642.9 | 1524.56 | 2662.49 | 3.347 | +| 4 | 8 | 128 | 32 | 9.18 | 2.29 | 2974.7 | 3168.19 | 4594.99 | 3.347 | +| 4 | 16 | 128 | 32 | 9.23 | 2.31 | 4897.43 | 5815.91 | 8210.23 | 3.347 | +| 8 | 1 | 256 | 32 | 14.73 | 1.84 | 543.03 | 556.97 | 734.67 | 3.347 | +| 8 | 2 | 256 | 32 | 14.73 | 1.84 | 1064.16 | 1132.78 | 1405.53 | 3.347 | +| 8 | 4 | 256 | 32 | 14.79 | 1.85 | 2035.46 | 2237.69 | 2595.53 | 3.347 | +| 8 | 8 | 256 | 32 | 14.76 | 1.84 | 3763.14 | 4345.6 | 5025.02 | 3.347 | +| 8 | 16 | 256 | 32 | 14.77 | 1.85 | 6383.08 | 8053.36 | 9380.87 | 3.347 | +| 16 | 1 | 512 | 32 | 24.88 | 1.55 | 643.11 | 661.91 | 814.26 | 3.347 | +| 16 | 2 | 512 | 32 | 24.91 | 1.56 | 1265.17 | 1326.85 | 1576.96 | 3.347 | +| 16 | 4 | 512 | 32 | 24.94 | 1.56 | 2446.55 | 2594.26 | 3027.37 | 3.347 | +| 16 | 8 | 512 | 32 | 24.8 | 1.55 | 4548.92 | 5035.56 | 5852.87 | 3.347 | +| 32 | 1 | 1024 | 32 | 34.9 | 1.09 | 916.78 | 823.68 | 1637.59 | 3.347 | +| 32 | 2 | 1024 | 32 | 34.98 | 1.09 | 1807.39 | 1717.89 | 2514.27 | 3.347 | +| 32 | 4 | 1024 | 32 | 34.85 | 1.09 | 3513.42 | 3779.99 | 4750.34 | 3.347 | +| 64 | 1 | 2048 | 32 | 53.24 | 0.83 | 1202.07 | 993.59 | 1838.72 | 3.642 | +| 64 | 2 | 2048 | 32 | 53.95 | 0.84 | 2344.92 | 2465.53 | 3562.04 | 3.642 | + +## nllb-200-distilled-600m en->zh + +- Direction: `en -> zh` +- Column: `title` +- Loaded rows: `2048` +- Load time: `6.137 s` +- Device: `cuda` +- DType: `torch.float16` +- Cache disabled: `True` + +### Batch Sweep (`concurrency=1`) + +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 
ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 256 | 256 | 1.91 | 1.91 | 524.91 | 497.81 | 866.33 | 2.636 | +| 4 | 256 | 64 | 4.94 | 1.23 | 809.89 | 743.87 | 1599.74 | 2.684 | +| 8 | 256 | 32 | 8.25 | 1.03 | 969.5 | 796.64 | 1632.29 | 2.723 | +| 16 | 256 | 16 | 13.52 | 0.85 | 1183.29 | 1169.28 | 1649.65 | 2.817 | +| 32 | 256 | 8 | 21.27 | 0.66 | 1504.54 | 1647.92 | 1827.16 | 3.007 | +| 64 | 256 | 4 | 32.64 | 0.51 | 1961.02 | 1985.8 | 2031.25 | 3.408 | + +### Concurrency Sweep (`batch_size=1`) + +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 32 | 32 | 2.16 | 2.16 | 463.18 | 439.54 | 670.78 | 3.408 | +| 2 | 32 | 32 | 2.15 | 2.15 | 920.48 | 908.27 | 1213.3 | 3.408 | +| 4 | 32 | 32 | 2.16 | 2.16 | 1759.87 | 1771.58 | 2158.04 | 3.408 | +| 8 | 32 | 32 | 2.15 | 2.15 | 3284.44 | 3658.45 | 3971.01 | 3.408 | +| 16 | 32 | 32 | 2.14 | 2.14 | 5669.15 | 7117.7 | 7522.48 | 3.408 | +| 64 | 32 | 32 | 2.14 | 2.14 | 7631.14 | 7510.97 | 14139.03 | 3.408 | + +### Batch x Concurrency Matrix + +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 1 | 32 | 32 | 2.17 | 2.17 | 461.18 | 439.28 | 673.59 | 3.408 | +| 1 | 2 | 32 | 32 | 2.15 | 2.15 | 919.12 | 904.83 | 1209.83 | 3.408 | +| 1 | 4 | 32 | 32 | 2.14 | 2.14 | 1771.34 | 1810.99 | 2171.71 | 3.408 | +| 1 | 8 | 32 | 32 | 2.12 | 2.12 | 3324.85 | 3704.57 | 3999.29 | 3.408 | +| 1 | 16 | 32 | 32 | 2.15 | 2.15 | 5651.09 | 7121.8 | 7495.67 | 3.408 | +| 1 | 64 | 32 | 32 | 2.15 | 2.15 | 7619.16 | 7505.14 | 14111.53 | 3.408 | +| 4 | 1 | 128 | 32 | 5.13 | 1.28 | 780.0 | 661.74 | 1586.9 | 3.408 | +| 4 | 2 | 128 | 32 | 5.09 | 1.27 | 1551.81 | 1368.81 | 2775.92 | 3.408 | +| 4 | 4 | 128 | 32 | 5.13 | 1.28 | 2995.57 | 2836.77 | 4532.67 | 3.408 | +| 4 | 8 | 128 | 32 | 5.16 | 1.29 | 
5619.47 | 6046.81 | 7767.22 | 3.408 | +| 4 | 16 | 128 | 32 | 5.15 | 1.29 | 9546.3 | 12146.68 | 14537.41 | 3.408 | +| 8 | 1 | 256 | 32 | 8.36 | 1.04 | 957.45 | 784.85 | 1591.01 | 3.408 | +| 8 | 2 | 256 | 32 | 8.29 | 1.04 | 1906.56 | 1847.53 | 2799.1 | 3.408 | +| 8 | 4 | 256 | 32 | 8.3 | 1.04 | 3691.59 | 3773.67 | 4849.58 | 3.408 | +| 8 | 8 | 256 | 32 | 8.29 | 1.04 | 6954.22 | 7681.7 | 8904.0 | 3.408 | +| 8 | 16 | 256 | 32 | 8.31 | 1.04 | 11923.83 | 14903.17 | 17072.19 | 3.408 | +| 16 | 1 | 512 | 32 | 13.83 | 0.86 | 1156.84 | 904.4 | 1609.17 | 3.408 | +| 16 | 2 | 512 | 32 | 13.91 | 0.87 | 2276.35 | 2356.15 | 3121.45 | 3.408 | +| 16 | 4 | 512 | 32 | 13.89 | 0.87 | 4440.77 | 4733.34 | 5488.65 | 3.408 | +| 16 | 8 | 512 | 32 | 13.83 | 0.86 | 8428.17 | 9407.5 | 10409.59 | 3.408 | +| 32 | 1 | 1024 | 32 | 23.28 | 0.73 | 1374.39 | 1603.35 | 1806.99 | 3.408 | +| 32 | 2 | 1024 | 32 | 23.26 | 0.73 | 2721.28 | 2607.91 | 3437.08 | 3.408 | +| 32 | 4 | 1024 | 32 | 23.18 | 0.72 | 5297.48 | 5234.53 | 6758.76 | 3.408 | +| 64 | 1 | 2048 | 32 | 34.97 | 0.55 | 1829.91 | 1899.54 | 2039.18 | 3.69 | +| 64 | 2 | 2048 | 32 | 34.78 | 0.54 | 3615.7 | 3795.36 | 4014.19 | 3.69 | + +## opus-mt-zh-en zh->en + +- Direction: `zh -> en` +- Column: `title_cn` +- Loaded rows: `2048` +- Load time: `3.2561 s` +- Device: `cuda` +- DType: `torch.float16` +- Cache disabled: `True` + +### Batch Sweep (`concurrency=1`) + +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 256 | 256 | 6.15 | 6.15 | 162.53 | 145.96 | 274.74 | 0.285 | +| 4 | 256 | 64 | 15.34 | 3.83 | 260.77 | 231.11 | 356.0 | 0.306 | +| 8 | 256 | 32 | 25.51 | 3.19 | 313.61 | 271.09 | 379.84 | 0.324 | +| 16 | 256 | 16 | 41.44 | 2.59 | 386.06 | 286.4 | 797.93 | 0.366 | +| 32 | 256 | 8 | 54.36 | 1.7 | 588.7 | 330.13 | 1693.31 | 0.461 | +| 64 | 256 | 4 | 70.15 | 1.1 | 912.33 | 447.93 | 2161.59 | 0.642 | + +### Concurrency Sweep (`batch_size=1`) + 
+| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 32 | 32 | 9.21 | 9.21 | 108.53 | 91.7 | 179.12 | 0.642 | +| 2 | 32 | 32 | 8.92 | 8.92 | 219.19 | 212.29 | 305.34 | 0.642 | +| 4 | 32 | 32 | 9.09 | 9.09 | 411.76 | 420.08 | 583.97 | 0.642 | +| 8 | 32 | 32 | 8.85 | 8.85 | 784.14 | 835.73 | 1043.06 | 0.642 | +| 16 | 32 | 32 | 9.01 | 9.01 | 1278.4 | 1483.34 | 1994.56 | 0.642 | +| 64 | 32 | 32 | 8.82 | 8.82 | 1687.08 | 1563.48 | 3381.58 | 0.642 | + +### Batch x Concurrency Matrix + +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 1 | 32 | 32 | 9.01 | 9.01 | 110.96 | 97.37 | 191.91 | 0.642 | +| 1 | 2 | 32 | 32 | 9.2 | 9.2 | 212.43 | 191.27 | 299.65 | 0.642 | +| 1 | 4 | 32 | 32 | 8.86 | 8.86 | 423.8 | 423.61 | 596.78 | 0.642 | +| 1 | 8 | 32 | 32 | 9.04 | 9.04 | 764.39 | 810.05 | 1035.74 | 0.642 | +| 1 | 16 | 32 | 32 | 9.06 | 9.06 | 1272.52 | 1468.35 | 1998.99 | 0.642 | +| 1 | 64 | 32 | 32 | 9.04 | 9.04 | 1663.7 | 1556.13 | 3296.1 | 0.642 | +| 4 | 1 | 128 | 32 | 15.17 | 3.79 | 263.61 | 201.11 | 369.93 | 0.642 | +| 4 | 2 | 128 | 32 | 15.25 | 3.81 | 517.25 | 397.13 | 1319.45 | 0.642 | +| 4 | 4 | 128 | 32 | 15.27 | 3.82 | 899.61 | 746.76 | 1914.27 | 0.642 | +| 4 | 8 | 128 | 32 | 15.39 | 3.85 | 1539.87 | 1466.41 | 2918.85 | 0.642 | +| 4 | 16 | 128 | 32 | 15.48 | 3.87 | 2455.89 | 2728.17 | 4644.99 | 0.642 | +| 8 | 1 | 256 | 32 | 25.44 | 3.18 | 314.41 | 272.16 | 386.05 | 0.642 | +| 8 | 2 | 256 | 32 | 25.42 | 3.18 | 620.41 | 541.88 | 1418.79 | 0.642 | +| 8 | 4 | 256 | 32 | 24.96 | 3.12 | 1226.44 | 1070.41 | 2940.56 | 0.642 | +| 8 | 8 | 256 | 32 | 25.44 | 3.18 | 2252.81 | 2148.42 | 4143.66 | 0.642 | +| 8 | 16 | 256 | 32 | 25.44 | 3.18 | 3961.06 | 5028.31 | 6340.79 | 0.642 | +| 16 | 1 | 512 | 32 | 31.66 | 1.98 | 505.38 | 321.69 | 1963.58 | 0.653 | 
+| 16 | 2 | 512 | 32 | 31.41 | 1.96 | 1009.12 | 677.84 | 2341.01 | 0.653 | +| 16 | 4 | 512 | 32 | 31.17 | 1.95 | 2006.54 | 1562.68 | 4299.85 | 0.653 | +| 16 | 8 | 512 | 32 | 31.51 | 1.97 | 3483.96 | 4085.09 | 5902.32 | 0.653 | +| 32 | 1 | 1024 | 32 | 38.66 | 1.21 | 827.7 | 409.78 | 2162.48 | 0.748 | +| 32 | 2 | 1024 | 32 | 38.75 | 1.21 | 1613.28 | 916.1 | 3228.1 | 0.748 | +| 32 | 4 | 1024 | 32 | 38.66 | 1.21 | 3162.05 | 3239.36 | 4992.07 | 0.748 | +| 64 | 1 | 2048 | 32 | 52.44 | 0.82 | 1220.35 | 533.4 | 2508.12 | 0.939 | +| 64 | 2 | 2048 | 32 | 52.04 | 0.81 | 2443.4 | 2800.35 | 4765.45 | 0.939 | + +## opus-mt-en-zh en->zh + +- Direction: `en -> zh` +- Column: `title` +- Loaded rows: `2048` +- Load time: `3.1612 s` +- Device: `cuda` +- DType: `torch.float16` +- Cache disabled: `True` + +### Batch Sweep (`concurrency=1`) + +| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 256 | 256 | 4.53 | 4.53 | 220.59 | 177.37 | 411.57 | 0.285 | +| 4 | 256 | 64 | 10.12 | 2.53 | 395.37 | 319.23 | 761.49 | 0.307 | +| 8 | 256 | 32 | 14.63 | 1.83 | 546.88 | 390.81 | 1930.85 | 0.326 | +| 16 | 256 | 16 | 24.33 | 1.52 | 657.6 | 451.22 | 2098.54 | 0.37 | +| 32 | 256 | 8 | 33.91 | 1.06 | 943.59 | 603.45 | 2152.28 | 0.462 | +| 64 | 256 | 4 | 42.47 | 0.66 | 1507.01 | 1531.43 | 2371.85 | 0.644 | + +### Concurrency Sweep (`batch_size=1`) + +| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 32 | 32 | 3.6 | 3.6 | 277.73 | 145.85 | 1180.37 | 0.644 | +| 2 | 32 | 32 | 3.55 | 3.55 | 559.38 | 346.71 | 1916.96 | 0.644 | +| 4 | 32 | 32 | 3.53 | 3.53 | 997.71 | 721.04 | 2944.17 | 0.644 | +| 8 | 32 | 32 | 3.51 | 3.51 | 1644.28 | 1590.93 | 3632.99 | 0.644 | +| 16 | 32 | 32 | 3.5 | 3.5 | 2600.18 | 2586.34 | 5554.04 | 0.644 | +| 64 | 32 | 32 | 3.52 | 3.52 | 3366.52 | 2780.0 | 7950.41 | 0.644 | 
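+The concurrency sweeps above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual harness: `translate` stands in for a call like `service.translate`, and the nearest-rank `percentile` here is an assumption that may differ from the exact percentile bookkeeping in `scripts/benchmark_translation_local_models.py`.

```python
import concurrent.futures
import time
from typing import Callable, Dict, List


def percentile(values: List[float], p: float) -> float:
    # Nearest-rank percentile on a sorted copy; mirrors the p50/p95 columns above.
    ordered = sorted(values)
    if not ordered:
        return 0.0
    idx = min(len(ordered) - 1, max(0, int(round(p * (len(ordered) - 1)))))
    return ordered[idx]


def run_concurrency_case(
    translate: Callable[[str], str],
    texts: List[str],
    concurrency: int,
) -> Dict[str, float]:
    """Issue one request per text at the given client concurrency and report
    throughput plus per-request latency percentiles."""
    latencies_ms: List[float] = []
    start = time.perf_counter()

    def one_request(text: str) -> None:
        t0 = time.perf_counter()
        translate(text)
        # list.append is atomic in CPython, so no extra locking is needed here.
        latencies_ms.append((time.perf_counter() - t0) * 1000)

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Consume the iterator so any worker exception propagates.
        list(pool.map(one_request, texts))

    wall = time.perf_counter() - start
    return {
        "items_per_second": round(len(texts) / wall, 2),
        "request_latency_p50_ms": round(percentile(latencies_ms, 0.50), 2),
        "request_latency_p95_ms": round(percentile(latencies_ms, 0.95), 2),
    }
```

+If the backend serializes work on a single GPU worker, raising `concurrency` in this harness mostly stretches per-request latency while `items_per_second` stays flat, which is exactly the pattern the sweep tables show.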
+ +### Batch x Concurrency Matrix + +| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 1 | 1 | 32 | 32 | 3.54 | 3.54 | 282.22 | 148.26 | 1200.93 | 0.644 | +| 1 | 2 | 32 | 32 | 3.52 | 3.52 | 563.85 | 346.8 | 1957.63 | 0.644 | +| 1 | 4 | 32 | 32 | 3.52 | 3.52 | 1002.79 | 728.36 | 2949.99 | 0.644 | +| 1 | 8 | 32 | 32 | 3.55 | 3.55 | 1611.52 | 1462.84 | 3653.34 | 0.644 | +| 1 | 16 | 32 | 32 | 3.49 | 3.49 | 2582.77 | 2594.49 | 5504.07 | 0.644 | +| 1 | 64 | 32 | 32 | 3.53 | 3.53 | 3327.65 | 2753.11 | 7907.54 | 0.644 | +| 4 | 1 | 128 | 32 | 10.31 | 2.58 | 387.75 | 329.42 | 675.07 | 0.644 | +| 4 | 2 | 128 | 32 | 10.12 | 2.53 | 784.49 | 713.61 | 1561.27 | 0.644 | +| 4 | 4 | 128 | 32 | 10.08 | 2.52 | 1543.42 | 1438.42 | 2735.85 | 0.644 | +| 4 | 8 | 128 | 32 | 10.09 | 2.52 | 2918.06 | 2954.39 | 4327.47 | 0.644 | +| 4 | 16 | 128 | 32 | 10.16 | 2.54 | 5087.62 | 5734.85 | 7343.34 | 0.644 | +| 8 | 1 | 256 | 32 | 14.49 | 1.81 | 552.16 | 402.18 | 1963.8 | 0.644 | +| 8 | 2 | 256 | 32 | 14.54 | 1.82 | 1088.96 | 845.04 | 2565.11 | 0.644 | +| 8 | 4 | 256 | 32 | 14.48 | 1.81 | 2143.18 | 1832.69 | 4487.97 | 0.644 | +| 8 | 8 | 256 | 32 | 14.51 | 1.81 | 4051.04 | 3734.16 | 5976.18 | 0.644 | +| 8 | 16 | 256 | 32 | 14.55 | 1.82 | 6626.6 | 6868.91 | 9395.9 | 0.644 | +| 16 | 1 | 512 | 32 | 19.11 | 1.19 | 837.36 | 457.51 | 2030.2 | 0.661 | +| 16 | 2 | 512 | 32 | 19.26 | 1.2 | 1599.87 | 1141.06 | 2802.44 | 0.661 | +| 16 | 4 | 512 | 32 | 19.1 | 1.19 | 3130.81 | 3166.45 | 4987.49 | 0.661 | +| 16 | 8 | 512 | 32 | 19.27 | 1.2 | 5735.13 | 5351.82 | 8417.85 | 0.661 | +| 32 | 1 | 1024 | 32 | 26.04 | 0.81 | 1228.66 | 648.06 | 2167.39 | 0.751 | +| 32 | 2 | 1024 | 32 | 26.21 | 0.82 | 2372.52 | 2593.74 | 4206.96 | 0.751 | +| 32 | 4 | 1024 | 32 | 26.11 | 0.82 | 4499.96 | 3977.65 | 6920.57 | 0.751 | +| 64 | 1 | 2048 | 32 | 34.94 | 0.55 | 1831.48 | 2366.24 | 2473.74 | 0.942 | +| 64 | 2 | 
2048 | 32 | 34.75 | 0.54 | 3603.99 | 3281.6 | 4948.72 | 0.942 | diff --git a/scripts/benchmark_translation_local_models.py b/scripts/benchmark_translation_local_models.py index 911f3d0..845fc44 100644 --- a/scripts/benchmark_translation_local_models.py +++ b/scripts/benchmark_translation_local_models.py @@ -4,6 +4,7 @@ from __future__ import annotations import argparse +import concurrent.futures import copy import csv import json @@ -16,7 +17,7 @@ import sys import time from datetime import datetime from pathlib import Path -from typing import Any, Dict, Iterable, List +from typing import Any, Dict, Iterable, List, Sequence import torch import transformers @@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402 from translation.settings import get_translation_capability # noqa: E402 +DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64] +DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64] + SCENARIOS: List[Dict[str, str]] = [ { "name": "nllb-200-distilled-600m zh->en", @@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [ def parse_args() -> argparse.Namespace: parser = argparse.ArgumentParser(description="Benchmark local translation models") parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path") - parser.add_argument("--limit", type=int, default=0, help="Limit rows for faster experiments; 0 means all") + parser.add_argument("--limit", type=int, default=0, help="Limit rows for baseline or single-case run; 0 means all") parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports") parser.add_argument("--single", action="store_true", help="Run a single scenario in-process") parser.add_argument("--model", default="", help="Model name for --single mode") @@ -84,9 +88,67 @@ def parse_args() -> argparse.Namespace: parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams") parser.add_argument("--attn-implementation", default="", help="Override attention 
implementation, for example sdpa") parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring") + parser.add_argument("--disable-cache", action="store_true", help="Disable translation cache during benchmarks") + parser.add_argument( + "--suite", + choices=["baseline", "extended"], + default="baseline", + help="baseline keeps the previous all-scenarios summary; extended adds batch/concurrency/matrix sweeps", + ) + parser.add_argument( + "--batch-size-list", + default="", + help="Comma-separated batch sizes for extended suite; default 1,4,8,16,32,64", + ) + parser.add_argument( + "--concurrency-list", + default="", + help="Comma-separated concurrency levels for extended suite; default 1,2,4,8,16,64", + ) + parser.add_argument( + "--serial-items-per-case", + type=int, + default=512, + help="Items per batch-size case in extended suite", + ) + parser.add_argument( + "--concurrency-requests-per-case", + type=int, + default=128, + help="Requests per concurrency or matrix case in extended suite", + ) + parser.add_argument( + "--concurrency-batch-size", + type=int, + default=1, + help="Batch size used by the dedicated concurrency sweep", + ) + parser.add_argument( + "--max-batch-concurrency-product", + type=int, + default=128, + help="Skip matrix cases where batch_size * concurrency exceeds this value; 0 disables the limit", + ) return parser.parse_args() +def parse_csv_ints(raw: str, fallback: Sequence[int]) -> List[int]: + if not raw.strip(): + return list(fallback) + values: List[int] = [] + for item in raw.split(","): + stripped = item.strip() + if not stripped: + continue + value = int(stripped) + if value <= 0: + raise ValueError(f"Expected positive integer, got {value}") + values.append(value) + if not values: + raise ValueError("Parsed empty integer list") + return values + + def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: texts: List[str] = [] with csv_path.open("r", encoding="utf-8") as handle: @@ 
-102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: return texts -def batched(values: List[str], batch_size: int) -> Iterable[List[str]]: +def batched(values: Sequence[str], batch_size: int) -> Iterable[List[str]]: for start in range(0, len(values), batch_size): - yield values[start:start + batch_size] + yield list(values[start:start + batch_size]) def percentile(values: List[float], p: float) -> float: @@ -148,15 +210,34 @@ def build_environment_info() -> Dict[str, Any]: } -def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: - csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path) +def scenario_from_args(args: argparse.Namespace) -> Dict[str, str]: + return { + "name": f"{args.model} {args.source_lang}->{args.target_lang}", + "model": args.model, + "source_lang": args.source_lang, + "target_lang": args.target_lang, + "column": args.column, + "scene": args.scene, + } + + +def build_config_and_capability( + args: argparse.Namespace, + *, + batch_size_override: int | None = None, +) -> tuple[Dict[str, Any], Dict[str, Any]]: config = copy.deepcopy(get_translation_config()) + for name, cfg in config["capabilities"].items(): + cfg["enabled"] = name == args.model + config["default_model"] = args.model capability = get_translation_capability(config, args.model, require_enabled=False) if args.device_override: capability["device"] = args.device_override if args.torch_dtype_override: capability["torch_dtype"] = args.torch_dtype_override - if args.batch_size: + if batch_size_override is not None: + capability["batch_size"] = batch_size_override + elif args.batch_size: capability["batch_size"] = args.batch_size if args.max_new_tokens: capability["max_new_tokens"] = args.max_new_tokens @@ -164,28 +245,59 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]: capability["num_beams"] = args.num_beams if args.attn_implementation: 
         capability["attn_implementation"] = args.attn_implementation
+    if args.disable_cache:
+        capability["use_cache"] = False
     config["capabilities"][args.model] = capability
-    configured_batch_size = int(capability.get("batch_size") or 1)
-    batch_size = configured_batch_size
-    texts = load_texts(csv_path, args.column, args.limit)
+    return config, capability
-    service = TranslationService(config)
+
+
+def ensure_cuda_stats_reset() -> None:
     if torch.cuda.is_available():
         torch.cuda.empty_cache()
         torch.cuda.reset_peak_memory_stats()
-    load_start = time.perf_counter()
-    backend = service.get_backend(args.model)
-    load_seconds = time.perf_counter() - load_start
-    warmup_batches = min(max(args.warmup_batches, 0), max(1, math.ceil(len(texts) / batch_size)))
-    for batch in list(batched(texts, batch_size))[:warmup_batches]:
+def build_memory_metrics() -> Dict[str, Any]:
+    peak_gpu_mem_gb = None
+    peak_gpu_reserved_gb = None
+    if torch.cuda.is_available():
+        peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3)
+        peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3)
+    max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2)
+    return {
+        "max_rss_mb": max_rss_mb,
+        "peak_gpu_memory_gb": peak_gpu_mem_gb,
+        "peak_gpu_reserved_gb": peak_gpu_reserved_gb,
+    }
+
+
+def make_request_payload(batch: Sequence[str]) -> str | List[str]:
+    if len(batch) == 1:
+        return batch[0]
+    return list(batch)
+
+
+def benchmark_serial_case(
+    *,
+    service: TranslationService,
+    backend: Any,
+    scenario: Dict[str, str],
+    capability: Dict[str, Any],
+    texts: List[str],
+    batch_size: int,
+    warmup_batches: int,
+) -> Dict[str, Any]:
+    backend.batch_size = batch_size
+    measured_batches = list(batched(texts, batch_size))
+    warmup_count = min(max(warmup_batches, 0), len(measured_batches))
+
+    for batch in measured_batches[:warmup_count]:
         service.translate(
-            text=batch,
-            source_lang=args.source_lang,
-            target_lang=args.target_lang,
-            model=args.model,
-            scene=args.scene,
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
         )
     batch_latencies_ms: List[float] = []
@@ -193,86 +305,318 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
     failure_count = 0
     output_chars = 0
     total_input_chars = sum(len(text) for text in texts)
-    measured_batches = list(batched(texts, batch_size))
     start = time.perf_counter()
     for batch in measured_batches:
         batch_start = time.perf_counter()
         outputs = service.translate(
-            text=batch,
-            source_lang=args.source_lang,
-            target_lang=args.target_lang,
-            model=args.model,
-            scene=args.scene,
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
         )
         elapsed_ms = (time.perf_counter() - batch_start) * 1000
         batch_latencies_ms.append(elapsed_ms)
-        if not isinstance(outputs, list):
-            raise RuntimeError(f"Expected list output for batch translation, got {type(outputs)!r}")
-        for item in outputs:
+        if isinstance(outputs, list):
+            result_items = outputs
+        else:
+            result_items = [outputs]
+        for item in result_items:
             if item is None:
                 failure_count += 1
             else:
                 success_count += 1
                 output_chars += len(item)
     translate_seconds = time.perf_counter() - start
+    total_items = len(texts)
+    memory = build_memory_metrics()
-    peak_gpu_mem_gb = None
-    peak_gpu_reserved_gb = None
-    if torch.cuda.is_available():
-        peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3)
-        peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3)
+    return {
+        "mode": "serial_batch",
+        "batch_size": batch_size,
+        "concurrency": 1,
+        "rows": total_items,
+        "requests": len(measured_batches),
+        "input_chars": total_input_chars,
+        "load_seconds": 0.0,
+        "translate_seconds": round(translate_seconds, 4),
+        "total_seconds": round(translate_seconds, 4),
+        "batch_count": len(batch_latencies_ms),
+        "request_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
+        "request_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
+        "request_latency_max_ms": round(max(batch_latencies_ms), 2),
+        "avg_request_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
+        "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3),
+        "requests_per_second": round(len(measured_batches) / translate_seconds, 2),
+        "items_per_second": round(total_items / translate_seconds, 2),
+        "input_chars_per_second": round(total_input_chars / translate_seconds, 2),
+        "output_chars_per_second": round(output_chars / translate_seconds, 2),
+        "success_count": success_count,
+        "failure_count": failure_count,
+        "success_rate": round(success_count / total_items, 6),
+        "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
+        "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
+        "configured_batch_size": int(capability.get("batch_size") or batch_size),
+        "used_batch_size": batch_size,
+        "warmup_batches": warmup_count,
+        **memory,
+    }
-    max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2)
-    total_items = len(texts)
+
+def benchmark_concurrency_case(
+    *,
+    service: TranslationService,
+    backend: Any,
+    scenario: Dict[str, str],
+    capability: Dict[str, Any],
+    texts: List[str],
+    batch_size: int,
+    concurrency: int,
+    requests_per_case: int,
+    warmup_batches: int,
+) -> Dict[str, Any]:
+    backend.batch_size = batch_size
+    required_items = batch_size * requests_per_case
+    case_texts = texts[:required_items]
+    request_batches = list(batched(case_texts, batch_size))
+    if not request_batches:
+        raise ValueError("No request batches prepared for concurrency benchmark")
+    warmup_count = min(max(warmup_batches, 0), len(request_batches))
+
+    for batch in request_batches[:warmup_count]:
+        service.translate(
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
+        )
+
+    request_latencies_ms: List[float] = []
+    success_count = 0
+    failure_count = 0
+    output_chars = 0
+    total_input_chars = sum(len(text) for text in case_texts)
+
+    def worker(batch: List[str]) -> tuple[float, int, int, int]:
+        started = time.perf_counter()
+        outputs = service.translate(
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
+        )
+        elapsed_ms = (time.perf_counter() - started) * 1000
+        if isinstance(outputs, list):
+            result_items = outputs
+        else:
+            result_items = [outputs]
+        local_success = 0
+        local_failure = 0
+        local_output_chars = 0
+        for item in result_items:
+            if item is None:
+                local_failure += 1
+            else:
+                local_success += 1
+                local_output_chars += len(item)
+        return elapsed_ms, local_success, local_failure, local_output_chars
+
+    wall_start = time.perf_counter()
+    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
+        futures = [executor.submit(worker, batch) for batch in request_batches]
+        for future in concurrent.futures.as_completed(futures):
+            latency_ms, local_success, local_failure, local_output_chars = future.result()
+            request_latencies_ms.append(latency_ms)
+            success_count += local_success
+            failure_count += local_failure
+            output_chars += local_output_chars
+    wall_seconds = time.perf_counter() - wall_start
+    total_items = len(case_texts)
+    memory = build_memory_metrics()
     return {
-        "scenario": {
-            "name": f"{args.model} {args.source_lang}->{args.target_lang}",
-            "model": args.model,
-            "source_lang": args.source_lang,
-            "target_lang": args.target_lang,
-            "column": args.column,
-            "scene": args.scene,
+        "mode": "concurrency",
+        "batch_size": batch_size,
+        "concurrency": concurrency,
+        "rows": total_items,
+        "requests": len(request_batches),
+        "input_chars": total_input_chars,
+        "load_seconds": 0.0,
+        "translate_seconds": round(wall_seconds, 4),
+        "total_seconds": round(wall_seconds, 4),
+        "batch_count": len(request_latencies_ms),
+        "request_latency_p50_ms": round(percentile(request_latencies_ms, 0.50), 2),
+        "request_latency_p95_ms": round(percentile(request_latencies_ms, 0.95), 2),
+        "request_latency_max_ms": round(max(request_latencies_ms), 2),
+        "avg_request_latency_ms": round(statistics.fmean(request_latencies_ms), 2),
+        "avg_item_latency_ms": round((wall_seconds / total_items) * 1000, 3),
+        "requests_per_second": round(len(request_batches) / wall_seconds, 2),
+        "items_per_second": round(total_items / wall_seconds, 2),
+        "input_chars_per_second": round(total_input_chars / wall_seconds, 2),
+        "output_chars_per_second": round(output_chars / wall_seconds, 2),
+        "success_count": success_count,
+        "failure_count": failure_count,
+        "success_rate": round(success_count / total_items, 6),
+        "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
+        "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
+        "configured_batch_size": int(capability.get("batch_size") or batch_size),
+        "used_batch_size": batch_size,
+        "warmup_batches": warmup_count,
+        **memory,
+    }
+
+
+def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
+    csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
+    scenario = scenario_from_args(args)
+    config, capability = build_config_and_capability(args)
+    configured_batch_size = int(capability.get("batch_size") or 1)
+    batch_size = configured_batch_size
+    texts = load_texts(csv_path, args.column, args.limit)
+
+    ensure_cuda_stats_reset()
+    load_start = time.perf_counter()
+    service = TranslationService(config)
+    backend = service.get_backend(args.model)
+    load_seconds = time.perf_counter() - load_start
+
+    runtime = benchmark_serial_case(
+        service=service,
+        backend=backend,
+        scenario=scenario,
+        capability=capability,
+        texts=texts,
+        batch_size=batch_size,
+        warmup_batches=args.warmup_batches,
+    )
+    runtime["load_seconds"] = round(load_seconds, 4)
+    runtime["total_seconds"] = round(runtime["load_seconds"] + runtime["translate_seconds"], 4)
+
+    return {
+        "scenario": scenario,
+        "dataset": {
+            "csv_path": str(csv_path),
+            "rows": len(texts),
+            "input_chars": sum(len(text) for text in texts),
         },
+        "runtime": runtime,
+    }
+
+
+def benchmark_extended_scenario(args: argparse.Namespace) -> Dict[str, Any]:
+    csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
+    scenario = scenario_from_args(args)
+    batch_sizes = parse_csv_ints(args.batch_size_list, DEFAULT_BATCH_SIZES)
+    concurrencies = parse_csv_ints(args.concurrency_list, DEFAULT_CONCURRENCIES)
+    largest_batch = max(batch_sizes + [args.concurrency_batch_size])
+    largest_concurrency = max(concurrencies)
+    max_product = args.max_batch_concurrency_product
+    required_items = max(
+        args.limit or 0,
+        max(args.serial_items_per_case, largest_batch),
+        args.concurrency_requests_per_case * args.concurrency_batch_size,
+        largest_batch * args.concurrency_requests_per_case,
+    )
+    texts = load_texts(csv_path, args.column, required_items)
+    config, capability = build_config_and_capability(args)
+
+    ensure_cuda_stats_reset()
+    load_start = time.perf_counter()
+    service = TranslationService(config)
+    backend = service.get_backend(args.model)
+    load_seconds = time.perf_counter() - load_start
+
+    batch_sweep: List[Dict[str, Any]] = []
+    concurrency_sweep: List[Dict[str, Any]] = []
+    matrix_results: List[Dict[str, Any]] = []
+
+    for batch_size in batch_sizes:
+        case_texts = texts[: max(batch_size, args.serial_items_per_case)]
+        batch_sweep.append(
+            benchmark_serial_case(
+                service=service,
+                backend=backend,
+                scenario=scenario,
+                capability=capability,
+                texts=case_texts,
+                batch_size=batch_size,
+                warmup_batches=args.warmup_batches,
+            )
+        )
+
+    for concurrency in concurrencies:
+        concurrency_sweep.append(
+            benchmark_concurrency_case(
+                service=service,
+                backend=backend,
+                scenario=scenario,
+                capability=capability,
+                texts=texts,
+                batch_size=args.concurrency_batch_size,
+                concurrency=concurrency,
+                requests_per_case=args.concurrency_requests_per_case,
+                warmup_batches=args.warmup_batches,
+            )
+        )
+
+    for batch_size in batch_sizes:
+        for concurrency in concurrencies:
+            if max_product > 0 and batch_size * concurrency > max_product:
+                continue
+            matrix_results.append(
+                benchmark_concurrency_case(
+                    service=service,
+                    backend=backend,
+                    scenario=scenario,
+                    capability=capability,
+                    texts=texts,
+                    batch_size=batch_size,
+                    concurrency=concurrency,
+                    requests_per_case=args.concurrency_requests_per_case,
+                    warmup_batches=args.warmup_batches,
+                )
+            )
+
+    for collection in (batch_sweep, concurrency_sweep, matrix_results):
+        for idx, item in enumerate(collection):
+            item["load_seconds"] = round(load_seconds if idx == 0 else 0.0, 4)
+            item["total_seconds"] = round(item["load_seconds"] + item["translate_seconds"], 4)
+
+    return {
+        "scenario": scenario,
         "dataset": {
             "csv_path": str(csv_path),
-            "rows": total_items,
-            "input_chars": total_input_chars,
+            "rows_loaded": len(texts),
+        },
+        "config": {
+            "batch_sizes": batch_sizes,
+            "concurrencies": concurrencies,
+            "serial_items_per_case": args.serial_items_per_case,
+            "concurrency_requests_per_case": args.concurrency_requests_per_case,
+            "concurrency_batch_size": args.concurrency_batch_size,
+            "max_batch_concurrency_product": max_product,
+            "cache_disabled": bool(args.disable_cache),
         },
-        "runtime": {
+        "runtime_defaults": {
             "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
             "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
-            "configured_batch_size": configured_batch_size,
-            "used_batch_size": batch_size,
-            "warmup_batches": warmup_batches,
+            "configured_batch_size": int(capability.get("batch_size") or 1),
             "load_seconds": round(load_seconds, 4),
-            "translate_seconds": round(translate_seconds, 4),
-            "total_seconds": round(load_seconds + translate_seconds, 4),
-            "batch_count": len(batch_latencies_ms),
-            "first_batch_ms": round(batch_latencies_ms[0], 2),
-            "batch_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
-            "batch_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
-            "batch_latency_max_ms": round(max(batch_latencies_ms), 2),
-            "avg_batch_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
-            "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3),
-            "items_per_second": round(total_items / translate_seconds, 2),
-            "input_chars_per_second": round(total_input_chars / translate_seconds, 2),
-            "output_chars_per_second": round(output_chars / translate_seconds, 2),
-            "success_count": success_count,
-            "failure_count": failure_count,
-            "success_rate": round(success_count / total_items, 6),
-            "max_rss_mb": max_rss_mb,
-            "peak_gpu_memory_gb": peak_gpu_mem_gb,
-            "peak_gpu_reserved_gb": peak_gpu_reserved_gb,
         },
+        "batch_sweep": batch_sweep,
+        "concurrency_sweep": concurrency_sweep,
+        "matrix": matrix_results,
     }
 
 
 def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
     report = {
         "generated_at": datetime.now().isoformat(timespec="seconds"),
+        "suite": args.suite,
         "environment": build_environment_info(),
         "scenarios": [],
     }
@@ -296,11 +640,25 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
             scenario["scene"],
             "--warmup-batches",
             str(args.warmup_batches),
+            "--suite",
+            args.suite,
+            "--serial-items-per-case",
+            str(args.serial_items_per_case),
+            "--concurrency-requests-per-case",
+            str(args.concurrency_requests_per_case),
+            "--concurrency-batch-size",
+            str(args.concurrency_batch_size),
+            "--max-batch-concurrency-product",
+            str(args.max_batch_concurrency_product),
         ]
         if args.limit:
            cmd.extend(["--limit", str(args.limit)])
         if args.batch_size:
             cmd.extend(["--batch-size", str(args.batch_size)])
+        if args.batch_size_list:
+            cmd.extend(["--batch-size-list", args.batch_size_list])
+        if args.concurrency_list:
+            cmd.extend(["--concurrency-list", args.concurrency_list])
         if args.device_override:
             cmd.extend(["--device-override", args.device_override])
         if args.torch_dtype_override:
@@ -311,6 +669,8 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
             cmd.extend(["--num-beams", str(args.num_beams)])
         if args.attn_implementation:
             cmd.extend(["--attn-implementation", args.attn_implementation])
+        if args.disable_cache:
+            cmd.append("--disable-cache")
         completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
         result_line = ""
@@ -327,11 +687,12 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
     return report
 
 
-def render_markdown_report(report: Dict[str, Any]) -> str:
+def render_baseline_markdown_report(report: Dict[str, Any]) -> str:
     lines = [
         "# Local Translation Model Benchmark",
         "",
         f"- Generated at: `{report['generated_at']}`",
+        f"- Suite: `{report['suite']}`",
         f"- Python: `{report['environment']['python']}`",
         f"- Torch: `{report['environment']['torch']}`",
         f"- Transformers: `{report['environment']['transformers']}`",
@@ -342,19 +703,19 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
     lines.extend(
         [
             "",
-            "| Scenario | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Load s | Peak GPU GiB | Success |",
+            "| Scenario | Items/s | Avg item ms | Req p50 ms | Req p95 ms | Load s | Peak GPU GiB | Success |",
             "|---|---:|---:|---:|---:|---:|---:|---:|",
         ]
     )
     for item in report["scenarios"]:
         runtime = item["runtime"]
         lines.append(
-            "| {name} | {items_per_second} | {avg_item_latency_ms} | {batch_latency_p50_ms} | {batch_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format(
+            "| {name} | {items_per_second} | {avg_item_latency_ms} | {request_latency_p50_ms} | {request_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format(
                 name=item["scenario"]["name"],
                 items_per_second=runtime["items_per_second"],
                 avg_item_latency_ms=runtime["avg_item_latency_ms"],
-                batch_latency_p50_ms=runtime["batch_latency_p50_ms"],
-                batch_latency_p95_ms=runtime["batch_latency_p95_ms"],
+                request_latency_p50_ms=runtime["request_latency_p50_ms"],
+                request_latency_p95_ms=runtime["request_latency_p95_ms"],
                 load_seconds=runtime["load_seconds"],
                 peak_gpu_memory_gb=runtime["peak_gpu_memory_gb"],
                 success_rate=runtime["success_rate"],
@@ -375,7 +736,7 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
             f"- Load time: `{runtime['load_seconds']} s`",
             f"- Translate time: `{runtime['translate_seconds']} s`",
             f"- Throughput: `{runtime['items_per_second']} items/s`, `{runtime['input_chars_per_second']} input chars/s`",
-            f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, batch p50 `{runtime['batch_latency_p50_ms']} ms`, batch p95 `{runtime['batch_latency_p95_ms']} ms`, batch max `{runtime['batch_latency_max_ms']} ms`",
+            f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, req p50 `{runtime['request_latency_p50_ms']} ms`, req p95 `{runtime['request_latency_p95_ms']} ms`, req max `{runtime['request_latency_max_ms']} ms`",
             f"- Memory: max RSS `{runtime['max_rss_mb']} MB`, peak GPU allocated `{runtime['peak_gpu_memory_gb']} GiB`, peak GPU reserved `{runtime['peak_gpu_reserved_gb']} GiB`",
             f"- Success: `{runtime['success_count']}/{dataset['rows']}`",
             "",
@@ -384,32 +745,145 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
     return "\n".join(lines)
 
 
+def render_case_table(
+    title: str,
+    rows: Sequence[Dict[str, Any]],
+    *,
+    include_batch: bool,
+    include_concurrency: bool,
+) -> List[str]:
+    headers = ["Rows", "Requests", "Items/s", "Req/s", "Avg req ms", "Req p50 ms", "Req p95 ms", "Peak GPU GiB"]
+    prefix_headers: List[str] = []
+    if include_batch:
+        prefix_headers.append("Batch")
+    if include_concurrency:
+        prefix_headers.append("Concurrency")
+    headers = prefix_headers + headers
+    lines = [f"### {title}", ""]
+    lines.append("| " + " | ".join(headers) + " |")
+    lines.append("|" + "|".join(["---:"] * len(headers)) + "|")
+    for item in rows:
+        values: List[str] = []
+        if include_batch:
+            values.append(str(item["batch_size"]))
+        if include_concurrency:
+            values.append(str(item["concurrency"]))
+        values.extend(
+            [
+                str(item["rows"]),
+                str(item["requests"]),
+                str(item["items_per_second"]),
+                str(item["requests_per_second"]),
+                str(item["avg_request_latency_ms"]),
+                str(item["request_latency_p50_ms"]),
+                str(item["request_latency_p95_ms"]),
+                str(item["peak_gpu_memory_gb"]),
+            ]
+        )
+        lines.append("| " + " | ".join(values) + " |")
+    lines.append("")
+    return lines
+
+
+def render_extended_markdown_report(report: Dict[str, Any]) -> str:
+    lines = [
+        "# Local Translation Model Extended Benchmark",
+        "",
+        f"- Generated at: `{report['generated_at']}`",
+        f"- Suite: `{report['suite']}`",
+        f"- Python: `{report['environment']['python']}`",
+        f"- Torch: `{report['environment']['torch']}`",
+        f"- Transformers: `{report['environment']['transformers']}`",
+        f"- CUDA: `{report['environment']['cuda_available']}`",
+    ]
+    if report["environment"]["gpu_name"]:
+        lines.append(f"- GPU: `{report['environment']['gpu_name']}` ({report['environment']['gpu_total_mem_gb']} GiB)")
+
+    lines.extend(
+        [
+            "",
+            "## Reading Guide",
+            "",
+            "- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.",
+            "- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.",
+            "- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.",
+            "",
+        ]
+    )
+
+    for item in report["scenarios"]:
+        lines.extend(
+            [
+                f"## {item['scenario']['name']}",
+                "",
+                f"- Direction: `{item['scenario']['source_lang']} -> {item['scenario']['target_lang']}`",
+                f"- Column: `{item['scenario']['column']}`",
+                f"- Loaded rows: `{item['dataset']['rows_loaded']}`",
+                f"- Load time: `{item['runtime_defaults']['load_seconds']} s`",
+                f"- Device: `{item['runtime_defaults']['device']}`",
+                f"- DType: `{item['runtime_defaults']['torch_dtype']}`",
+                f"- Cache disabled: `{item['config']['cache_disabled']}`",
+                "",
+            ]
+        )
+        lines.extend(render_case_table("Batch Sweep (`concurrency=1`)", item["batch_sweep"], include_batch=True, include_concurrency=False))
+        lines.extend(
+            render_case_table(
+                f"Concurrency Sweep (`batch_size={item['config']['concurrency_batch_size']}`)",
+                item["concurrency_sweep"],
+                include_batch=False,
+                include_concurrency=True,
+            )
+        )
+        lines.extend(render_case_table("Batch x Concurrency Matrix", item["matrix"], include_batch=True, include_concurrency=True))
+    return "\n".join(lines)
+
+
+def render_markdown_report(report: Dict[str, Any]) -> str:
+    if report["suite"] == "extended":
+        return render_extended_markdown_report(report)
+    return render_baseline_markdown_report(report)
+
+
 def main() -> None:
     args = parse_args()
     if args.single:
-        result = benchmark_single_scenario(args)
+        if args.suite == "extended":
+            result = benchmark_extended_scenario(args)
+        else:
+            result = benchmark_single_scenario(args)
         print("JSON_RESULT=" + json.dumps(result, ensure_ascii=False))
         return
 
     report = run_all_scenarios(args)
     output_dir = resolve_output_dir(args.output_dir)
     timestamp = datetime.now().strftime("%H%M%S")
-    json_path = output_dir / f"translation_local_models_{timestamp}.json"
-    md_path = output_dir / f"translation_local_models_{timestamp}.md"
+    suffix = "extended" if args.suite == "extended" else "baseline"
+    json_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.json"
+    md_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.md"
     json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
     md_path.write_text(render_markdown_report(report), encoding="utf-8")
     print(f"JSON report: {json_path}")
     print(f"Markdown report: {md_path}")
     for item in report["scenarios"]:
-        runtime = item["runtime"]
-        print(
-            f"{item['scenario']['name']}: "
-            f"{runtime['items_per_second']} items/s | "
-            f"avg_item={runtime['avg_item_latency_ms']} ms | "
-            f"p95_batch={runtime['batch_latency_p95_ms']} ms | "
-            f"load={runtime['load_seconds']} s"
-        )
+        if args.suite == "extended":
+            best_batch = max(item["batch_sweep"], key=lambda x: x["items_per_second"])
+            best_concurrency = max(item["concurrency_sweep"], key=lambda x: x["items_per_second"])
+            print(
+                f"{item['scenario']['name']}: "
+                f"best_batch={best_batch['batch_size']} ({best_batch['items_per_second']} items/s) | "
+                f"best_concurrency={best_concurrency['concurrency']} ({best_concurrency['items_per_second']} items/s @ batch={best_concurrency['batch_size']})"
+            )
+        else:
+            runtime = item["runtime"]
+            print(
                f"{item['scenario']['name']}: "
                f"{runtime['items_per_second']} items/s | "
                f"avg_item={runtime['avg_item_latency_ms']} ms | "
                f"p95_req={runtime['request_latency_p95_ms']} ms | "
                f"load={runtime['load_seconds']} s"
            )
 
 
 if __name__ == "__main__":
diff --git a/translation/README.md b/translation/README.md
index 6d09d26..db780c8 100644
--- a/translation/README.md
+++ b/translation/README.md
@@ -13,7 +13,7 @@
 - Virtual environment: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh)
 - Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py)
 - Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
-- Performance report: [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md)
+- 
Performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
 
 ## 1. Design Goals
@@ -530,6 +530,44 @@ curl -X POST http://127.0.0.1:6006/translate \
 Dataset:
 - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
+Latest reports:
+- Summary: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
+- Full Markdown: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
+- Full JSON: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
+
+### 10.1 Which results to read first
+
+The three result sets are kept separate rather than mixed into one table:
+
+- `batch_sweep`
+  Fixed `concurrency=1`; compares single-stream batch performance across different `batch_size` values only
+- `concurrency_sweep`
+  Fixed `batch_size=1`; shows latency and throughput of "single requests" at different concurrency levels
+- `batch x concurrency matrix`
+  Looks at the interaction between `batch_size` and `concurrency`; this round is capped at `batch_size * concurrency <= 128`
+
+Recommendations:
+
+- For online query translation latency, look at `concurrency_sweep` first
+- For offline bulk translation throughput, look at `batch_sweep` first
+- For the capacity limit of a single worker, then consult the `batch x concurrency matrix`
+
+### 10.2 Parameters for this round
+
+Test date: `2026-03-18`
+
+Environment:
+- GPU: `Tesla T4 16GB`
+- Python env: `.venv-translator`
+- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
+
+Shared parameters:
+- cache: disabled (`--disable-cache`), so that cache hits do not skew the results
+- `batch_sweep`: `256` items per case
+- `concurrency_sweep`: fixed `batch_size=1`, `32` requests per case
+- `batch x concurrency matrix`: `32` requests per case, keeping only `batch_size * concurrency <= 128`
+- warmup: `1` batch
 
 Reproduction command:
 
 ```bash
 cd /data/saas-search
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py
 ```
 
-Single-model reproduction example:
+Reproduction command for this round's extended benchmark:
+
+```bash
+cd /data/saas-search
+./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
+    --suite extended \
+    --disable-cache \
+    --serial-items-per-case 256 \
+    --concurrency-requests-per-case 32 \
+    --concurrency-batch-size 1 \
+    --output-dir perf_reports/20260318/translation_local_models
+```
+
+Single-model extended benchmark example:
 
 ```bash
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
     --single \
+    --suite extended \
     --model opus-mt-zh-en \
     --source-lang zh \
     --target-lang en \
     --column title_cn \
-    --scene sku_name
+    --scene sku_name \
+    --disable-cache \
+    --batch-size-list 1,4,8,16,32,64 \
+    --concurrency-list 1,2,4,8,16,64 \
+    --serial-items-per-case 256 \
+    --concurrency-requests-per-case 32 \
+    --concurrency-batch-size 1
 ```
 
 Single-request latency reproduction:
@@ -554,37 +612,143 @@
 ```bash
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
     --single \
+    --suite extended \
     --model nllb-200-distilled-600m \
     --source-lang zh \
     --target-lang en \
     --column title_cn \
     --scene sku_name \
-    --batch-size 1 \
-    --limit 100
+    --disable-cache \
+    --batch-size-list 1 \
+    --concurrency-list 1,2,4,8,16,64 \
+    --serial-items-per-case 256 \
+    --concurrency-requests-per-case 32 \
+    --concurrency-batch-size 1
 ```
 
-Notes:
-- For the current script and local backend, a "single request" is exactly equivalent to `batch_size=1`
-- In that case the script's `batch_latency_*` metrics can be read directly as single-request latency
-- Online search query translation should focus on these numbers rather than big-batch throughput
+### 10.3 Single-stream batch results
+
+This set is `concurrency=1` only; do not read its `request p95` as the p95 of concurrent online requests.
+
+`nllb-200-distilled-600m zh -> en`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 2.91 | 343.488 | 616.27 |
+| 4 | 8.44 | 118.545 | 722.95 |
+| 8 | 14.85 | 67.335 | 728.47 |
+| 16 | 27.28 | 36.662 | 769.18 |
+| 32 | 38.6 | 25.908 | 1369.88 |
+| 64 | 58.3 | 17.152 | 1659.9 |
+
+`nllb-200-distilled-600m en -> zh`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 1.91 | 524.917 | 866.33 |
+| 4 | 4.94 | 202.473 | 1599.74 |
+| 8 | 8.25 | 121.188 | 1632.29 |
+| 16 | 13.52 | 73.956 | 1649.65 |
+| 32 | 21.27 | 47.017 | 1827.16 |
+| 64 | 32.64 | 30.641 | 2031.25 |
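Every throughput and latency column in these tables is derived from the per-request wall-clock samples the benchmark script collects. A minimal sketch of that derivation, assuming a linear-interpolation `percentile` helper and made-up sample latencies (the script's own helper and exact rounding may differ):

```python
import statistics
from typing import Sequence

def percentile(samples: Sequence[float], q: float) -> float:
    # Linear-interpolation percentile over latency samples (ms).
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    pos = (len(ordered) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac

# Hypothetical per-request wall times (ms) for a batch_size=1 case.
batch_latencies_ms = [616.0, 340.0, 350.0, 330.0]
total_items = 4                                   # requests * batch_size
translate_seconds = sum(batch_latencies_ms) / 1000  # serial case: wall ~ sum

metrics = {
    "items_per_second": round(total_items / translate_seconds, 2),
    "avg_item_latency_ms": round(translate_seconds / total_items * 1000, 3),
    "avg_request_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
    "request_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
    "request_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
}
```

With `batch_size=1` the per-item and per-request averages coincide; at larger batch sizes `avg_item_latency_ms` drops while per-request latency grows, which is exactly the trade-off the tables show.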
-Current single-request measurements (`Tesla T4`, `limit=100`):
-- `nllb-200-distilled-600m zh->en`: p50 ~`292.54 ms`, p95 ~`624.12 ms`, mean ~`321.91 ms`
-- `nllb-200-distilled-600m en->zh`: p50 ~`481.61 ms`, p95 ~`1171.71 ms`, mean ~`542.47 ms`
+
+`opus-mt-zh-en zh -> en`
-Current benchmark environment:
-- GPU: `Tesla T4 16GB`
-- Python env: `.venv-translator`
-- Data volume: `18,576` product titles
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 6.15 | 162.536 | 274.74 |
+| 4 | 15.34 | 65.192 | 356.0 |
+| 8 | 25.51 | 39.202 | 379.84 |
+| 16 | 41.44 | 24.129 | 797.93 |
+| 32 | 54.36 | 18.397 | 1693.31 |
+| 64 | 70.15 | 14.255 | 2161.59 |
+
+`opus-mt-en-zh en -> zh`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 4.53 | 220.598 | 411.57 |
+| 4 | 10.12 | 98.844 | 761.49 |
+| 8 | 14.63 | 68.361 | 1930.85 |
+| 16 | 24.33 | 41.1 | 2098.54 |
+| 32 | 33.91 | 29.487 | 2152.28 |
+| 64 | 42.47 | 23.547 | 2371.85 |
+
+Batch-mode takeaways:
+
+- On raw throughput alone, all 4 directions peak at `batch_size=64`
+- If per-batch tail latency also matters, `batch_size=16` is usually the more balanced choice
+- `opus-mt-zh-en` is the fastest model for bulk translation this round; `nllb en->zh` is the slowest
+
+### 10.4 Single-request concurrency results
+
+This set fixes `batch_size=1`, so it can be read directly as "how a single request behaves at different concurrency levels".
+
+`nllb-200-distilled-600m zh -> en`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 4.17 | 239.99 | 226.34 | 373.27 |
+| 2 | 4.1 | 477.99 | 459.36 | 703.96 |
+| 4 | 4.1 | 910.74 | 884.71 | 1227.01 |
+| 8 | 4.04 | 1697.73 | 1818.48 | 2383.8 |
+| 16 | 4.07 | 2801.91 | 3473.63 | 4145.92 |
+| 64 | 4.04 | 3714.49 | 3610.08 | 7337.3 |
+
+`nllb-200-distilled-600m en -> zh`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 2.16 | 463.18 | 439.54 | 670.78 |
+| 2 | 2.15 | 920.48 | 908.27 | 1213.3 |
+| 4 | 2.16 | 1759.87 | 1771.58 | 2158.04 |
+| 8 | 2.15 | 3284.44 | 3658.45 | 3971.01 |
+| 16 | 2.14 | 5669.15 | 7117.7 | 7522.48 |
+| 64 | 2.14 | 7631.14 | 7510.97 | 14139.03 |
+
+`opus-mt-zh-en zh -> en`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 9.21 | 108.53 | 91.7 | 179.12 |
+| 2 | 8.92 | 219.19 | 212.29 | 305.34 |
+| 4 | 9.09 | 411.76 | 420.08 | 583.97 |
+| 8 | 8.85 | 784.14 | 835.73 | 1043.06 |
+| 16 | 9.01 | 1278.4 | 1483.34 | 1994.56 |
+| 64 | 8.82 | 1687.08 | 1563.48 | 3381.58 |
+
+`opus-mt-en-zh en -> zh`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 3.6 | 277.73 | 145.85 | 1180.37 |
+| 2 | 3.55 | 559.38 | 346.71 | 1916.96 |
+| 4 | 3.53 | 997.71 | 721.04 | 2944.17 |
+| 8 | 3.51 | 1644.28 | 1590.93 | 3632.99 |
+| 16 | 3.5 | 2600.18 | 2586.34 | 5554.04 |
+| 64 | 3.52 | 3366.52 | 2780.0 | 7950.41 |
 
-Final performance results:
+Concurrency takeaways:
+
+- The local seq2seq backend holds a single per-model lock internally, so raising client concurrency against one worker barely improves throughput; it mostly stacks queueing time onto request latency
+- If online query translation needs stable latency, keep concurrency low; past `8` concurrent requests, p95 degrades sharply in all 4 directions
+- For online use, `opus-mt-zh-en` has the steadiest latency; `nllb en->zh` is the slowest, and its tail latency inflates the most as concurrency grows
+
+### 10.5 How to read batch x concurrency
+
+The full matrix is in:
+- [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
+
+This table mainly answers two questions:
-| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
-|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
-| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
-| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
-| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
-| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
+
+- If you already plan to run offline batch processing: once `batch_size` is large, does throughput keep growing at different concurrency levels
+- If a single worker has to serve live requests: at which `batch_size x concurrency` combination does visible queueing start
+
+Common traits of this round's matrix:
+
+- Throughput is driven mainly by `batch_size`; `concurrency` is not a meaningful source of gains
+- With `batch_size` fixed, raising `concurrency` from `1` to `2/4/8/...` changes `items/s` very little, while `avg req ms / p95` keeps climbing
+- The current implementation therefore behaves like a "single worker + internally serialized GPU inference" service, not one whose throughput scales with client concurrency
 
 NLLB performance-optimization experience:
@@ -632,7 +796,7 @@ NLLB performance-optimization experience:
 - Run mode: single worker, to avoid loading the model more than once
 
 More detailed performance notes:
-- [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md)
+- [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
 
 ## 11. Development Notes
-- libgit2 0.21.2
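The "single worker + internally serialized GPU inference" behaviour described above can be reproduced with a toy model: a backend whose inference path is guarded by one lock. `LockedBackend` and its sleep-based timing are illustrative assumptions, not the service's real code:

```python
import concurrent.futures
import threading
import time

class LockedBackend:
    """Toy stand-in for the local seq2seq backend: one model, one lock,
    so all requests serialize regardless of client-side concurrency."""

    def __init__(self, per_item_seconds: float = 0.01) -> None:
        self._lock = threading.Lock()
        self.per_item_seconds = per_item_seconds

    def translate(self, batch: list[str]) -> list[str]:
        with self._lock:  # only one inference in flight at a time
            time.sleep(self.per_item_seconds * len(batch))
            return [f"translated:{text}" for text in batch]

def run(concurrency: int, requests: int = 16) -> tuple[float, float]:
    backend = LockedBackend()
    latencies: list[float] = []
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        def call(i: int) -> None:
            t0 = time.perf_counter()
            backend.translate([f"text-{i}"])
            latencies.append((time.perf_counter() - t0) * 1000)
        list(pool.map(call, range(requests)))
    wall = time.perf_counter() - start
    # throughput (requests/s) and worst-case request latency (ms)
    return requests / wall, max(latencies)

tput1, lat1 = run(concurrency=1)
tput8, lat8 = run(concurrency=8)
# Requests/s stays roughly flat, while tail latency grows with queue depth.
```

Under this model, raising client concurrency cannot add throughput because the lock makes the GPU a single serial resource; the waiting simply shows up as request latency, which is the same pattern as in the concurrency-sweep tables.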