Commit 2a6d9d76f52556f65e4c3291fc402660f6a21817

Authored by tangwang
1 parent cd4ce66d

Updated the benchmark script and docs so that "single request / single-stream batch / concurrency / batch×concurrency matrix" results are presented fully separately.

The changes touch:

- scripts/benchmark_translation_local_models.py: added `--suite extended`, supporting batch_size=1,4,8,16,32,64, concurrency=1,2,4,8,16,64, and the combination matrix constrained by batch_size * concurrency <= 128. Single-scenario mode now loads only the target model, so load_seconds is cleaner, and `--disable-cache` is supported.
- translation/README.md: split the performance section into batch_sweep, concurrency_sweep, and batch x concurrency matrix, and added the parameters, reproduction commands, and summary tables for this re-run.
- perf_reports/20260318/translation_local_models/README.md: added the summary for this follow-up round. Full results are in translation_local_models_extended_221846.md and translation_local_models_extended_221846.json.

The core conclusions of this round are clear:

- For online single requests, read concurrency_sweep, i.e. the table with batch_size=1 held fixed.
- For offline bulk throughput, read batch_sweep. The highest raw throughput in all 4 directions occurs at batch_size=64, but batch_size=16 still looks like the more balanced default.
- The current local seq2seq backend holds a per-model lock, so raising client concurrency barely increases throughput; it mainly converts queueing time into higher p95. Concurrency is therefore a latency-budget question, not a throughput-scaling lever.
- The fastest direction for online single requests this round is opus-mt-zh-en; the slowest, and the one where concurrency amplifies latency most, is nllb-200-distilled-600m en->zh.
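The extended suite described above can be sketched as follows. The default lists and the `batch_size * concurrency <= 128` constraint come from the commit message; the helper name is illustrative, not the committed implementation.

```python
from itertools import product

DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64]
DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64]

def matrix_cases(batch_sizes, concurrencies, max_product=128):
    """Enumerate (batch_size, concurrency) matrix cases, skipping any
    combination whose product exceeds the configured limit."""
    cases = []
    for batch_size, concurrency in product(batch_sizes, concurrencies):
        if max_product and batch_size * concurrency > max_product:
            continue  # over the batch_size * concurrency budget
        cases.append((batch_size, concurrency))
    return cases
```

With the defaults this keeps combinations like (64, 2) but drops (64, 4), which is why the matrix tables below stop at concurrency 2 for batch 64.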
 
 
-product_enrich : Partial Mode
+
+
+nllb-200-distilled-600M performance optimization
+Research performance-optimization options for seq2seq, transformer-architecture models like nllb-200-distilled-600M: ways to raise the throughput of the online translation service and cut latency; also look into online inference serving solutions and find a high-performance way to run it as a service
+
+cnclip performance optimization
+
+rerank performance optimization
+
+
+Timeouts
+Hard timeout while the query-analysis stage waits for translation/embedding
+Config file: config/config.yaml
+Config key: query_config.async_wait_timeout_ms: 80
+Effective in code: query/query_parser.py converts this value to seconds and passes it to wait(...)
+2) Embedding HTTP call timeout (Text/Image)
+No environment-variable override is used any more (the previously mentioned EMBEDDING_HTTP_TIMEOUT_SEC is no longer honored)
+Config file: config/config.yaml
+Config key: services.embedding.providers.http.timeout_sec (a sample default of 60 has been added to the YAML)
+Effective in code:
+embeddings/text_encoder.py: requests.post(..., timeout=self.timeout_sec)
+embeddings/image_encoder.py: requests.post(..., timeout=self.timeout_sec)
+
+
+
+
+product_enrich : Partial Mode : done
 https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-menu-2400256.d_0_3_0_7.74a630119Ct6zR
 In the messages array, set the role of the last message to assistant, provide the prefix in its content, and set "partial": true on that message. The messages format is:
 [
@@ -15,7 +41,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-men
 }
 ]
 The model starts generating from the prefix content.
-
 Non-thinking mode is supported.
 
 
@@ -41,12 +66,6 @@ https://help.aliyun.com/zh/model-studio/partial-mode?spm=a2c4g.11186623.help-men
 
 
 
-The translation cache needs refactoring
-
-
-
-
-
 
 suggest index: currently a full-rebuild script; to be handed over to 金伟
 
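The two timeout knobs in the notes above can be read as small config lookups. The helper names here are hypothetical; only the config keys and the ms-to-seconds conversion come from the notes.

```python
def async_wait_timeout_seconds(config) -> float:
    # query_config.async_wait_timeout_ms (e.g. 80) is converted to seconds
    # before being handed to wait(...) in query/query_parser.py.
    return config["query_config"]["async_wait_timeout_ms"] / 1000.0

def embedding_http_timeout_seconds(config) -> float:
    # services.embedding.providers.http.timeout_sec, with the YAML sample
    # default of 60; passed as requests.post(..., timeout=...) in the encoders.
    return float(config["services"]["embedding"]["providers"]["http"].get("timeout_sec", 60))
```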
perf_reports/20260318/translation_local_models/README.md 0 → 100644
@@ -0,0 +1,101 @@
# Local Translation Model Benchmark Report

Benchmark script:
- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)

Full results:
- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)

Run date:
- `2026-03-18`

Environment:
- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)

## Method

This round splits the results into 3 categories:

- `batch_sweep`: fixed `concurrency=1`, comparing `batch_size=1/4/8/16/32/64`
- `concurrency_sweep`: fixed `batch_size=1`, comparing `concurrency=1/2/4/8/16/64`
- `batch x concurrency matrix`: combined runs, keeping only `batch_size * concurrency <= 128`

Common settings:
- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per case
- `concurrency_sweep`: `32` requests per case
- `matrix`: `32` requests per case
- Warmup: `1` batch

Reproduction command:

```bash
cd /data/saas-search
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --suite extended \
  --disable-cache \
  --serial-items-per-case 256 \
  --concurrency-requests-per-case 32 \
  --concurrency-batch-size 1 \
  --output-dir perf_reports/20260318/translation_local_models
```

## Key Results

### 1. Single-stream batch sweep

| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |

Interpretation:
- On raw throughput, all 4 directions peak at `batch_size=64`
- If balanced per-batch latency also matters, `batch_size=16` is the better default bulk candidate
- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`

### 2. Single-request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |

Interpretation:
- At `batch_size=1`, raising client concurrency barely raises throughput; it mostly converts waiting time into request latency
- Online query translation suits low concurrency; beyond `8` concurrent clients, p95 degrades clearly in all 4 directions
- The most stable model online is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh`

### 3. batch x concurrency matrix

Highest-throughput combinations (under the `batch_size * concurrency <= 128` constraint):

| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
|---|---|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |

Interpretation:
- In the current implementation, throughput is driven by `batch_size`, not by `concurrency`
- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly raises request latency
- This indicates the local translation service behaves like a "single worker + serial GPU processing" model; capacity planning cannot rely on client concurrency to buy throughput

## Recommendation

- For online query translation, read `concurrency_sweep` and treat `batch_size=1` as the primary metric baseline
- For offline bulk translation, read `batch_sweep`; start from `batch_size=16` and move up to `32/64` depending on the throughput target
- As long as the current single-worker architecture stays, treat the allowed concurrency as a latency-budget decision, not a throughput-scaling lever
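The `Req p50 ms` / `Req p95 ms` columns quoted throughout imply a percentile over per-request latencies. A sketch of one common nearest-rank definition; the benchmark script's exact formula may differ.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile over measured request latencies (ms)."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Rank of the p-th percentile, clamped to at least the first sample.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```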
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs1.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs16.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs32.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs4.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs64.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_en_zh_bs8.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs1.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs16.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs32.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs4.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs64.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/raw/nllb_zh_en_bs8.jsonl 0 → 100644
perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md 0 → 100644
@@ -0,0 +1,263 @@
# Local Translation Model Extended Benchmark

- Generated at: `2026-03-18T21:28:09`
- Suite: `extended`
- Python: `3.12.3`
- Torch: `2.10.0+cu128`
- Transformers: `5.3.0`
- CUDA: `True`
- GPU: `Tesla T4` (15.56 GiB)

## Reading Guide

- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.
- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.
- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.

## nllb-200-distilled-600m zh->en

- Direction: `zh -> en`
- Column: `title_cn`
- Loaded rows: `2048`
- Load time: `6.118 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 2.91 | 2.91 | 343.48 | 315.98 | 616.27 | 2.633 |
| 4 | 256 | 64 | 8.44 | 2.11 | 474.17 | 474.75 | 722.95 | 2.659 |
| 8 | 256 | 32 | 14.85 | 1.86 | 538.68 | 564.97 | 728.47 | 2.699 |
| 16 | 256 | 16 | 27.28 | 1.7 | 586.59 | 633.16 | 769.18 | 2.765 |
| 32 | 256 | 8 | 38.6 | 1.21 | 829.04 | 761.74 | 1369.88 | 2.987 |
| 64 | 256 | 4 | 58.3 | 0.91 | 1097.73 | 919.35 | 1659.9 | 3.347 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 4.17 | 4.17 | 239.99 | 226.34 | 373.27 | 3.347 |
| 2 | 32 | 32 | 4.1 | 4.1 | 477.99 | 459.36 | 703.96 | 3.347 |
| 4 | 32 | 32 | 4.1 | 4.1 | 910.74 | 884.71 | 1227.01 | 3.347 |
| 8 | 32 | 32 | 4.04 | 4.04 | 1697.73 | 1818.48 | 2383.8 | 3.347 |
| 16 | 32 | 32 | 4.07 | 4.07 | 2801.91 | 3473.63 | 4145.92 | 3.347 |
| 64 | 32 | 32 | 4.04 | 4.04 | 3714.49 | 3610.08 | 7337.3 | 3.347 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 3.94 | 3.94 | 253.96 | 231.8 | 378.76 | 3.347 |
| 1 | 2 | 32 | 32 | 3.92 | 3.92 | 500.35 | 480.18 | 747.3 | 3.347 |
| 1 | 4 | 32 | 32 | 4.05 | 4.05 | 922.53 | 894.57 | 1235.7 | 3.347 |
| 1 | 8 | 32 | 32 | 4.08 | 4.08 | 1679.71 | 1806.95 | 2346.11 | 3.347 |
| 1 | 16 | 32 | 32 | 4.05 | 4.05 | 2816.51 | 3485.68 | 4181.39 | 3.347 |
| 1 | 64 | 32 | 32 | 4.06 | 4.06 | 3711.9 | 3614.34 | 7319.52 | 3.347 |
| 4 | 1 | 128 | 32 | 9.12 | 2.28 | 438.42 | 386.12 | 724.47 | 3.347 |
| 4 | 2 | 128 | 32 | 9.2 | 2.3 | 852.37 | 756.47 | 1366.7 | 3.347 |
| 4 | 4 | 128 | 32 | 9.17 | 2.29 | 1642.9 | 1524.56 | 2662.49 | 3.347 |
| 4 | 8 | 128 | 32 | 9.18 | 2.29 | 2974.7 | 3168.19 | 4594.99 | 3.347 |
| 4 | 16 | 128 | 32 | 9.23 | 2.31 | 4897.43 | 5815.91 | 8210.23 | 3.347 |
| 8 | 1 | 256 | 32 | 14.73 | 1.84 | 543.03 | 556.97 | 734.67 | 3.347 |
| 8 | 2 | 256 | 32 | 14.73 | 1.84 | 1064.16 | 1132.78 | 1405.53 | 3.347 |
| 8 | 4 | 256 | 32 | 14.79 | 1.85 | 2035.46 | 2237.69 | 2595.53 | 3.347 |
| 8 | 8 | 256 | 32 | 14.76 | 1.84 | 3763.14 | 4345.6 | 5025.02 | 3.347 |
| 8 | 16 | 256 | 32 | 14.77 | 1.85 | 6383.08 | 8053.36 | 9380.87 | 3.347 |
| 16 | 1 | 512 | 32 | 24.88 | 1.55 | 643.11 | 661.91 | 814.26 | 3.347 |
| 16 | 2 | 512 | 32 | 24.91 | 1.56 | 1265.17 | 1326.85 | 1576.96 | 3.347 |
| 16 | 4 | 512 | 32 | 24.94 | 1.56 | 2446.55 | 2594.26 | 3027.37 | 3.347 |
| 16 | 8 | 512 | 32 | 24.8 | 1.55 | 4548.92 | 5035.56 | 5852.87 | 3.347 |
| 32 | 1 | 1024 | 32 | 34.9 | 1.09 | 916.78 | 823.68 | 1637.59 | 3.347 |
| 32 | 2 | 1024 | 32 | 34.98 | 1.09 | 1807.39 | 1717.89 | 2514.27 | 3.347 |
| 32 | 4 | 1024 | 32 | 34.85 | 1.09 | 3513.42 | 3779.99 | 4750.34 | 3.347 |
| 64 | 1 | 2048 | 32 | 53.24 | 0.83 | 1202.07 | 993.59 | 1838.72 | 3.642 |
| 64 | 2 | 2048 | 32 | 53.95 | 0.84 | 2344.92 | 2465.53 | 3562.04 | 3.642 |

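For quick reading of the batch sweep above: single-stream throughput grows from 2.91 items/s at batch 1 to 58.3 at batch 64, roughly a 20x speedup. A small helper (illustrative, not part of the committed script) normalizes a sweep against its batch=1 baseline:

```python
def batch_scaling(items_per_s_by_batch):
    """Return throughput relative to the batch_size=1 baseline,
    e.g. {1: 1.0, 16: 9.37, 64: 20.03} for the zh->en sweep above."""
    base = items_per_s_by_batch[1]
    return {batch: round(value / base, 2) for batch, value in items_per_s_by_batch.items()}
```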
## nllb-200-distilled-600m en->zh

- Direction: `en -> zh`
- Column: `title`
- Loaded rows: `2048`
- Load time: `6.137 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 1.91 | 1.91 | 524.91 | 497.81 | 866.33 | 2.636 |
| 4 | 256 | 64 | 4.94 | 1.23 | 809.89 | 743.87 | 1599.74 | 2.684 |
| 8 | 256 | 32 | 8.25 | 1.03 | 969.5 | 796.64 | 1632.29 | 2.723 |
| 16 | 256 | 16 | 13.52 | 0.85 | 1183.29 | 1169.28 | 1649.65 | 2.817 |
| 32 | 256 | 8 | 21.27 | 0.66 | 1504.54 | 1647.92 | 1827.16 | 3.007 |
| 64 | 256 | 4 | 32.64 | 0.51 | 1961.02 | 1985.8 | 2031.25 | 3.408 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 2.16 | 2.16 | 463.18 | 439.54 | 670.78 | 3.408 |
| 2 | 32 | 32 | 2.15 | 2.15 | 920.48 | 908.27 | 1213.3 | 3.408 |
| 4 | 32 | 32 | 2.16 | 2.16 | 1759.87 | 1771.58 | 2158.04 | 3.408 |
| 8 | 32 | 32 | 2.15 | 2.15 | 3284.44 | 3658.45 | 3971.01 | 3.408 |
| 16 | 32 | 32 | 2.14 | 2.14 | 5669.15 | 7117.7 | 7522.48 | 3.408 |
| 64 | 32 | 32 | 2.14 | 2.14 | 7631.14 | 7510.97 | 14139.03 | 3.408 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 2.17 | 2.17 | 461.18 | 439.28 | 673.59 | 3.408 |
| 1 | 2 | 32 | 32 | 2.15 | 2.15 | 919.12 | 904.83 | 1209.83 | 3.408 |
| 1 | 4 | 32 | 32 | 2.14 | 2.14 | 1771.34 | 1810.99 | 2171.71 | 3.408 |
| 1 | 8 | 32 | 32 | 2.12 | 2.12 | 3324.85 | 3704.57 | 3999.29 | 3.408 |
| 1 | 16 | 32 | 32 | 2.15 | 2.15 | 5651.09 | 7121.8 | 7495.67 | 3.408 |
| 1 | 64 | 32 | 32 | 2.15 | 2.15 | 7619.16 | 7505.14 | 14111.53 | 3.408 |
| 4 | 1 | 128 | 32 | 5.13 | 1.28 | 780.0 | 661.74 | 1586.9 | 3.408 |
| 4 | 2 | 128 | 32 | 5.09 | 1.27 | 1551.81 | 1368.81 | 2775.92 | 3.408 |
| 4 | 4 | 128 | 32 | 5.13 | 1.28 | 2995.57 | 2836.77 | 4532.67 | 3.408 |
| 4 | 8 | 128 | 32 | 5.16 | 1.29 | 5619.47 | 6046.81 | 7767.22 | 3.408 |
| 4 | 16 | 128 | 32 | 5.15 | 1.29 | 9546.3 | 12146.68 | 14537.41 | 3.408 |
| 8 | 1 | 256 | 32 | 8.36 | 1.04 | 957.45 | 784.85 | 1591.01 | 3.408 |
| 8 | 2 | 256 | 32 | 8.29 | 1.04 | 1906.56 | 1847.53 | 2799.1 | 3.408 |
| 8 | 4 | 256 | 32 | 8.3 | 1.04 | 3691.59 | 3773.67 | 4849.58 | 3.408 |
| 8 | 8 | 256 | 32 | 8.29 | 1.04 | 6954.22 | 7681.7 | 8904.0 | 3.408 |
| 8 | 16 | 256 | 32 | 8.31 | 1.04 | 11923.83 | 14903.17 | 17072.19 | 3.408 |
| 16 | 1 | 512 | 32 | 13.83 | 0.86 | 1156.84 | 904.4 | 1609.17 | 3.408 |
| 16 | 2 | 512 | 32 | 13.91 | 0.87 | 2276.35 | 2356.15 | 3121.45 | 3.408 |
| 16 | 4 | 512 | 32 | 13.89 | 0.87 | 4440.77 | 4733.34 | 5488.65 | 3.408 |
| 16 | 8 | 512 | 32 | 13.83 | 0.86 | 8428.17 | 9407.5 | 10409.59 | 3.408 |
| 32 | 1 | 1024 | 32 | 23.28 | 0.73 | 1374.39 | 1603.35 | 1806.99 | 3.408 |
| 32 | 2 | 1024 | 32 | 23.26 | 0.73 | 2721.28 | 2607.91 | 3437.08 | 3.408 |
| 32 | 4 | 1024 | 32 | 23.18 | 0.72 | 5297.48 | 5234.53 | 6758.76 | 3.408 |
| 64 | 1 | 2048 | 32 | 34.97 | 0.55 | 1829.91 | 1899.54 | 2039.18 | 3.69 |
| 64 | 2 | 2048 | 32 | 34.78 | 0.54 | 3615.7 | 3795.36 | 4014.19 | 3.69 |

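The concurrency sweep above fits a simple serial-worker model: with one GPU worker, `c` concurrent clients mostly queue, so mean request latency grows roughly linearly in `c` (for this direction, 463 ms x 8 ≈ 3.7 s predicted vs the measured 3.28 s average at c=8). A back-of-envelope sketch of that prediction, not a claim about the script:

```python
def expected_latency_ms(single_request_ms: float, concurrency: int) -> float:
    """Rough queueing estimate under one serial worker: each request waits
    for roughly (concurrency) service times on average."""
    return single_request_ms * concurrency
```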
## opus-mt-zh-en zh->en

- Direction: `zh -> en`
- Column: `title_cn`
- Loaded rows: `2048`
- Load time: `3.2561 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 6.15 | 6.15 | 162.53 | 145.96 | 274.74 | 0.285 |
| 4 | 256 | 64 | 15.34 | 3.83 | 260.77 | 231.11 | 356.0 | 0.306 |
| 8 | 256 | 32 | 25.51 | 3.19 | 313.61 | 271.09 | 379.84 | 0.324 |
| 16 | 256 | 16 | 41.44 | 2.59 | 386.06 | 286.4 | 797.93 | 0.366 |
| 32 | 256 | 8 | 54.36 | 1.7 | 588.7 | 330.13 | 1693.31 | 0.461 |
| 64 | 256 | 4 | 70.15 | 1.1 | 912.33 | 447.93 | 2161.59 | 0.642 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 9.21 | 9.21 | 108.53 | 91.7 | 179.12 | 0.642 |
| 2 | 32 | 32 | 8.92 | 8.92 | 219.19 | 212.29 | 305.34 | 0.642 |
| 4 | 32 | 32 | 9.09 | 9.09 | 411.76 | 420.08 | 583.97 | 0.642 |
| 8 | 32 | 32 | 8.85 | 8.85 | 784.14 | 835.73 | 1043.06 | 0.642 |
| 16 | 32 | 32 | 9.01 | 9.01 | 1278.4 | 1483.34 | 1994.56 | 0.642 |
| 64 | 32 | 32 | 8.82 | 8.82 | 1687.08 | 1563.48 | 3381.58 | 0.642 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 9.01 | 9.01 | 110.96 | 97.37 | 191.91 | 0.642 |
| 1 | 2 | 32 | 32 | 9.2 | 9.2 | 212.43 | 191.27 | 299.65 | 0.642 |
| 1 | 4 | 32 | 32 | 8.86 | 8.86 | 423.8 | 423.61 | 596.78 | 0.642 |
| 1 | 8 | 32 | 32 | 9.04 | 9.04 | 764.39 | 810.05 | 1035.74 | 0.642 |
| 1 | 16 | 32 | 32 | 9.06 | 9.06 | 1272.52 | 1468.35 | 1998.99 | 0.642 |
| 1 | 64 | 32 | 32 | 9.04 | 9.04 | 1663.7 | 1556.13 | 3296.1 | 0.642 |
| 4 | 1 | 128 | 32 | 15.17 | 3.79 | 263.61 | 201.11 | 369.93 | 0.642 |
| 4 | 2 | 128 | 32 | 15.25 | 3.81 | 517.25 | 397.13 | 1319.45 | 0.642 |
| 4 | 4 | 128 | 32 | 15.27 | 3.82 | 899.61 | 746.76 | 1914.27 | 0.642 |
| 4 | 8 | 128 | 32 | 15.39 | 3.85 | 1539.87 | 1466.41 | 2918.85 | 0.642 |
| 4 | 16 | 128 | 32 | 15.48 | 3.87 | 2455.89 | 2728.17 | 4644.99 | 0.642 |
| 8 | 1 | 256 | 32 | 25.44 | 3.18 | 314.41 | 272.16 | 386.05 | 0.642 |
| 8 | 2 | 256 | 32 | 25.42 | 3.18 | 620.41 | 541.88 | 1418.79 | 0.642 |
| 8 | 4 | 256 | 32 | 24.96 | 3.12 | 1226.44 | 1070.41 | 2940.56 | 0.642 |
| 8 | 8 | 256 | 32 | 25.44 | 3.18 | 2252.81 | 2148.42 | 4143.66 | 0.642 |
| 8 | 16 | 256 | 32 | 25.44 | 3.18 | 3961.06 | 5028.31 | 6340.79 | 0.642 |
| 16 | 1 | 512 | 32 | 31.66 | 1.98 | 505.38 | 321.69 | 1963.58 | 0.653 |
| 16 | 2 | 512 | 32 | 31.41 | 1.96 | 1009.12 | 677.84 | 2341.01 | 0.653 |
| 16 | 4 | 512 | 32 | 31.17 | 1.95 | 2006.54 | 1562.68 | 4299.85 | 0.653 |
| 16 | 8 | 512 | 32 | 31.51 | 1.97 | 3483.96 | 4085.09 | 5902.32 | 0.653 |
| 32 | 1 | 1024 | 32 | 38.66 | 1.21 | 827.7 | 409.78 | 2162.48 | 0.748 |
| 32 | 2 | 1024 | 32 | 38.75 | 1.21 | 1613.28 | 916.1 | 3228.1 | 0.748 |
| 32 | 4 | 1024 | 32 | 38.66 | 1.21 | 3162.05 | 3239.36 | 4992.07 | 0.748 |
| 64 | 1 | 2048 | 32 | 52.44 | 0.82 | 1220.35 | 533.4 | 2508.12 | 0.939 |
| 64 | 2 | 2048 | 32 | 52.04 | 0.81 | 2443.4 | 2800.35 | 4765.45 | 0.939 |

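Each matrix section above reports many (batch, concurrency) rows, and the summary README quotes only the highest-throughput combination per direction. A trivial helper that performs that selection; the row format here is hypothetical:

```python
def best_combo(rows):
    """Pick the row with the highest items/s from a list of matrix rows,
    each a dict with "batch", "concurrency", and "items_per_s" keys."""
    return max(rows, key=lambda row: row["items_per_s"])
```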
## opus-mt-en-zh en->zh

- Direction: `en -> zh`
- Column: `title`
- Loaded rows: `2048`
- Load time: `3.1612 s`
- Device: `cuda`
- DType: `torch.float16`
- Cache disabled: `True`

### Batch Sweep (`concurrency=1`)

| Batch | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 256 | 256 | 4.53 | 4.53 | 220.59 | 177.37 | 411.57 | 0.285 |
| 4 | 256 | 64 | 10.12 | 2.53 | 395.37 | 319.23 | 761.49 | 0.307 |
| 8 | 256 | 32 | 14.63 | 1.83 | 546.88 | 390.81 | 1930.85 | 0.326 |
| 16 | 256 | 16 | 24.33 | 1.52 | 657.6 | 451.22 | 2098.54 | 0.37 |
| 32 | 256 | 8 | 33.91 | 1.06 | 943.59 | 603.45 | 2152.28 | 0.462 |
| 64 | 256 | 4 | 42.47 | 0.66 | 1507.01 | 1531.43 | 2371.85 | 0.644 |

### Concurrency Sweep (`batch_size=1`)

| Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 32 | 32 | 3.6 | 3.6 | 277.73 | 145.85 | 1180.37 | 0.644 |
| 2 | 32 | 32 | 3.55 | 3.55 | 559.38 | 346.71 | 1916.96 | 0.644 |
| 4 | 32 | 32 | 3.53 | 3.53 | 997.71 | 721.04 | 2944.17 | 0.644 |
| 8 | 32 | 32 | 3.51 | 3.51 | 1644.28 | 1590.93 | 3632.99 | 0.644 |
| 16 | 32 | 32 | 3.5 | 3.5 | 2600.18 | 2586.34 | 5554.04 | 0.644 |
| 64 | 32 | 32 | 3.52 | 3.52 | 3366.52 | 2780.0 | 7950.41 | 0.644 |

### Batch x Concurrency Matrix

| Batch | Concurrency | Rows | Requests | Items/s | Req/s | Avg req ms | Req p50 ms | Req p95 ms | Peak GPU GiB |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 1 | 32 | 32 | 3.54 | 3.54 | 282.22 | 148.26 | 1200.93 | 0.644 |
| 1 | 2 | 32 | 32 | 3.52 | 3.52 | 563.85 | 346.8 | 1957.63 | 0.644 |
| 1 | 4 | 32 | 32 | 3.52 | 3.52 | 1002.79 | 728.36 | 2949.99 | 0.644 |
| 1 | 8 | 32 | 32 | 3.55 | 3.55 | 1611.52 | 1462.84 | 3653.34 | 0.644 |
| 1 | 16 | 32 | 32 | 3.49 | 3.49 | 2582.77 | 2594.49 | 5504.07 | 0.644 |
| 1 | 64 | 32 | 32 | 3.53 | 3.53 | 3327.65 | 2753.11 | 7907.54 | 0.644 |
| 4 | 1 | 128 | 32 | 10.31 | 2.58 | 387.75 | 329.42 | 675.07 | 0.644 |
| 4 | 2 | 128 | 32 | 10.12 | 2.53 | 784.49 | 713.61 | 1561.27 | 0.644 |
| 4 | 4 | 128 | 32 | 10.08 | 2.52 | 1543.42 | 1438.42 | 2735.85 | 0.644 |
| 4 | 8 | 128 | 32 | 10.09 | 2.52 | 2918.06 | 2954.39 | 4327.47 | 0.644 |
| 4 | 16 | 128 | 32 | 10.16 | 2.54 | 5087.62 | 5734.85 | 7343.34 | 0.644 |
| 8 | 1 | 256 | 32 | 14.49 | 1.81 | 552.16 | 402.18 | 1963.8 | 0.644 |
| 8 | 2 | 256 | 32 | 14.54 | 1.82 | 1088.96 | 845.04 | 2565.11 | 0.644 |
| 8 | 4 | 256 | 32 | 14.48 | 1.81 | 2143.18 | 1832.69 | 4487.97 | 0.644 |
| 8 | 8 | 256 | 32 | 14.51 | 1.81 | 4051.04 | 3734.16 | 5976.18 | 0.644 |
| 8 | 16 | 256 | 32 | 14.55 | 1.82 | 6626.6 | 6868.91 | 9395.9 | 0.644 |
| 16 | 1 | 512 | 32 | 19.11 | 1.19 | 837.36 | 457.51 | 2030.2 | 0.661 |
| 16 | 2 | 512 | 32 | 19.26 | 1.2 | 1599.87 | 1141.06 | 2802.44 | 0.661 |
| 16 | 4 | 512 | 32 | 19.1 | 1.19 | 3130.81 | 3166.45 | 4987.49 | 0.661 |
| 16 | 8 | 512 | 32 | 19.27 | 1.2 | 5735.13 | 5351.82 | 8417.85 | 0.661 |
| 32 | 1 | 1024 | 32 | 26.04 | 0.81 | 1228.66 | 648.06 | 2167.39 | 0.751 |
| 32 | 2 | 1024 | 32 | 26.21 | 0.82 | 2372.52 | 2593.74 | 4206.96 | 0.751 |
| 32 | 4 | 1024 | 32 | 26.11 | 0.82 | 4499.96 | 3977.65 | 6920.57 | 0.751 |
| 64 | 1 | 2048 | 32 | 34.94 | 0.55 | 1831.48 | 2366.24 | 2473.74 | 0.942 |
| 64 | 2 | 2048 | 32 | 34.75 | 0.54 | 3603.99 | 3281.6 | 4948.72 | 0.942 |
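The table columns above are related by simple identities (assumed definitions, consistent with the numbers shown): Items/s = Rows / wall seconds, Req/s = Requests / wall seconds, so Items/s ≈ Req/s × Batch. A sketch:

```python
def throughput(rows: int, requests: int, wall_seconds: float):
    """Compute the two throughput columns from a case's row count,
    request count, and wall-clock duration."""
    items_per_s = rows / wall_seconds
    req_per_s = requests / wall_seconds
    return items_per_s, req_per_s
```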
scripts/benchmark_translation_local_models.py
@@ -4,6 +4,7 @@ @@ -4,6 +4,7 @@
4 from __future__ import annotations 4 from __future__ import annotations
5 5
6 import argparse 6 import argparse
  7 +import concurrent.futures
7 import copy 8 import copy
8 import csv 9 import csv
9 import json 10 import json
@@ -16,7 +17,7 @@ import sys @@ -16,7 +17,7 @@ import sys
16 import time 17 import time
17 from datetime import datetime 18 from datetime import datetime
18 from pathlib import Path 19 from pathlib import Path
19 -from typing import Any, Dict, Iterable, List 20 +from typing import Any, Dict, Iterable, List, Sequence
20 21
21 import torch 22 import torch
22 import transformers 23 import transformers
@@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402 @@ -30,6 +31,9 @@ from translation.service import TranslationService # noqa: E402
30 from translation.settings import get_translation_capability # noqa: E402 31 from translation.settings import get_translation_capability # noqa: E402
31 32
32 33
  34 +DEFAULT_BATCH_SIZES = [1, 4, 8, 16, 32, 64]
  35 +DEFAULT_CONCURRENCIES = [1, 2, 4, 8, 16, 64]
  36 +
33 SCENARIOS: List[Dict[str, str]] = [ 37 SCENARIOS: List[Dict[str, str]] = [
34 { 38 {
35 "name": "nllb-200-distilled-600m zh->en", 39 "name": "nllb-200-distilled-600m zh->en",
@@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [ @@ -69,7 +73,7 @@ SCENARIOS: List[Dict[str, str]] = [
69 def parse_args() -> argparse.Namespace: 73 def parse_args() -> argparse.Namespace:
70 parser = argparse.ArgumentParser(description="Benchmark local translation models") 74 parser = argparse.ArgumentParser(description="Benchmark local translation models")
71 parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path") 75 parser.add_argument("--csv-path", default="products_analyzed.csv", help="Benchmark dataset CSV path")
72 - parser.add_argument("--limit", type=int, default=0, help="Limit rows for faster experiments; 0 means all") 76 + parser.add_argument("--limit", type=int, default=0, help="Limit rows for baseline or single-case run; 0 means all")
73 parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports") 77 parser.add_argument("--output-dir", default="", help="Directory for JSON/Markdown reports")
74 parser.add_argument("--single", action="store_true", help="Run a single scenario in-process") 78 parser.add_argument("--single", action="store_true", help="Run a single scenario in-process")
75 parser.add_argument("--model", default="", help="Model name for --single mode") 79 parser.add_argument("--model", default="", help="Model name for --single mode")
@@ -84,9 +88,67 @@ def parse_args() -&gt; argparse.Namespace: @@ -84,9 +88,67 @@ def parse_args() -&gt; argparse.Namespace:
84 parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams") 88 parser.add_argument("--num-beams", type=int, default=0, help="Override configured num_beams")
85 parser.add_argument("--attn-implementation", default="", help="Override attention implementation, for example sdpa") 89 parser.add_argument("--attn-implementation", default="", help="Override attention implementation, for example sdpa")
86 parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring") 90 parser.add_argument("--warmup-batches", type=int, default=1, help="Warmup batches before measuring")
  91 + parser.add_argument("--disable-cache", action="store_true", help="Disable translation cache during benchmarks")
  92 + parser.add_argument(
  93 + "--suite",
  94 + choices=["baseline", "extended"],
  95 + default="baseline",
  96 + help="baseline keeps the previous all-scenarios summary; extended adds batch/concurrency/matrix sweeps",
  97 + )
  98 + parser.add_argument(
  99 + "--batch-size-list",
  100 + default="",
  101 + help="Comma-separated batch sizes for extended suite; default 1,4,8,16,32,64",
  102 + )
  103 + parser.add_argument(
  104 + "--concurrency-list",
  105 + default="",
  106 + help="Comma-separated concurrency levels for extended suite; default 1,2,4,8,16,64",
  107 + )
  108 + parser.add_argument(
  109 + "--serial-items-per-case",
  110 + type=int,
  111 + default=512,
  112 + help="Items per batch-size case in extended suite",
  113 + )
  114 + parser.add_argument(
  115 + "--concurrency-requests-per-case",
  116 + type=int,
  117 + default=128,
  118 + help="Requests per concurrency or matrix case in extended suite",
  119 + )
  120 + parser.add_argument(
  121 + "--concurrency-batch-size",
  122 + type=int,
  123 + default=1,
  124 + help="Batch size used by the dedicated concurrency sweep",
  125 + )
  126 + parser.add_argument(
  127 + "--max-batch-concurrency-product",
  128 + type=int,
  129 + default=128,
  130 + help="Skip matrix cases where batch_size * concurrency exceeds this value; 0 disables the limit",
  131 + )
87 return parser.parse_args() 132 return parser.parse_args()
88 133
89 134
  135 +def parse_csv_ints(raw: str, fallback: Sequence[int]) -> List[int]:
  136 + if not raw.strip():
  137 + return list(fallback)
  138 + values: List[int] = []
  139 + for item in raw.split(","):
  140 + stripped = item.strip()
  141 + if not stripped:
  142 + continue
  143 + value = int(stripped)
  144 + if value <= 0:
  145 + raise ValueError(f"Expected positive integer, got {value}")
  146 + values.append(value)
  147 + if not values:
  148 + raise ValueError("Parsed empty integer list")
  149 + return values
  150 +
  151 +
90 def load_texts(csv_path: Path, column: str, limit: int) -> List[str]: 152 def load_texts(csv_path: Path, column: str, limit: int) -> List[str]:
91 texts: List[str] = [] 153 texts: List[str] = []
92 with csv_path.open("r", encoding="utf-8") as handle: 154 with csv_path.open("r", encoding="utf-8") as handle:
@@ -102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -&gt; List[str]: @@ -102,9 +164,9 @@ def load_texts(csv_path: Path, column: str, limit: int) -&gt; List[str]:
102 return texts 164 return texts
103 165
104 166
105 -def batched(values: List[str], batch_size: int) -> Iterable[List[str]]: 167 +def batched(values: Sequence[str], batch_size: int) -> Iterable[List[str]]:
106 for start in range(0, len(values), batch_size): 168 for start in range(0, len(values), batch_size):
107 - yield values[start:start + batch_size] 169 + yield list(values[start:start + batch_size])
108 170
109 171
110 def percentile(values: List[float], p: float) -> float: 172 def percentile(values: List[float], p: float) -> float:
@@ -148,15 +210,34 @@ def build_environment_info() -> Dict[str, Any]:
     }
 
 
-def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
-    csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
+def scenario_from_args(args: argparse.Namespace) -> Dict[str, str]:
+    return {
+        "name": f"{args.model} {args.source_lang}->{args.target_lang}",
+        "model": args.model,
+        "source_lang": args.source_lang,
+        "target_lang": args.target_lang,
+        "column": args.column,
+        "scene": args.scene,
+    }
+
+
+def build_config_and_capability(
+    args: argparse.Namespace,
+    *,
+    batch_size_override: int | None = None,
+) -> tuple[Dict[str, Any], Dict[str, Any]]:
     config = copy.deepcopy(get_translation_config())
+    for name, cfg in config["capabilities"].items():
+        cfg["enabled"] = name == args.model
+    config["default_model"] = args.model
     capability = get_translation_capability(config, args.model, require_enabled=False)
     if args.device_override:
         capability["device"] = args.device_override
     if args.torch_dtype_override:
         capability["torch_dtype"] = args.torch_dtype_override
-    if args.batch_size:
+    if batch_size_override is not None:
+        capability["batch_size"] = batch_size_override
+    elif args.batch_size:
         capability["batch_size"] = args.batch_size
     if args.max_new_tokens:
         capability["max_new_tokens"] = args.max_new_tokens
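`build_config_and_capability` resolves the effective batch size with a three-level precedence: an explicit per-case override, then the CLI `--batch-size`, then the capability config. A distilled sketch of just that precedence (the `resolve_batch_size` helper name is hypothetical, introduced only for illustration):

```python
from typing import Optional


def resolve_batch_size(
    capability_batch_size: Optional[int],
    arg_batch_size: Optional[int],
    batch_size_override: Optional[int] = None,
) -> int:
    # Same precedence as in the diff: a per-case override wins even when it
    # differs from the CLI value; 1 is the fallback when nothing is set.
    if batch_size_override is not None:
        return batch_size_override
    if arg_batch_size:
        return arg_batch_size
    return int(capability_batch_size or 1)


assert resolve_batch_size(16, None) == 16
assert resolve_batch_size(16, 8) == 8
assert resolve_batch_size(16, 8, batch_size_override=32) == 32
assert resolve_batch_size(None, None) == 1
```

The per-case override is what lets the extended suite sweep batch sizes without mutating the shared CLI arguments.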
@@ -164,28 +245,59 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
         capability["num_beams"] = args.num_beams
     if args.attn_implementation:
         capability["attn_implementation"] = args.attn_implementation
+    if args.disable_cache:
+        capability["use_cache"] = False
     config["capabilities"][args.model] = capability
-    configured_batch_size = int(capability.get("batch_size") or 1)
-    batch_size = configured_batch_size
-    texts = load_texts(csv_path, args.column, args.limit)
+    return config, capability
 
-    service = TranslationService(config)
+
+def ensure_cuda_stats_reset() -> None:
     if torch.cuda.is_available():
         torch.cuda.empty_cache()
         torch.cuda.reset_peak_memory_stats()
 
-    load_start = time.perf_counter()
-    backend = service.get_backend(args.model)
-    load_seconds = time.perf_counter() - load_start
 
-    warmup_batches = min(max(args.warmup_batches, 0), max(1, math.ceil(len(texts) / batch_size)))
-    for batch in list(batched(texts, batch_size))[:warmup_batches]:
+def build_memory_metrics() -> Dict[str, Any]:
+    peak_gpu_mem_gb = None
+    peak_gpu_reserved_gb = None
+    if torch.cuda.is_available():
+        peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3)
+        peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3)
+    max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2)
+    return {
+        "max_rss_mb": max_rss_mb,
+        "peak_gpu_memory_gb": peak_gpu_mem_gb,
+        "peak_gpu_reserved_gb": peak_gpu_reserved_gb,
+    }
+
+
+def make_request_payload(batch: Sequence[str]) -> str | List[str]:
+    if len(batch) == 1:
+        return batch[0]
+    return list(batch)
+
+
+def benchmark_serial_case(
+    *,
+    service: TranslationService,
+    backend: Any,
+    scenario: Dict[str, str],
+    capability: Dict[str, Any],
+    texts: List[str],
+    batch_size: int,
+    warmup_batches: int,
+) -> Dict[str, Any]:
+    backend.batch_size = batch_size
+    measured_batches = list(batched(texts, batch_size))
+    warmup_count = min(max(warmup_batches, 0), len(measured_batches))
+
+    for batch in measured_batches[:warmup_count]:
         service.translate(
-            text=batch,
-            source_lang=args.source_lang,
-            target_lang=args.target_lang,
-            model=args.model,
-            scene=args.scene,
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
         )
 
     batch_latencies_ms: List[float] = []
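The warmup loop above routes every batch through `make_request_payload`, which collapses a one-element batch to a bare string so that `batch_size=1` runs exercise the same single-text request shape online callers use. A standalone sketch of that helper:

```python
from typing import List, Sequence, Union


def make_request_payload(batch: Sequence[str]) -> Union[str, List[str]]:
    # One item -> bare string (single-request path); otherwise a plain list.
    if len(batch) == 1:
        return batch[0]
    return list(batch)


assert make_request_payload(["你好"]) == "你好"
assert make_request_payload(("a", "b")) == ["a", "b"]
```

This is also why the measurement loop has to accept either a single string or a list back from `service.translate` and normalize it into `result_items`.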
@@ -193,86 +305,318 @@ def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
     failure_count = 0
     output_chars = 0
     total_input_chars = sum(len(text) for text in texts)
-    measured_batches = list(batched(texts, batch_size))
 
     start = time.perf_counter()
     for batch in measured_batches:
         batch_start = time.perf_counter()
         outputs = service.translate(
-            text=batch,
-            source_lang=args.source_lang,
-            target_lang=args.target_lang,
-            model=args.model,
-            scene=args.scene,
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
         )
         elapsed_ms = (time.perf_counter() - batch_start) * 1000
         batch_latencies_ms.append(elapsed_ms)
 
-        if not isinstance(outputs, list):
-            raise RuntimeError(f"Expected list output for batch translation, got {type(outputs)!r}")
-        for item in outputs:
+        if isinstance(outputs, list):
+            result_items = outputs
+        else:
+            result_items = [outputs]
+        for item in result_items:
             if item is None:
                 failure_count += 1
             else:
                 success_count += 1
                 output_chars += len(item)
     translate_seconds = time.perf_counter() - start
+    total_items = len(texts)
+    memory = build_memory_metrics()
 
-    peak_gpu_mem_gb = None
-    peak_gpu_reserved_gb = None
-    if torch.cuda.is_available():
-        peak_gpu_mem_gb = round(torch.cuda.max_memory_allocated() / (1024 ** 3), 3)
-        peak_gpu_reserved_gb = round(torch.cuda.max_memory_reserved() / (1024 ** 3), 3)
+    return {
+        "mode": "serial_batch",
+        "batch_size": batch_size,
+        "concurrency": 1,
+        "rows": total_items,
+        "requests": len(measured_batches),
+        "input_chars": total_input_chars,
+        "load_seconds": 0.0,
+        "translate_seconds": round(translate_seconds, 4),
+        "total_seconds": round(translate_seconds, 4),
+        "batch_count": len(batch_latencies_ms),
+        "request_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
+        "request_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
+        "request_latency_max_ms": round(max(batch_latencies_ms), 2),
+        "avg_request_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
+        "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3),
+        "requests_per_second": round(len(measured_batches) / translate_seconds, 2),
+        "items_per_second": round(total_items / translate_seconds, 2),
+        "input_chars_per_second": round(total_input_chars / translate_seconds, 2),
+        "output_chars_per_second": round(output_chars / translate_seconds, 2),
+        "success_count": success_count,
+        "failure_count": failure_count,
+        "success_rate": round(success_count / total_items, 6),
+        "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
+        "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
+        "configured_batch_size": int(capability.get("batch_size") or batch_size),
+        "used_batch_size": batch_size,
+        "warmup_batches": warmup_count,
+        **memory,
+    }
 
-    max_rss_mb = round(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024, 2)
-    total_items = len(texts)
+
+def benchmark_concurrency_case(
+    *,
+    service: TranslationService,
+    backend: Any,
+    scenario: Dict[str, str],
+    capability: Dict[str, Any],
+    texts: List[str],
+    batch_size: int,
+    concurrency: int,
+    requests_per_case: int,
+    warmup_batches: int,
+) -> Dict[str, Any]:
+    backend.batch_size = batch_size
+    required_items = batch_size * requests_per_case
+    case_texts = texts[:required_items]
+    request_batches = list(batched(case_texts, batch_size))
+    if not request_batches:
+        raise ValueError("No request batches prepared for concurrency benchmark")
+    warmup_count = min(max(warmup_batches, 0), len(request_batches))
+
+    for batch in request_batches[:warmup_count]:
+        service.translate(
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
+        )
+
+    request_latencies_ms: List[float] = []
+    success_count = 0
+    failure_count = 0
+    output_chars = 0
+    total_input_chars = sum(len(text) for text in case_texts)
+
+    def worker(batch: List[str]) -> tuple[float, int, int, int]:
+        started = time.perf_counter()
+        outputs = service.translate(
+            text=make_request_payload(batch),
+            source_lang=scenario["source_lang"],
+            target_lang=scenario["target_lang"],
+            model=scenario["model"],
+            scene=scenario["scene"],
+        )
+        elapsed_ms = (time.perf_counter() - started) * 1000
+        if isinstance(outputs, list):
+            result_items = outputs
+        else:
+            result_items = [outputs]
+        local_success = 0
+        local_failure = 0
+        local_output_chars = 0
+        for item in result_items:
+            if item is None:
+                local_failure += 1
+            else:
+                local_success += 1
+                local_output_chars += len(item)
+        return elapsed_ms, local_success, local_failure, local_output_chars
+
+    wall_start = time.perf_counter()
+    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
+        futures = [executor.submit(worker, batch) for batch in request_batches]
+        for future in concurrent.futures.as_completed(futures):
+            latency_ms, local_success, local_failure, local_output_chars = future.result()
+            request_latencies_ms.append(latency_ms)
+            success_count += local_success
+            failure_count += local_failure
+            output_chars += local_output_chars
+    wall_seconds = time.perf_counter() - wall_start
+    total_items = len(case_texts)
+    memory = build_memory_metrics()
 
     return {
-        "scenario": {
-            "name": f"{args.model} {args.source_lang}->{args.target_lang}",
-            "model": args.model,
-            "source_lang": args.source_lang,
-            "target_lang": args.target_lang,
-            "column": args.column,
-            "scene": args.scene,
+        "mode": "concurrency",
+        "batch_size": batch_size,
+        "concurrency": concurrency,
+        "rows": total_items,
+        "requests": len(request_batches),
+        "input_chars": total_input_chars,
+        "load_seconds": 0.0,
+        "translate_seconds": round(wall_seconds, 4),
+        "total_seconds": round(wall_seconds, 4),
+        "batch_count": len(request_latencies_ms),
+        "request_latency_p50_ms": round(percentile(request_latencies_ms, 0.50), 2),
+        "request_latency_p95_ms": round(percentile(request_latencies_ms, 0.95), 2),
+        "request_latency_max_ms": round(max(request_latencies_ms), 2),
+        "avg_request_latency_ms": round(statistics.fmean(request_latencies_ms), 2),
+        "avg_item_latency_ms": round((wall_seconds / total_items) * 1000, 3),
+        "requests_per_second": round(len(request_batches) / wall_seconds, 2),
+        "items_per_second": round(total_items / wall_seconds, 2),
+        "input_chars_per_second": round(total_input_chars / wall_seconds, 2),
+        "output_chars_per_second": round(output_chars / wall_seconds, 2),
+        "success_count": success_count,
+        "failure_count": failure_count,
+        "success_rate": round(success_count / total_items, 6),
+        "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
+        "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
+        "configured_batch_size": int(capability.get("batch_size") or batch_size),
+        "used_batch_size": batch_size,
+        "warmup_batches": warmup_count,
+        **memory,
+    }
+
+
+def benchmark_single_scenario(args: argparse.Namespace) -> Dict[str, Any]:
+    csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
+    scenario = scenario_from_args(args)
+    config, capability = build_config_and_capability(args)
+    configured_batch_size = int(capability.get("batch_size") or 1)
+    batch_size = configured_batch_size
+    texts = load_texts(csv_path, args.column, args.limit)
+
+    ensure_cuda_stats_reset()
+    load_start = time.perf_counter()
+    service = TranslationService(config)
+    backend = service.get_backend(args.model)
+    load_seconds = time.perf_counter() - load_start
+
+    runtime = benchmark_serial_case(
+        service=service,
+        backend=backend,
+        scenario=scenario,
+        capability=capability,
+        texts=texts,
+        batch_size=batch_size,
+        warmup_batches=args.warmup_batches,
+    )
+    runtime["load_seconds"] = round(load_seconds, 4)
+    runtime["total_seconds"] = round(runtime["load_seconds"] + runtime["translate_seconds"], 4)
+
+    return {
+        "scenario": scenario,
+        "dataset": {
+            "csv_path": str(csv_path),
+            "rows": len(texts),
+            "input_chars": sum(len(text) for text in texts),
         },
+        "runtime": runtime,
+    }
+
+
+def benchmark_extended_scenario(args: argparse.Namespace) -> Dict[str, Any]:
+    csv_path = (PROJECT_ROOT / args.csv_path).resolve() if not Path(args.csv_path).is_absolute() else Path(args.csv_path)
+    scenario = scenario_from_args(args)
+    batch_sizes = parse_csv_ints(args.batch_size_list, DEFAULT_BATCH_SIZES)
+    concurrencies = parse_csv_ints(args.concurrency_list, DEFAULT_CONCURRENCIES)
+    largest_batch = max(batch_sizes + [args.concurrency_batch_size])
+    largest_concurrency = max(concurrencies)
+    max_product = args.max_batch_concurrency_product
+    required_items = max(
+        args.limit or 0,
+        max(args.serial_items_per_case, largest_batch),
+        args.concurrency_requests_per_case * args.concurrency_batch_size,
+        largest_batch * args.concurrency_requests_per_case,
+    )
+    texts = load_texts(csv_path, args.column, required_items)
+    config, capability = build_config_and_capability(args)
+
+    ensure_cuda_stats_reset()
+    load_start = time.perf_counter()
+    service = TranslationService(config)
+    backend = service.get_backend(args.model)
+    load_seconds = time.perf_counter() - load_start
+
+    batch_sweep: List[Dict[str, Any]] = []
+    concurrency_sweep: List[Dict[str, Any]] = []
+    matrix_results: List[Dict[str, Any]] = []
+
+    for batch_size in batch_sizes:
+        case_texts = texts[: max(batch_size, args.serial_items_per_case)]
+        batch_sweep.append(
+            benchmark_serial_case(
+                service=service,
+                backend=backend,
+                scenario=scenario,
+                capability=capability,
+                texts=case_texts,
+                batch_size=batch_size,
+                warmup_batches=args.warmup_batches,
+            )
+        )
+
+    for concurrency in concurrencies:
+        concurrency_sweep.append(
+            benchmark_concurrency_case(
+                service=service,
+                backend=backend,
+                scenario=scenario,
+                capability=capability,
+                texts=texts,
+                batch_size=args.concurrency_batch_size,
+                concurrency=concurrency,
+                requests_per_case=args.concurrency_requests_per_case,
+                warmup_batches=args.warmup_batches,
+            )
+        )
+
+    for batch_size in batch_sizes:
+        for concurrency in concurrencies:
+            if max_product > 0 and batch_size * concurrency > max_product:
+                continue
+            matrix_results.append(
+                benchmark_concurrency_case(
+                    service=service,
+                    backend=backend,
+                    scenario=scenario,
+                    capability=capability,
+                    texts=texts,
+                    batch_size=batch_size,
+                    concurrency=concurrency,
+                    requests_per_case=args.concurrency_requests_per_case,
+                    warmup_batches=args.warmup_batches,
+                )
+            )
+
+    for collection in (batch_sweep, concurrency_sweep, matrix_results):
+        for idx, item in enumerate(collection):
+            item["load_seconds"] = round(load_seconds if idx == 0 else 0.0, 4)
+            item["total_seconds"] = round(item["load_seconds"] + item["translate_seconds"], 4)
+
+    return {
+        "scenario": scenario,
         "dataset": {
             "csv_path": str(csv_path),
-            "rows": total_items,
-            "input_chars": total_input_chars,
+            "rows_loaded": len(texts),
+        },
+        "config": {
+            "batch_sizes": batch_sizes,
+            "concurrencies": concurrencies,
+            "serial_items_per_case": args.serial_items_per_case,
+            "concurrency_requests_per_case": args.concurrency_requests_per_case,
+            "concurrency_batch_size": args.concurrency_batch_size,
+            "max_batch_concurrency_product": max_product,
+            "cache_disabled": bool(args.disable_cache),
         },
-        "runtime": {
+        "runtime_defaults": {
             "device": str(getattr(backend, "device", capability.get("device", "unknown"))),
             "torch_dtype": str(getattr(backend, "torch_dtype", capability.get("torch_dtype", "unknown"))),
-            "configured_batch_size": configured_batch_size,
-            "used_batch_size": batch_size,
-            "warmup_batches": warmup_batches,
+            "configured_batch_size": int(capability.get("batch_size") or 1),
            "load_seconds": round(load_seconds, 4),
-            "translate_seconds": round(translate_seconds, 4),
-            "total_seconds": round(load_seconds + translate_seconds, 4),
-            "batch_count": len(batch_latencies_ms),
-            "first_batch_ms": round(batch_latencies_ms[0], 2),
-            "batch_latency_p50_ms": round(percentile(batch_latencies_ms, 0.50), 2),
-            "batch_latency_p95_ms": round(percentile(batch_latencies_ms, 0.95), 2),
-            "batch_latency_max_ms": round(max(batch_latencies_ms), 2),
-            "avg_batch_latency_ms": round(statistics.fmean(batch_latencies_ms), 2),
-            "avg_item_latency_ms": round((translate_seconds / total_items) * 1000, 3),
-            "items_per_second": round(total_items / translate_seconds, 2),
-            "input_chars_per_second": round(total_input_chars / translate_seconds, 2),
-            "output_chars_per_second": round(output_chars / translate_seconds, 2),
-            "success_count": success_count,
-            "failure_count": failure_count,
-            "success_rate": round(success_count / total_items, 6),
-            "max_rss_mb": max_rss_mb,
-            "peak_gpu_memory_gb": peak_gpu_mem_gb,
-            "peak_gpu_reserved_gb": peak_gpu_reserved_gb,
         },
+        "batch_sweep": batch_sweep,
+        "concurrency_sweep": concurrency_sweep,
+        "matrix": matrix_results,
     }
 
 
 def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
     report = {
         "generated_at": datetime.now().isoformat(timespec="seconds"),
+        "suite": args.suite,
         "environment": build_environment_info(),
         "scenarios": [],
     }
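`benchmark_concurrency_case` takes per-request latency inside each worker and throughput from the wall time around the whole pool. The skeleton of that measurement, with a hypothetical `fake_translate` standing in for `service.translate` (a fixed sleep models the backend serializing work behind its single-model lock):

```python
import concurrent.futures
import statistics
import time


def fake_translate(batch):
    # Stand-in for service.translate; a fixed delay per request.
    time.sleep(0.01)
    return list(batch)


def run_concurrent(batches, concurrency):
    # Per-request latency is measured inside the worker; throughput comes
    # from the wall time around the whole pool, as in the diff.
    def worker(batch):
        started = time.perf_counter()
        fake_translate(batch)
        return (time.perf_counter() - started) * 1000

    latencies_ms = []
    wall_start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(worker, batch) for batch in batches]
        for future in concurrent.futures.as_completed(futures):
            latencies_ms.append(future.result())
    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, round(statistics.fmean(latencies_ms), 2)


wall_seconds, avg_ms = run_concurrent([["x"]] * 8, concurrency=4)
```

With a real backend that holds a per-model lock, raising `concurrency` here mostly stretches the in-worker latencies (queueing) while `wall_seconds` barely improves, which is the p95-inflation effect the commit message describes.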
@@ -296,11 +640,25 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
             scenario["scene"],
             "--warmup-batches",
             str(args.warmup_batches),
+            "--suite",
+            args.suite,
+            "--serial-items-per-case",
+            str(args.serial_items_per_case),
+            "--concurrency-requests-per-case",
+            str(args.concurrency_requests_per_case),
+            "--concurrency-batch-size",
+            str(args.concurrency_batch_size),
+            "--max-batch-concurrency-product",
+            str(args.max_batch_concurrency_product),
         ]
         if args.limit:
             cmd.extend(["--limit", str(args.limit)])
         if args.batch_size:
             cmd.extend(["--batch-size", str(args.batch_size)])
+        if args.batch_size_list:
+            cmd.extend(["--batch-size-list", args.batch_size_list])
+        if args.concurrency_list:
+            cmd.extend(["--concurrency-list", args.concurrency_list])
         if args.device_override:
             cmd.extend(["--device-override", args.device_override])
         if args.torch_dtype_override:
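`benchmark_extended_scenario` reads `--batch-size-list` / `--concurrency-list` through a `parse_csv_ints` helper whose definition falls outside this diff. A plausible minimal sketch, with the fallback-to-defaults behavior inferred from the call sites (an assumption, not the confirmed implementation):

```python
from typing import List, Optional, Sequence


def parse_csv_ints(raw: Optional[str], defaults: Sequence[int]) -> List[int]:
    # Illustrative parser for flags like "--batch-size-list 1,4,8,16":
    # an absent or empty flag falls back to the suite defaults.
    if not raw:
        return list(defaults)
    return [int(part) for part in raw.split(",") if part.strip()]


assert parse_csv_ints(None, [1, 4, 8, 16, 32, 64]) == [1, 4, 8, 16, 32, 64]
assert parse_csv_ints("1,2,4", []) == [1, 2, 4]
```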
@@ -311,6 +669,8 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
             cmd.extend(["--num-beams", str(args.num_beams)])
         if args.attn_implementation:
             cmd.extend(["--attn-implementation", args.attn_implementation])
+        if args.disable_cache:
+            cmd.append("--disable-cache")
 
         completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
         result_line = ""
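After `subprocess.run`, the parent scans the child's stdout for the `JSON_RESULT=` line that `main()` prints in single-scenario mode. A minimal sketch of that extraction (the `extract_json_result` helper name is illustrative; the actual loop in the script may differ):

```python
import json


def extract_json_result(stdout: str) -> dict:
    # Scan from the end so model-loading logs before the marker are skipped.
    for line in reversed(stdout.splitlines()):
        if line.startswith("JSON_RESULT="):
            return json.loads(line[len("JSON_RESULT="):])
    raise RuntimeError("no JSON_RESULT line found in child output")


demo_stdout = 'loading model...\nJSON_RESULT={"items_per_second": 42.5}\n'
result = extract_json_result(demo_stdout)
```

Running each scenario in a child process this way keeps model loads isolated, so `load_seconds` is not polluted by previously loaded backends.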
@@ -327,11 +687,12 @@ def run_all_scenarios(args: argparse.Namespace) -> Dict[str, Any]:
     return report
 
 
-def render_markdown_report(report: Dict[str, Any]) -> str:
+def render_baseline_markdown_report(report: Dict[str, Any]) -> str:
     lines = [
         "# Local Translation Model Benchmark",
         "",
         f"- Generated at: `{report['generated_at']}`",
+        f"- Suite: `{report['suite']}`",
         f"- Python: `{report['environment']['python']}`",
         f"- Torch: `{report['environment']['torch']}`",
         f"- Transformers: `{report['environment']['transformers']}`",
@@ -342,19 +703,19 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
     lines.extend(
         [
             "",
-            "| Scenario | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Load s | Peak GPU GiB | Success |",
+            "| Scenario | Items/s | Avg item ms | Req p50 ms | Req p95 ms | Load s | Peak GPU GiB | Success |",
             "|---|---:|---:|---:|---:|---:|---:|---:|",
         ]
     )
     for item in report["scenarios"]:
         runtime = item["runtime"]
         lines.append(
-            "| {name} | {items_per_second} | {avg_item_latency_ms} | {batch_latency_p50_ms} | {batch_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format(
+            "| {name} | {items_per_second} | {avg_item_latency_ms} | {request_latency_p50_ms} | {request_latency_p95_ms} | {load_seconds} | {peak_gpu_memory_gb} | {success_rate} |".format(
                 name=item["scenario"]["name"],
                 items_per_second=runtime["items_per_second"],
                 avg_item_latency_ms=runtime["avg_item_latency_ms"],
-                batch_latency_p50_ms=runtime["batch_latency_p50_ms"],
-                batch_latency_p95_ms=runtime["batch_latency_p95_ms"],
+                request_latency_p50_ms=runtime["request_latency_p50_ms"],
+                request_latency_p95_ms=runtime["request_latency_p95_ms"],
                 load_seconds=runtime["load_seconds"],
                 peak_gpu_memory_gb=runtime["peak_gpu_memory_gb"],
                 success_rate=runtime["success_rate"],
@@ -375,7 +736,7 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
             f"- Load time: `{runtime['load_seconds']} s`",
             f"- Translate time: `{runtime['translate_seconds']} s`",
             f"- Throughput: `{runtime['items_per_second']} items/s`, `{runtime['input_chars_per_second']} input chars/s`",
-            f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, batch p50 `{runtime['batch_latency_p50_ms']} ms`, batch p95 `{runtime['batch_latency_p95_ms']} ms`, batch max `{runtime['batch_latency_max_ms']} ms`",
+            f"- Latency: avg item `{runtime['avg_item_latency_ms']} ms`, req p50 `{runtime['request_latency_p50_ms']} ms`, req p95 `{runtime['request_latency_p95_ms']} ms`, req max `{runtime['request_latency_max_ms']} ms`",
             f"- Memory: max RSS `{runtime['max_rss_mb']} MB`, peak GPU allocated `{runtime['peak_gpu_memory_gb']} GiB`, peak GPU reserved `{runtime['peak_gpu_reserved_gb']} GiB`",
             f"- Success: `{runtime['success_count']}/{dataset['rows']}`",
             "",
@@ -384,32 +745,145 @@ def render_markdown_report(report: Dict[str, Any]) -> str:
     return "\n".join(lines)
 
 
+def render_case_table(
+    title: str,
+    rows: Sequence[Dict[str, Any]],
+    *,
+    include_batch: bool,
+    include_concurrency: bool,
+) -> List[str]:
+    headers = ["Rows", "Requests", "Items/s", "Req/s", "Avg req ms", "Req p50 ms", "Req p95 ms", "Peak GPU GiB"]
+    prefix_headers: List[str] = []
+    if include_batch:
+        prefix_headers.append("Batch")
+    if include_concurrency:
+        prefix_headers.append("Concurrency")
+    headers = prefix_headers + headers
+    lines = [f"### {title}", ""]
+    lines.append("| " + " | ".join(headers) + " |")
+    lines.append("|" + "|".join(["---:"] * len(headers)) + "|")
+    for item in rows:
+        values: List[str] = []
+        if include_batch:
+            values.append(str(item["batch_size"]))
+        if include_concurrency:
+            values.append(str(item["concurrency"]))
+        values.extend(
+            [
+                str(item["rows"]),
+                str(item["requests"]),
+                str(item["items_per_second"]),
+                str(item["requests_per_second"]),
+                str(item["avg_request_latency_ms"]),
+                str(item["request_latency_p50_ms"]),
+                str(item["request_latency_p95_ms"]),
+                str(item["peak_gpu_memory_gb"]),
+            ]
+        )
+        lines.append("| " + " | ".join(values) + " |")
+    lines.append("")
+    return lines
+
+
+def render_extended_markdown_report(report: Dict[str, Any]) -> str:
+    lines = [
+        "# Local Translation Model Extended Benchmark",
+        "",
+        f"- Generated at: `{report['generated_at']}`",
+        f"- Suite: `{report['suite']}`",
+        f"- Python: `{report['environment']['python']}`",
+        f"- Torch: `{report['environment']['torch']}`",
+        f"- Transformers: `{report['environment']['transformers']}`",
+        f"- CUDA: `{report['environment']['cuda_available']}`",
+    ]
+    if report["environment"]["gpu_name"]:
+        lines.append(f"- GPU: `{report['environment']['gpu_name']}` ({report['environment']['gpu_total_mem_gb']} GiB)")
+
+    lines.extend(
+        [
+            "",
+            "## Reading Guide",
+            "",
+            "- `batch_sweep`: single stream only (`concurrency=1`), used to compare bulk translation efficiency across batch sizes.",
+            "- `concurrency_sweep`: fixed request batch size, used to compare online request latency and throughput as concurrency rises.",
+            "- `matrix`: combined `batch_size x concurrency` runs, filtered by `batch_size * concurrency <= limit` when configured.",
+            "",
+        ]
+    )
+
+    for item in report["scenarios"]:
+        lines.extend(
+            [
+                f"## {item['scenario']['name']}",
+                "",
+                f"- Direction: `{item['scenario']['source_lang']} -> {item['scenario']['target_lang']}`",
+                f"- Column: `{item['scenario']['column']}`",
+                f"- Loaded rows: `{item['dataset']['rows_loaded']}`",
+                f"- Load time: `{item['runtime_defaults']['load_seconds']} s`",
+                f"- Device: `{item['runtime_defaults']['device']}`",
+                f"- DType: `{item['runtime_defaults']['torch_dtype']}`",
+                f"- Cache disabled: `{item['config']['cache_disabled']}`",
+                "",
+            ]
+        )
+        lines.extend(render_case_table("Batch Sweep (`concurrency=1`)", item["batch_sweep"], include_batch=True, include_concurrency=False))
+        lines.extend(
+            render_case_table(
+                f"Concurrency Sweep (`batch_size={item['config']['concurrency_batch_size']}`)",
+                item["concurrency_sweep"],
+                include_batch=False,
+                include_concurrency=True,
+            )
+        )
+        lines.extend(render_case_table("Batch x Concurrency Matrix", item["matrix"], include_batch=True, include_concurrency=True))
+    return "\n".join(lines)
+
+
+def render_markdown_report(report: Dict[str, Any]) -> str:
+    if report["suite"] == "extended":
+        return render_extended_markdown_report(report)
+    return render_baseline_markdown_report(report)
+
+
 def main() -> None:
     args = parse_args()
     if args.single:
-        result = benchmark_single_scenario(args)
+        if args.suite == "extended":
+            result = benchmark_extended_scenario(args)
+        else:
+            result = benchmark_single_scenario(args)
         print("JSON_RESULT=" + json.dumps(result, ensure_ascii=False))
         return
 
     report = run_all_scenarios(args)
     output_dir = resolve_output_dir(args.output_dir)
     timestamp = datetime.now().strftime("%H%M%S")
-    json_path = output_dir / f"translation_local_models_{timestamp}.json"
-    md_path = output_dir / f"translation_local_models_{timestamp}.md"
+    suffix = "extended" if args.suite == "extended" else "baseline"
+    json_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.json"
+    md_path = output_dir / f"translation_local_models_{suffix}_{timestamp}.md"
     json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
     md_path.write_text(render_markdown_report(report), encoding="utf-8")
 
     print(f"JSON report: {json_path}")
     print(f"Markdown report: {md_path}")
     for item in report["scenarios"]:
-        runtime = item["runtime"]
-        print(
-            f"{item['scenario']['name']}: "
-            f"{runtime['items_per_second']} items/s | "
-            f"avg_item={runtime['avg_item_latency_ms']} ms | "
-            f"p95_batch={runtime['batch_latency_p95_ms']} ms | "
-            f"load={runtime['load_seconds']} s"
-        )
+        if args.suite == "extended":
+            best_batch = max(item["batch_sweep"], key=lambda x: x["items_per_second"])
+            best_concurrency = max(item["concurrency_sweep"], key=lambda x: x["items_per_second"])
+            print(
+                f"{item['scenario']['name']}: "
+                f"best_batch={best_batch['batch_size']} ({best_batch['items_per_second']} items/s) | "
+                f"best_concurrency={best_concurrency['concurrency']} ({best_concurrency['items_per_second']} items/s @ batch={best_concurrency['batch_size']})"
  877 + )
  878 + else:
  879 + runtime = item["runtime"]
  880 + print(
  881 + f"{item['scenario']['name']}: "
  882 + f"{runtime['items_per_second']} items/s | "
  883 + f"avg_item={runtime['avg_item_latency_ms']} ms | "
  884 + f"p95_req={runtime['request_latency_p95_ms']} ms | "
  885 + f"load={runtime['load_seconds']} s"
  886 + )
413 887
414 888
415 if __name__ == "__main__": 889 if __name__ == "__main__":
translation/README.md
@@ -13,7 +13,7 @@
 - Virtual env setup: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh)
 - Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py)
 - Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
-- Performance report: [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md)
+- Performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
 
 ## 1. Design Goals
 
@@ -530,6 +530,44 @@ curl -X POST http://127.0.0.1:6006/translate \
 Dataset:
 - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
 
+Latest report:
+- Summary: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
+- Full Markdown: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
+- Full JSON: [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
+
+### 10.1 Which results to read first
+
+The three groups of results are now reported separately instead of mixed into one table:
+
+- `batch_sweep`
+  Fixes `concurrency=1` and compares single-stream batching performance across `batch_size` values only
+- `concurrency_sweep`
+  Fixes `batch_size=1` and shows the latency and throughput of a "single request" at different concurrency levels
+- `batch x concurrency matrix`
+  Shows the interaction between `batch_size` and `concurrency`; this round is capped at `batch_size * concurrency <= 128`
+
+Recommendations:
+
+- For online query-translation latency, read `concurrency_sweep` first
+- For offline bulk-translation throughput, read `batch_sweep` first
+- For the capacity limit of a single serving worker, then look at the `batch x concurrency matrix`
+
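For reference, the matrix cap described above (`batch_size * concurrency <= 128`) amounts to a product-and-filter over the two sweep lists. The function below is an illustrative sketch, not the benchmark script's actual code; all names are made up for the example:

```python
from itertools import product

# Sweep lists from this round (illustrative constants, not the script's internals).
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 64]
LIMIT = 128  # batch_size * concurrency must not exceed this

def matrix_cases(batch_sizes, concurrency_levels, limit):
    """Yield every (batch_size, concurrency) pair kept by the cap."""
    for bs, cc in product(batch_sizes, concurrency_levels):
        if bs * cc <= limit:
            yield bs, cc

cases = list(matrix_cases(BATCH_SIZES, CONCURRENCY_LEVELS, LIMIT))
# (64, 2) survives because 64 * 2 = 128, while (64, 4) is dropped (256 > 128)
```

With these lists the cap keeps 25 of the 36 raw combinations, which is why the matrix tables are noticeably shorter than a full cross product.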
+### 10.2 Parameters for this round
+
+Test date: `2026-03-18`
+
+Environment:
+- GPU: `Tesla T4 16GB`
+- Python env: `.venv-translator`
+- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
+
+Shared parameters:
+- Cache: disabled (`--disable-cache`) so cache hits cannot distort the numbers
+- `batch_sweep`: `256` items per case
+- `concurrency_sweep`: fixed `batch_size=1`, `32` requests per case
+- `batch x concurrency matrix`: `32` requests per case, keeping only combinations with `batch_size * concurrency <= 128`
+- Warmup: `1` batch
+
 Reproduction command:
 
 ```bash
@@ -537,16 +575,36 @@ cd /data/saas-search
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py
 ```
 
-Single-model reproduction example:
+Reproduction command for this round's extended benchmark:
+
+```bash
+cd /data/saas-search
+./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
+  --suite extended \
+  --disable-cache \
+  --serial-items-per-case 256 \
+  --concurrency-requests-per-case 32 \
+  --concurrency-batch-size 1 \
+  --output-dir perf_reports/20260318/translation_local_models
+```
+
+Single-model extended benchmark example:
 
 ```bash
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
   --single \
+  --suite extended \
   --model opus-mt-zh-en \
   --source-lang zh \
   --target-lang en \
   --column title_cn \
-  --scene sku_name
+  --scene sku_name \
+  --disable-cache \
+  --batch-size-list 1,4,8,16,32,64 \
+  --concurrency-list 1,2,4,8,16,64 \
+  --serial-items-per-case 256 \
+  --concurrency-requests-per-case 32 \
+  --concurrency-batch-size 1
 ```
 
 Single-request latency reproduction:
@@ -554,37 +612,143 @@ cd /data/saas-search
 ```bash
 ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
   --single \
+  --suite extended \
   --model nllb-200-distilled-600m \
   --source-lang zh \
   --target-lang en \
   --column title_cn \
   --scene sku_name \
-  --batch-size 1 \
-  --limit 100
+  --disable-cache \
+  --batch-size-list 1 \
+  --concurrency-list 1,2,4,8,16,64 \
+  --serial-items-per-case 256 \
+  --concurrency-requests-per-case 32 \
+  --concurrency-batch-size 1
 ```
 
-Notes:
-- For the current script and local backend, a "single request" is directly equivalent to `batch_size=1`
-- In that case the script's `batch_latency_*` metrics can be read as single-request latency
-- Online search query translation should focus on this group, not on large-batch throughput
+### 10.3 Single-stream batch results
+
+This group is `concurrency=1` only; do not read its `request p95` as the p95 of concurrent online requests.
+
+`nllb-200-distilled-600m zh -> en`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 2.91 | 343.488 | 616.27 |
+| 4 | 8.44 | 118.545 | 722.95 |
+| 8 | 14.85 | 67.335 | 728.47 |
+| 16 | 27.28 | 36.662 | 769.18 |
+| 32 | 38.6 | 25.908 | 1369.88 |
+| 64 | 58.3 | 17.152 | 1659.9 |
+
+`nllb-200-distilled-600m en -> zh`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 1.91 | 524.917 | 866.33 |
+| 4 | 4.94 | 202.473 | 1599.74 |
+| 8 | 8.25 | 121.188 | 1632.29 |
+| 16 | 13.52 | 73.956 | 1649.65 |
+| 32 | 21.27 | 47.017 | 1827.16 |
+| 64 | 32.64 | 30.641 | 2031.25 |
 
-Current single-request measurements (`Tesla T4`, `limit=100`):
-- `nllb-200-distilled-600m zh->en`: p50 ≈ `292.54 ms`, p95 ≈ `624.12 ms`, mean ≈ `321.91 ms`
-- `nllb-200-distilled-600m en->zh`: p50 ≈ `481.61 ms`, p95 ≈ `1171.71 ms`, mean ≈ `542.47 ms`
-
-Current benchmark environment:
-- GPU: `Tesla T4 16GB`
-- Python env: `.venv-translator`
-- Data size: `18,576` product titles
+`opus-mt-zh-en zh -> en`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 6.15 | 162.536 | 274.74 |
+| 4 | 15.34 | 65.192 | 356.0 |
+| 8 | 25.51 | 39.202 | 379.84 |
+| 16 | 41.44 | 24.129 | 797.93 |
+| 32 | 54.36 | 18.397 | 1693.31 |
+| 64 | 70.15 | 14.255 | 2161.59 |
+
+`opus-mt-en-zh en -> zh`
+
+| Batch | Items/s | Avg item ms | Req p95 ms |
+|---:|---:|---:|---:|
+| 1 | 4.53 | 220.598 | 411.57 |
+| 4 | 10.12 | 98.844 | 761.49 |
+| 8 | 14.63 | 68.361 | 1930.85 |
+| 16 | 24.33 | 41.1 | 2098.54 |
+| 32 | 33.91 | 29.487 | 2152.28 |
+| 64 | 42.47 | 23.547 | 2371.85 |
+
+Batch conclusions:
+
+- On raw throughput alone, all 4 directions peak at `batch_size=64`
+- If per-batch tail latency also matters, `batch_size=16` is usually the more balanced default
+- `opus-mt-zh-en` is the fastest model for bulk workloads this round; `nllb en->zh` is the slowest
+
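As a reading aid, the three columns in these tables are tied together by simple identities: `items/s` is items over wall-clock seconds, and for a serial sweep `avg item ms` is exactly `1000 / items_per_second`. The helper below is a hedged sketch of that aggregation using a nearest-rank p95; the benchmark script's exact math may differ:

```python
import math

def summarize_batches(batch_latencies_s, batch_size):
    """Collapse per-request (per-batch) wall times into the table's columns.

    batch_latencies_s: wall-clock seconds observed for each batch in one case.
    """
    total_items = batch_size * len(batch_latencies_s)
    total_s = sum(batch_latencies_s)
    ranked = sorted(batch_latencies_s)
    p95_idx = max(0, math.ceil(0.95 * len(ranked)) - 1)  # nearest-rank p95
    return {
        "items_per_second": round(total_items / total_s, 2),
        "avg_item_ms": round(1000.0 * total_s / total_items, 3),
        "req_p95_ms": round(1000.0 * ranked[p95_idx], 2),
    }

# 20 requests of batch_size=16: nineteen at 500 ms plus one slow 1 s outlier
case = summarize_batches([0.5] * 19 + [1.0], 16)
```

Because `avg item ms` is the reciprocal of `items/s`, those two columns always move together; the independent signal in each row is the `Req p95 ms` tail.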
+### 10.4 Single-request concurrency results
+
+This group fixes `batch_size=1`, so it can be read directly as "how a single request behaves at different concurrency levels".
+
+`nllb-200-distilled-600m zh -> en`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 4.17 | 239.99 | 226.34 | 373.27 |
+| 2 | 4.1 | 477.99 | 459.36 | 703.96 |
+| 4 | 4.1 | 910.74 | 884.71 | 1227.01 |
+| 8 | 4.04 | 1697.73 | 1818.48 | 2383.8 |
+| 16 | 4.07 | 2801.91 | 3473.63 | 4145.92 |
+| 64 | 4.04 | 3714.49 | 3610.08 | 7337.3 |
+
+`nllb-200-distilled-600m en -> zh`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 2.16 | 463.18 | 439.54 | 670.78 |
+| 2 | 2.15 | 920.48 | 908.27 | 1213.3 |
+| 4 | 2.16 | 1759.87 | 1771.58 | 2158.04 |
+| 8 | 2.15 | 3284.44 | 3658.45 | 3971.01 |
+| 16 | 2.14 | 5669.15 | 7117.7 | 7522.48 |
+| 64 | 2.14 | 7631.14 | 7510.97 | 14139.03 |
+
+`opus-mt-zh-en zh -> en`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 9.21 | 108.53 | 91.7 | 179.12 |
+| 2 | 8.92 | 219.19 | 212.29 | 305.34 |
+| 4 | 9.09 | 411.76 | 420.08 | 583.97 |
+| 8 | 8.85 | 784.14 | 835.73 | 1043.06 |
+| 16 | 9.01 | 1278.4 | 1483.34 | 1994.56 |
+| 64 | 8.82 | 1687.08 | 1563.48 | 3381.58 |
+
+`opus-mt-en-zh en -> zh`
+
+| Concurrency | Items/s | Avg req ms | Req p50 ms | Req p95 ms |
+|---:|---:|---:|---:|---:|
+| 1 | 3.6 | 277.73 | 145.85 | 1180.37 |
+| 2 | 3.55 | 559.38 | 346.71 | 1916.96 |
+| 4 | 3.53 | 997.71 | 721.04 | 2944.17 |
+| 8 | 3.51 | 1644.28 | 1590.93 | 3632.99 |
+| 16 | 3.5 | 2600.18 | 2586.34 | 5554.04 |
+| 64 | 3.52 | 3366.52 | 2780.0 | 7950.41 |
 
-Final performance results:
+Concurrency conclusions:
+
+- The current local seq2seq backend serializes inference behind a single per-model lock, so raising client concurrency on one worker barely increases throughput; it mainly converts queueing time into higher request latency
+- If online query translation needs stable latency, keep concurrency low; beyond `8` concurrent requests the p95 of all 4 directions degrades sharply
+- For online traffic, `opus-mt-zh-en` has the most stable latency; `nllb en->zh` is the slowest and shows the strongest tail-latency amplification under concurrency
+
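The flat-throughput / rising-latency pattern in these tables can be reproduced with a toy model of the backend. This is a sketch under the assumption that inference is serialized behind one lock; the names and the 20 ms `sleep` stand-in are illustrative, not the real backend:

```python
import threading
import time

_model_lock = threading.Lock()  # stands in for the backend's single-model lock

def translate_once():
    """One batch_size=1 'inference'; returns the caller-observed latency in seconds."""
    start = time.perf_counter()
    with _model_lock:
        time.sleep(0.02)  # stand-in for ~20 ms of GPU work
    return time.perf_counter() - start

def run(concurrency, total_requests=8):
    """Return (items_per_second, avg_request_latency_s) for one concurrency level."""
    latencies = []
    record = threading.Lock()

    def worker(n):
        for _ in range(n):
            lat = translate_once()
            with record:
                latencies.append(lat)

    t0 = time.perf_counter()
    threads = [
        threading.Thread(target=worker, args=(total_requests // concurrency,))
        for _ in range(concurrency)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - t0
    return total_requests / wall, sum(latencies) / len(latencies)
```

Comparing `run(1)` and `run(4)` shows roughly the same items/s but a clearly higher average request latency at concurrency 4, which is exactly the shape of the tables above: the lock turns extra concurrency into queueing, not throughput.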
+### 10.5 How to read batch x concurrency
+
+The full matrix is in:
+- [`perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
+
+The matrix answers two questions:
 
-| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
-|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
-| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
-| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
-| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
-| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
+- For offline batch jobs: once `batch_size` is already large, does throughput keep growing at different concurrency levels
+- For serving requests from a single worker: at which `batch_size x concurrency` combination does queueing become obvious
+
+Shared traits of this round's matrix:
+
+- Throughput is driven almost entirely by `batch_size`; `concurrency` is not a meaningful source of gain
+- With `batch_size` fixed, raising `concurrency` from `1` to `2/4/8/...` barely changes `items/s`, while `avg req ms / p95` keeps climbing
+- So the current implementation behaves like a single worker with internally serialized GPU inference, not a service whose throughput scales with client concurrency
 
 NLLB performance-optimization notes:
 
@@ -632,7 +796,7 @@ NLLB performance-optimization notes:
 - Run mode: single worker, to avoid loading the model twice
 
 More detailed performance notes:
-- [`perf_reports/20260317/translation_local_models/README.md`](/data/saas-search/perf_reports/20260317/translation_local_models/README.md)
+- [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
 
 ## 11. Development Notes
 