# Local Translation Model Focused T4 Tuning

Benchmark script:
- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)

Focused results from this round:
- Markdown: [`translation_local_models_focus_235018.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/translation_local_models_focus_235018.md)
- JSON: [`translation_local_models_focus_235018.json`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/translation_local_models_focus_235018.json)

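Notes:
- This report records the conclusions of the first round of focused T4 tuning.
- For `nllb-200-distilled-600M`, the latest recommendation is now covered by a dedicated report:
  [`../nllb_t4_product_names_ct2/README.md`](/data/saas-search/perf_reports/20260318/nllb_t4_product_names_ct2/README.md)
- The NLLB conclusion on this page, `ct2_inter_threads=2 + ct2_max_queued_batches=16`, should be treated as superseded by that report.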

Related reports:
- Extended baseline report: [`../translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
- Extended CT2 report: [`../translation_local_models_ct2/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/README.md)
- CT2 vs. HF comparison: [`../translation_local_models_ct2/comparison_vs_hf_baseline.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/comparison_vs_hf_baseline.md)

Test date:
- `2026-03-18`

Environment:
- GPU: `Tesla T4 16GB`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- CTranslate2: `4.7.1`

## Scope

This round does not run the full matrix. It covers only two target scenarios:

- `high batch + low concurrency`
  - `batch=32/64/128`
  - `concurrency=1`
- `high concurrency + low batch`
  - `batch=1`
  - `concurrency=8/16/32/64`
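
As a rough illustration of how small the focused grid is, the sketch below enumerates the two scenario groups listed above. The names `Scenario` and `focused_scenarios` are hypothetical and are not taken from the benchmark script.

```python
from typing import List, NamedTuple


class Scenario(NamedTuple):
    """One (batch, concurrency) point of the focused grid."""
    name: str
    batch: int
    concurrency: int


def focused_scenarios() -> List[Scenario]:
    # Bulk side: large batches, a single concurrent caller.
    bulk = [Scenario("high batch + low concurrency", b, 1) for b in (32, 64, 128)]
    # Online side: single-item requests, many concurrent callers.
    online = [Scenario("high concurrency + low batch", 1, c) for c in (8, 16, 32, 64)]
    return bulk + online


if __name__ == "__main__":
    for s in focused_scenarios():
        print(f"{s.name}: batch={s.batch}, concurrency={s.concurrency}")
```

This gives seven points per model, translation direction, and CT2 variant.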

Two CT2 variants are compared:

- `ct2_default`
  - Current default: `ct2_inter_threads=1`, `ct2_max_queued_batches=0`, `ct2_batch_type=examples`
- `ct2_tuned_t4`
  - Tuning candidate: `ct2_inter_threads=2`, `ct2_max_queued_batches=16`, `ct2_batch_type=examples`
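
The two variants can be written down as data. The sketch below is a minimal illustration that assumes the project's `ct2_*` keys map one-to-one onto CTranslate2's `inter_threads` and `max_queued_batches` constructor options and the `batch_type` option of `translate_batch`. The `Ct2Profile` class and its method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Ct2Profile:
    """Hypothetical container for the three knobs compared in this report."""
    ct2_inter_threads: int
    ct2_max_queued_batches: int
    ct2_batch_type: str = "examples"

    def translator_kwargs(self) -> Dict[str, Any]:
        # Assumed mapping to ctranslate2.Translator(...) keyword arguments.
        return {
            "inter_threads": self.ct2_inter_threads,
            "max_queued_batches": self.ct2_max_queued_batches,
        }

    def translate_batch_kwargs(self) -> Dict[str, Any]:
        # Assumed mapping to Translator.translate_batch(...) keyword arguments.
        return {"batch_type": self.ct2_batch_type}


CT2_DEFAULT = Ct2Profile(ct2_inter_threads=1, ct2_max_queued_batches=0)
CT2_TUNED_T4 = Ct2Profile(ct2_inter_threads=2, ct2_max_queued_batches=16)

# With CTranslate2 installed, the profile would be applied roughly like this
# (model_dir is a placeholder path):
#   translator = ctranslate2.Translator(model_dir, device="cuda",
#                                       **CT2_TUNED_T4.translator_kwargs())
```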

## Recommendation

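Conclusions first:

- **For NLLB, upgrading to `ct2_inter_threads=2 + ct2_max_queued_batches=16` is recommended.**
- `opus-mt-zh-en` is more stable with the defaults.
- `opus-mt-en-zh` gains throughput at large batch sizes and at high concurrency, but its online `c=8` `p95` fluctuates, so the same tuned parameters are not recommended as its production default.

This is why the current configuration applies the tuned profile only to NLLB, while the two Marian models keep the conservative defaults.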

## Key Results

### 1. NLLB is the model most worth tuning in this round

`nllb-200-distilled-600m zh -> en`

| Scenario | Default | Tuned | Result |
|---|---:|---:|---|
| `batch=64, concurrency=1` items/s | `113.25` | `111.86` | Roughly flat |
| `batch=64, concurrency=1` p95 ms | `662.38` | `657.84` | Roughly flat |
| `batch=1, concurrency=16` items/s | `10.34` | `12.91` | Clear improvement |
| `batch=1, concurrency=16` p95 ms | `1904.9` | `1368.92` | Clear reduction |
| `batch=1, concurrency=32` items/s | `10.17` | `12.8` | Clear improvement |
| `batch=1, concurrency=32` p95 ms | `2876.88` | `2350.5` | Clear reduction |

`nllb-200-distilled-600m en -> zh`

| Scenario | Default | Tuned | Result |
|---|---:|---:|---|
| `batch=64, concurrency=1` items/s | `96.27` | `93.36` | Slight drop |
| `batch=64, concurrency=1` p95 ms | `701.75` | `721.79` | Slightly worse |
| `batch=1, concurrency=16` items/s | `5.51` | `7.91` | Clear improvement |
| `batch=1, concurrency=16` p95 ms | `4613.05` | `2039.17` | Large reduction |
| `batch=1, concurrency=32` items/s | `5.46` | `7.9` | Clear improvement |
| `batch=1, concurrency=32` p95 ms | `5554.4` | `3912.75` | Clear reduction |

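Interpretation:
- The tuned profile for NLLB mainly unlocks the T4's concurrency headroom.
- Bulk scenarios are barely affected; `zh -> en` in particular is roughly flat.
- The online scenarios gain the most: at `batch=1, concurrency=16`, throughput rises by roughly 25% (`zh -> en`) to 44% (`en -> zh`) and p95 falls by roughly 28% to 56%. NLLB is therefore where this round of tuning should land first.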
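
The relative changes behind that reading can be reproduced from the table values. The sketch below uses only the `batch=1, concurrency=16` rows from the two NLLB tables; the helper name `pct_change` is illustrative.

```python
def pct_change(default: float, tuned: float) -> float:
    """Relative change of tuned vs. default, in percent."""
    return (tuned - default) / default * 100.0


# (label, default, tuned) copied from the batch=1, concurrency=16 rows above.
ROWS = [
    ("zh -> en items/s", 10.34, 12.91),
    ("zh -> en p95 ms", 1904.9, 1368.92),
    ("en -> zh items/s", 5.51, 7.91),
    ("en -> zh p95 ms", 4613.05, 2039.17),
]

if __name__ == "__main__":
    for label, default, tuned in ROWS:
        print(f"{label}: {pct_change(default, tuned):+.1f}%")
```

For throughput rows a positive value is an improvement; for p95 rows a negative value is an improvement.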

### 2. Marian should not share NLLB's tuned parameters

`opus-mt-zh-en zh -> en`

- `batch=64, concurrency=1`: `164.1 -> 151.21 items/s`, default is better
- `batch=1, concurrency=32`: `27.5 -> 29.83 items/s`, tuned is slightly better
- `batch=1, concurrency=64`: `28.43 -> 26.85 items/s`, default is better

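Conclusion:
- This model is already lightweight, and the default profile is better balanced.
- A small gain at mid-range concurrency is not worth sacrificing large-batch throughput or high-concurrency stability.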

`opus-mt-en-zh en -> zh`

- `batch=64, concurrency=1`: `114.34 -> 121.87 items/s`
- `batch=128, concurrency=1`: `162.29 -> 210.29 items/s`
- `batch=1, concurrency=16`: `11.22 -> 12.65 items/s`
- `batch=1, concurrency=8`: `p95` rises from `798.77` to `1199.98` ms

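Conclusion:
- This model responds more strongly to the tuned profile, and throughput improves noticeably.
- However, the online `c=8` `p95` gets worse, which suggests the tuned profile behaves more like a dedicated throughput configuration and is not suitable as a unified production default.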

## T4 Experience Summary

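Lessons from this round that are worth keeping:

- **Lesson 1: Stop using the full matrix to find a direction.**
  - Looking only at the two extremes, `high batch + low concurrency` and `high concurrency + low batch`, is more efficient.

- **Lesson 2: NLLB on T4 does benefit from `inter_threads` and queue depth.**
  - `ct2_inter_threads=2`
  - `ct2_max_queued_batches=16`
  - This combination helps most in the high-concurrency `batch=1` online scenario.

- **Lesson 3: `inter_threads=4` is too aggressive.**
  - It can push some high-concurrency throughput further up.
  - But it severely hurts large-batch throughput, especially in bulk scenarios such as `batch=64`.
  - It is therefore unsuitable as a general service default.

- **Lesson 4: `ct2_batch_type=tokens` is not the main source of gains on the current T4.**
  - It brought no stable gain in the `batch=1` online scenario.
  - Keeping `examples` is the safer choice for the current project.

- **Lesson 5: One worker per model remains the default deployment mode.**
  - This round of tuning addresses GPU utilization within a single worker.
  - It does not raise throughput by adding more FastAPI workers.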

## Deployment / Config Tasks Worth Keeping

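The following tasks proved worth capturing in documentation and configuration:

- Migrate the local Marian / NLLB models to CTranslate2
- Convert with `float16` and pre-generate the CT2 model directories
- Keep a single worker to avoid loading the model more than once
- For NLLB, enable:
  - `ct2_inter_threads=2`
  - `ct2_max_queued_batches=16`
  - `ct2_batch_type=examples`
- For Marian, keep the conservative defaults:
  - `ct2_inter_threads=1`
  - `ct2_max_queued_batches=0`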
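
One way to express "tuned for NLLB, defaults for Marian" is a default block plus per-model overrides. The sketch below is an assumption about how such a lookup could look; the dictionary layout and the function `resolve_ct2_config` are hypothetical and do not describe the project's actual config format.

```python
from typing import Any, Dict

# Conservative defaults, as kept for the two Marian models.
DEFAULT_CT2: Dict[str, Any] = {
    "ct2_inter_threads": 1,
    "ct2_max_queued_batches": 0,
    "ct2_batch_type": "examples",
}

# Per-model overrides; only NLLB receives the tuned T4 values.
MODEL_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "nllb-200-distilled-600m": {
        "ct2_inter_threads": 2,
        "ct2_max_queued_batches": 16,
    },
}


def resolve_ct2_config(model_name: str) -> Dict[str, Any]:
    """Merge the defaults with any override registered for the model."""
    return {**DEFAULT_CT2, **MODEL_OVERRIDES.get(model_name, {})}


if __name__ == "__main__":
    for name in ("nllb-200-distilled-600m", "opus-mt-zh-en", "opus-mt-en-zh"):
        print(name, resolve_ct2_config(name))
```

Keeping the overrides separate from the defaults means a model with no entry falls back to the conservative values without extra code.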

## Next Step

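If the next round continues to push down online latency, the suggested priority order is:

1. Service-level micro-batching queue
2. Bucketing short and long texts
3. Two configurations for `opus-mt-en-zh`: an "online default" and an "offline high-throughput" profile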