

Local Translation Model Focused T4 Tuning
Test script:


<code>benchmarks/translation/benchmark_translation_local_models_focus.py</code>


Focused results for this round:


Markdown: <code>translation_local_models_focus_235018.md</code>
JSON: <code>translation_local_models_focus_235018.json</code>


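Notes:


This report records the conclusions of the first round of focused T4 tuning.
For nllb-200-distilled-600M, the latest recommendation is now covered by a dedicated report:
<code>../nllb_t4_product_names_ct2/README.md</code>
The NLLB conclusion on this page (ct2_inter_threads=2 + ct2_max_queued_batches=16) should be treated as superseded by that report.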


Related reports:


Extended baseline report: <code>../translation_local_models/README.md</code>
Extended CT2 report: <code>../translation_local_models_ct2/README.md</code>
CT2 vs. HF comparison: <code>../translation_local_models_ct2/comparison_vs_hf_baseline.md</code>


Test date:


2026-03-18


Environment:


GPU: Tesla T4 16GB
Python env: .venv-translator
Torch / Transformers: 2.10.0+cu128 / 5.3.0
CTranslate2: 4.7.1

Scope
This round skips the full matrix and looks at only two target scenarios:


high batch + low concurrency


batch=32/64/128
concurrency=1

high concurrency + low batch


batch=1
concurrency=8/16/32/64
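
The two scenario groups above can be written down as a small grid. This is only a minimal sketch: the names `Scenario` and `focused_scenarios` are hypothetical and are not taken from the benchmark script.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    group: str        # which of the two target groups this point belongs to
    batch_size: int   # items per request
    concurrency: int  # simultaneous in-flight requests


def focused_scenarios() -> List[Scenario]:
    """Return only the two extremes instead of a full batch x concurrency matrix."""
    bulk = [Scenario("high_batch_low_concurrency", b, 1) for b in (32, 64, 128)]
    online = [Scenario("high_concurrency_low_batch", 1, c) for c in (8, 16, 32, 64)]
    return bulk + online


if __name__ == "__main__":
    for s in focused_scenarios():
        print(f"{s.group}: batch={s.batch_size}, concurrency={s.concurrency}")
```

That gives 7 measurement points per model direction and variant.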


The two CT2 variants compared:


ct2_default


Current default: ct2_inter_threads=1, ct2_max_queued_batches=0, ct2_batch_type=examples

ct2_tuned_t4


Tuning candidate: ct2_inter_threads=2, ct2_max_queued_batches=16, ct2_batch_type=examples
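
To make the difference between the two variants concrete, here is a rough sketch of how they might map onto CTranslate2. It assumes the project's `ct2_inter_threads` and `ct2_max_queued_batches` settings are passed to the `inter_threads` and `max_queued_batches` arguments of `ctranslate2.Translator`, and that `ct2_batch_type` goes to `translate_batch(batch_type=...)`. The names `Ct2Profile`, `PROFILES`, `translator_kwargs` and `build_translator` are hypothetical and not from the project.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Ct2Profile:
    inter_threads: int
    max_queued_batches: int
    batch_type: str = "examples"  # assumed to be passed to translate_batch(), not the constructor


PROFILES = {
    "ct2_default": Ct2Profile(inter_threads=1, max_queued_batches=0),
    "ct2_tuned_t4": Ct2Profile(inter_threads=2, max_queued_batches=16),
}


def translator_kwargs(profile: Ct2Profile) -> Dict[str, Any]:
    """Constructor arguments for ctranslate2.Translator under this profile."""
    return {
        "device": "cuda",           # assumption: the T4 is the target device
        "compute_type": "float16",  # assumption: matches the float16 conversion
        "inter_threads": profile.inter_threads,
        "max_queued_batches": profile.max_queued_batches,
    }


def build_translator(model_dir: str, profile: Ct2Profile):
    """Create a translator. The import is lazy so the profile logic runs without CTranslate2."""
    import ctranslate2  # needs the ctranslate2 package and a converted model directory

    return ctranslate2.Translator(model_dir, **translator_kwargs(profile))
```

Keeping the profiles as plain data makes it straightforward to give each model its own profile rather than one global setting.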


Recommendation
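Conclusions first:


NLLB: upgrading to ct2_inter_threads=2 + ct2_max_queued_batches=16 is recommended.
opus-mt-zh-en: staying on the defaults is more stable.
opus-mt-en-zh: throughput improves at large batch sizes and at high concurrency, but p95 at online c=8 fluctuates, so the same tuned parameters are not recommended as the online default.


This is why the current configuration switches only NLLB to the tuned profile, while the two Marian models keep the conservative defaults.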
Key Results
1. NLLB is the model most worth tuning in this round
nllb-200-distilled-600m zh -> en


| Scenario | Default | Tuned | Result |
| --- | --- | --- | --- |
| batch=64, concurrency=1 items/s | 113.25 | 111.86 | Roughly flat |
| batch=64, concurrency=1 p95 ms | 662.38 | 657.84 | Roughly flat |
| batch=1, concurrency=16 items/s | 10.34 | 12.91 | Clear improvement |
| batch=1, concurrency=16 p95 ms | 1904.9 | 1368.92 | Clear drop |
| batch=1, concurrency=32 items/s | 10.17 | 12.8 | Clear improvement |
| batch=1, concurrency=32 p95 ms | 2876.88 | 2350.5 | Clear drop |


nllb-200-distilled-600m en -> zh


| Scenario | Default | Tuned | Result |
| --- | --- | --- | --- |
| batch=64, concurrency=1 items/s | 96.27 | 93.36 | Slight drop |
| batch=64, concurrency=1 p95 ms | 701.75 | 721.79 | Slightly worse |
| batch=1, concurrency=16 items/s | 5.51 | 7.91 | Clear improvement |
| batch=1, concurrency=16 p95 ms | 4613.05 | 2039.17 | Large drop |
| batch=1, concurrency=32 items/s | 5.46 | 7.9 | Clear improvement |
| batch=1, concurrency=32 p95 ms | 5554.4 | 3912.75 | Clear drop |


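Interpretation:


The tuned profile for NLLB mainly unlocks the T4's concurrency headroom.
Bulk scenarios are barely affected; zh -> en in particular is roughly flat.
Online scenarios gain the most: throughput rises and p95 falls at both concurrency=16 and concurrency=32, so this round of tuning should land on NLLB first.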
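
The size of the online gain is easier to judge as a relative change. The sketch below takes the batch=1, concurrency=16 rows from the two NLLB tables; `pct_change` and `NLLB_C16` are illustrative names, not part of the benchmark script.

```python
def pct_change(default: float, tuned: float) -> float:
    """Relative change of tuned versus default, in percent."""
    return (tuned - default) / default * 100.0


# (direction, metric) -> (default, tuned), copied from the two NLLB tables
NLLB_C16 = {
    ("zh -> en", "items/s"): (10.34, 12.91),
    ("zh -> en", "p95 ms"): (1904.9, 1368.92),
    ("en -> zh", "items/s"): (5.51, 7.91),
    ("en -> zh", "p95 ms"): (4613.05, 2039.17),
}

if __name__ == "__main__":
    for (direction, metric), (default, tuned) in NLLB_C16.items():
        print(f"{direction} {metric}: {pct_change(default, tuned):+.1f}%")
```

At concurrency=16 this comes to roughly +25% throughput and -28% p95 for zh -> en, and roughly +44% throughput and -56% p95 for en -> zh.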

2. Marian should not simply reuse NLLB's tuned parameters
opus-mt-zh-en zh -> en


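batch=64, concurrency=1: 164.1 -> 151.21 items/s, default is better
batch=1, concurrency=32: 27.5 -> 29.83 items/s, tuned is slightly better
batch=1, concurrency=64: 28.43 -> 26.85 items/s, default is better


Conclusion:


This model is already lightweight, and the default profile is better balanced.
A small gain at medium concurrency is not worth sacrificing large-batch throughput or high-concurrency stability.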


opus-mt-en-zh en -> zh


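batch=64, concurrency=1: 114.34 -> 121.87 items/s
batch=128, concurrency=1: 162.29 -> 210.29 items/s
batch=1, concurrency=16: 11.22 -> 12.65 items/s
batch=1, concurrency=8: p95 ms goes from 798.77 to 1199.98


Conclusion:


This model is more sensitive to the tuned profile, and its throughput improves clearly.
But p95 at online c=8 gets worse, so the tuned profile behaves more like a "dedicated throughput configuration" and is not suitable as the unified online default.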

T4 Experience Summary
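The lessons from this round that are actually worth keeping:


Lesson 1: Stop using the full matrix to find a direction.


Start with only the two extremes, high batch + low concurrency and high concurrency + low batch; it is more efficient.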

Lesson 2: On the T4, NLLB really does benefit from inter_threads and queue depth.


ct2_inter_threads=2
ct2_max_queued_batches=16
This pair of parameters helps most in the high-concurrency, batch=1 online scenario.

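Lesson 3: inter_threads=4 is too aggressive.


It can push some high-concurrency throughput further up.
But it severely hurts large-batch throughput, especially bulk scenarios such as batch=64.
It is therefore unsuitable as a general service default.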

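Lesson 4: ct2_batch_type=tokens is not the main source of gains on the T4 right now.


It brought no stable benefit to the batch=1 online scenario.
For the current project, keeping examples is the safer choice.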

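Lesson 5: One worker per model remains the default deployment.


This round of tuning addresses GPU utilization inside a single worker.
It does not raise throughput by adding more FastAPI workers.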


Deployment / Config Tasks Worth Keeping
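These tasks proved worth capturing in documentation and configuration:


Migrate the local Marian / NLLB models uniformly to CTranslate2
Convert with float16 and pre-generate the CT2 model directories
Keep a single worker to avoid loading the model more than once
For NLLB, enable: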


ct2_inter_threads=2
ct2_max_queued_batches=16
ct2_batch_type=examples

Marian stays on the conservative defaults:


ct2_inter_threads=1
ct2_max_queued_batches=0


Next Step
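If the next round keeps pushing online latency down, the suggested priority order is:


Service-level micro-batching queue
Short-text / long-text bucketing
Two configurations for opus-mt-en-zh: an "online default" and an "offline high-throughput" profile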
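
As a starting point for the first item, a service-level micro-batching queue might be sketched as follows. This is an assumption-laden illustration, not an existing component of the project: `MicroBatcher`, `max_batch_size` and `max_wait_ms` are hypothetical names, and `translate_batch` stands in for whatever function wraps the model call.

```python
import asyncio
from typing import Callable, List, Sequence, Tuple


class MicroBatcher:
    """Collect single-item requests for a short window and translate them as one batch.

    Construct it inside a running event loop (for example in a startup hook).
    """

    def __init__(
        self,
        translate_batch: Callable[[Sequence[str]], List[str]],
        max_batch_size: int = 16,
        max_wait_ms: float = 5.0,
    ) -> None:
        self._translate_batch = translate_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_ms / 1000.0
        self._queue: "asyncio.Queue[Tuple[str, asyncio.Future]]" = asyncio.Queue()

    async def submit(self, text: str) -> str:
        """Enqueue one text and wait for its translation."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((text, future))
        return await future

    async def run(self) -> None:
        """Worker loop: start one task running this coroutine per model."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request arrives
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                # A real service would run this blocking call in an executor.
                results = self._translate_batch([text for text, _ in batch])
            except Exception as exc:
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)
```

The trade-off is set by `max_wait_ms`: a longer window builds larger batches under load but adds up to that much latency to a lone request, so it would need to be benchmarked in the same high-concurrency, batch=1 scenario.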