

Local Translation Model Benchmark Report (CTranslate2)
Test script:


<code>scripts/benchmark_translation_local_models.py</code>


CT2 results for this run:


Markdown: <code>translation_local_models_ct2_extended_233253.md</code>
JSON: <code>translation_local_models_ct2_extended_233253.json</code>


Baseline for comparison:


Baseline README: <code>../translation_local_models/README.md</code>
Baseline Markdown: <code>../translation_local_models/translation_local_models_extended_221846.md</code>
Baseline JSON: <code>../translation_local_models/translation_local_models_extended_221846.json</code>
Comparison analysis: <code>comparison_vs_hf_baseline.md</code>


Test date:


2026-03-18


Environment:


GPU: Tesla T4 16GB
Python env: .venv-translator
Torch / Transformers: 2.10.0+cu128 / 5.3.0
CTranslate2: 4.7.1
Dataset: <code>products_analyzed.csv</code>

Method
This run uses the same parameters as the baseline, so the two sets of results can be compared directly:


suite=extended
Cache disabled: --disable-cache
batch_sweep: 256 items per setting
concurrency_sweep: 32 requests per setting
matrix: 32 requests per setting
concurrency_batch_size=1
batch_size * concurrency <= 128
Warmup: 1 batch
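
The cap `batch_size * concurrency <= 128` limits which matrix cells are run. A minimal sketch of how such a filter could look is below. The helper name `matrix_cases` and the sweep values are hypothetical and chosen only for illustration; the real lists live in the benchmark script.

```python
from itertools import product


def matrix_cases(batch_sizes, concurrencies, max_product=128):
    """Return (batch_size, concurrency) pairs whose product stays within the cap."""
    return [
        (b, c)
        for b, c in product(batch_sizes, concurrencies)
        if b * c <= max_product
    ]


# Hypothetical sweep values, not taken from the benchmark script.
for batch_size, concurrency in matrix_cases([1, 4, 16, 64], [1, 2, 8, 64]):
    print(f"batch_size={batch_size:>2} concurrency={concurrency:>2}")
```

With these example values, a cell such as (64, 8) is skipped because its product is 512, while (16, 8) is kept because it sits exactly at the cap.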


Reproduction command:
cd /data/saas-search
./.venv-translator/bin/python - <<'PY'
import json
from datetime import datetime
from pathlib import Path
from types import SimpleNamespace

from scripts.benchmark_translation_local_models import (
    SCENARIOS,
    benchmark_extended_scenario,
    build_environment_info,
    render_markdown_report,
)

output_dir = Path("perf_reports/20260318/translation_local_models_ct2")
output_dir.mkdir(parents=True, exist_ok=True)

common = dict(
    csv_path="products_analyzed.csv",
    limit=0,
    output_dir=str(output_dir),
    single=True,
    scene="sku_name",
    batch_size=0,
    device_override="",
    torch_dtype_override="",
    max_new_tokens=0,
    num_beams=0,
    attn_implementation="",
    warmup_batches=1,
    disable_cache=True,
    suite="extended",
    batch_size_list="",
    concurrency_list="",
    serial_items_per_case=256,
    concurrency_requests_per_case=32,
    concurrency_batch_size=1,
    max_batch_concurrency_product=128,
)

report = {
    "generated_at": datetime.now().isoformat(timespec="seconds"),
    "suite": "extended",
    "environment": build_environment_info(),
    "scenarios": [],
}

for scenario in SCENARIOS:
    args = SimpleNamespace(
        **common,
        model=scenario["model"],
        source_lang=scenario["source_lang"],
        target_lang=scenario["target_lang"],
        column=scenario["column"],
    )
    result = benchmark_extended_scenario(args)
    result["scenario"]["name"] = scenario["name"]
    report["scenarios"].append(result)

stamp = datetime.now().strftime("%H%M%S")
(output_dir / f"translation_local_models_ct2_extended_{stamp}.json").write_text(
    json.dumps(report, ensure_ascii=False, indent=2),
    encoding="utf-8",
)
(output_dir / f"translation_local_models_ct2_extended_{stamp}.md").write_text(
    render_markdown_report(report),
    encoding="utf-8",
)
PY
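
Once the command above has written its JSON file, the report can be inspected with a few lines of Python. This is a sketch that relies only on the keys the script sets explicitly (`generated_at`, `suite`, `scenarios`, and each result's `scenario.name`). The helper names are hypothetical, and the path assumes the file listed at the top of this README sits in the script's `output_dir`.

```python
import json
from pathlib import Path


def load_report(path):
    """Read a benchmark report written by the reproduction command."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def scenario_names(report):
    """List the scenario names stored under report["scenarios"]."""
    return [item["scenario"]["name"] for item in report["scenarios"]]


# Assumed location, relative to the repository root used in the command above.
report_path = Path(
    "perf_reports/20260318/translation_local_models_ct2/"
    "translation_local_models_ct2_extended_233253.json"
)
if report_path.exists():
    report = load_report(report_path)
    print(report["generated_at"], report["suite"])
    for name in scenario_names(report):
        print(name)
```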

Key Results
1. Single-stream batch sweep


| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---|---|---|---|
| nllb-200-distilled-600m | zh -> en | 64 | 104.61 | 55.68 | 371.36 |
| nllb-200-distilled-600m | en -> zh | 64 | 91.26 | 42.42 | 408.81 |
| opus-mt-zh-en | zh -> en | 64 | 218.5 | 111.61 | 257.18 |
| opus-mt-en-zh | en -> zh | 32 | 145.12 | 102.05 | 396.16 |


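Interpretation:


Bulk throughput in all 4 directions is clearly higher than the original Hugging Face / PyTorch baseline.
nllb en->zh shows the largest gain: its batch 16 throughput rises from 13.52 to 42.42 items/s.
For opus-mt-en-zh, the best batch size moved from 64 to 32 under CT2, which suggests it no longer needs an extremely large batch to saturate throughput.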

2. Single-item request concurrency sweep


| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---|---|---|---|
| nllb-200-distilled-600m | zh -> en | 8.97 | 163.53 | 1039.32 | 3031.64 |
| nllb-200-distilled-600m | en -> zh | 5.83 | 259.52 | 2193.01 | 5611.21 |
| opus-mt-zh-en | zh -> en | 27.85 | 60.61 | 390.32 | 1061.35 |
| opus-mt-en-zh | en -> zh | 11.02 | 351.74 | 863.08 | 2459.49 |


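Interpretation:


Online query metrics improve substantially, especially p95 and items/s at batch_size=1.
Under CT2, rising concurrency still pushes up tail latency, but the degradation is much smaller than in the baseline.
opus-mt-zh-en remains the most stable local model for online use; nllb is now in a more usable range as well.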
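
For readers unfamiliar with how columns such as `c=8 p95 ms` are typically obtained, here is a minimal, self-contained sketch of a concurrency case. It is not the benchmark script's implementation. `fake_translate` is a stand-in for a real model call, the nearest-rank percentile is one common p95 definition among several, and all function names are hypothetical.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def timed_call(translate, text):
    """Return the latency of one single-item request in milliseconds."""
    start = time.perf_counter()
    translate(text)
    return (time.perf_counter() - start) * 1000.0


def run_concurrency_case(translate, texts, concurrency):
    """Send one request per text using `concurrency` worker threads."""
    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies_ms = list(pool.map(lambda t: timed_call(translate, t), texts))
    wall_s = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "items_per_s": len(texts) / wall_s,
        "p95_ms": percentile(latencies_ms, 95),
    }


def fake_translate(text):
    """Placeholder for a model call; sleeps 5 ms instead of translating."""
    time.sleep(0.005)
    return text.upper()


for c in (1, 8):
    print(run_concurrency_case(fake_translate, ["hello"] * 32, c))
```

Because the placeholder only sleeps, its latency does not grow with concurrency. A real GPU-backed model serializes work on the device, which is why the p95 values in the table climb as concurrency rises.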

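3. Were expectations met?
Conclusion:


Yes, and by a wide margin.
This CT2 build already meets the goal of "significantly better online performance"; no further urgent throughput or latency optimization is needed.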


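Basis for this conclusion:


At concurrency=1, items/s in all 4 directions rose to 2.0x-3.1x of the baseline.
At concurrency=1, p95 in all 4 directions dropped to 29%-44% of the baseline.
For the two NLLB directions (zh -> en and en -> zh), batch_size=16 throughput reached 2.04x and 3.14x of the baseline, respectively.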
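
The ratios above are plain quotients of CT2 over baseline. As a worked check, the nllb en->zh batch 16 figures quoted in this report (13.52 items/s baseline, 42.42 items/s with CT2) reproduce the 3.14x value. The helper name `speedup` is hypothetical.

```python
def speedup(baseline, current):
    """Ratio of the new throughput to the baseline throughput."""
    return current / baseline


# nllb en->zh, batch 16: baseline 13.52 items/s, CT2 42.42 items/s.
ratio = speedup(baseline=13.52, current=42.42)
print(f"nllb en->zh batch 16: {ratio:.2f}x")  # prints 3.14x
```

The same quotient applied to p95 gives the "dropped to 29%-44%" figures, where smaller is better. The baseline values needed for those are in the baseline report and are not repeated here.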

Notes

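In this run, peak_gpu_memory_gb mostly reads 0.0. This does not mean CT2 uses no GPU memory: the script currently collects the figure via torch.cuda, which cannot observe CT2's native CUDA allocations.
If a GPU memory comparison is added later, consider adding nvidia-smi sampling or NVML metric collection.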
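
One possible shape for the nvidia-smi sampling suggested above is sketched below. It is not part of the current script. The class and function names are hypothetical, and it assumes `nvidia-smi` is on `PATH`. Note that `memory.used` is device-wide, so it also counts memory held by other processes on the same GPU.

```python
import subprocess
import threading


def parse_memory_used_mib(output):
    """Parse nvidia-smi CSV output (no header, no units): one MiB value per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_memory_used_mib():
    """Query current device-wide memory usage for every visible GPU."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_memory_used_mib(output)


class PeakMemorySampler:
    """Background sampler that tracks peak used memory (MiB) on one GPU."""

    def __init__(self, reader=read_memory_used_mib, interval_s=0.1, gpu_index=0):
        self._reader = reader
        self._interval_s = interval_s
        self._gpu_index = gpu_index
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_mib = 0

    def sample(self):
        """Take one reading and update the running peak."""
        self.peak_mib = max(self.peak_mib, self._reader()[self._gpu_index])

    def _run(self):
        while not self._stop.is_set():
            self.sample()
            self._stop.wait(self._interval_s)

    def __enter__(self):
        self.sample()  # guarantee at least one reading
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()


# Usage sketch (run_benchmark is a hypothetical placeholder):
# with PeakMemorySampler() as sampler:
#     run_benchmark()
# print(f"peak_gpu_memory_gb={sampler.peak_mib / 1024:.2f}")
```

Polling an external process is coarse, so short allocation spikes between samples can be missed. An NVML binding would allow cheaper, more frequent sampling if that matters.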