# Local Translation Model Benchmark Report (CTranslate2)

Benchmark script:

- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)

CT2 results for this run:

- Markdown: [`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md)
- JSON: [`translation_local_models_ct2_extended_233253.json`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.json)

Baseline for comparison:

- Baseline README: [`../translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
- Baseline Markdown: [`../translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- Baseline JSON: [`../translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
- Comparison analysis: [`comparison_vs_hf_baseline.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/comparison_vs_hf_baseline.md)

Benchmark date:

- `2026-03-18`

Environment:

- GPU: `Tesla T4 16GB`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- CTranslate2: `4.7.1`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)

## Method

Parameters are identical to the baseline run, so the two reports can be compared directly:

- `suite=extended`
- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per step
- `concurrency_sweep`: `32` requests per step
- `matrix`: `32` requests per step
- `concurrency_batch_size=1`
- `batch_size * concurrency <= 128`
- Warmup: `1` batch

Reproduction command:

```bash
cd /data/saas-search
./.venv-translator/bin/python - <<'PY'
import json
from datetime import datetime
from pathlib import Path
from types import SimpleNamespace

from benchmarks.translation.benchmark_translation_local_models import (
    SCENARIOS,
    benchmark_extended_scenario,
    build_environment_info,
    render_markdown_report,
)

output_dir = Path("perf_reports/20260318/translation_local_models_ct2")
output_dir.mkdir(parents=True, exist_ok=True)

common = dict(
    csv_path="products_analyzed.csv",
    limit=0,
    output_dir=str(output_dir),
    single=True,
    scene="sku_name",
    batch_size=0,
    device_override="",
    torch_dtype_override="",
    max_new_tokens=0,
    num_beams=0,
    attn_implementation="",
    warmup_batches=1,
    disable_cache=True,
    suite="extended",
    batch_size_list="",
    concurrency_list="",
    serial_items_per_case=256,
    concurrency_requests_per_case=32,
    concurrency_batch_size=1,
    max_batch_concurrency_product=128,
)

report = {
    "generated_at": datetime.now().isoformat(timespec="seconds"),
    "suite": "extended",
    "environment": build_environment_info(),
    "scenarios": [],
}

for scenario in SCENARIOS:
    args = SimpleNamespace(
        **common,
        model=scenario["model"],
        source_lang=scenario["source_lang"],
        target_lang=scenario["target_lang"],
        column=scenario["column"],
    )
    result = benchmark_extended_scenario(args)
    result["scenario"]["name"] = scenario["name"]
    report["scenarios"].append(result)

stamp = datetime.now().strftime("%H%M%S")
(output_dir / f"translation_local_models_ct2_extended_{stamp}.json").write_text(
    json.dumps(report, ensure_ascii=False, indent=2),
    encoding="utf-8",
)
(output_dir / f"translation_local_models_ct2_extended_{stamp}.md").write_text(
    render_markdown_report(report),
    encoding="utf-8",
)
PY
```

## Key Results
### 1. Single-stream batch sweep

| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `104.61` | `55.68` | `371.36` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `91.26` | `42.42` | `408.81` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `218.5` | `111.61` | `257.18` |
| `opus-mt-en-zh` | `en -> zh` | `32` | `145.12` | `102.05` | `396.16` |

Takeaways:

- Bulk throughput in all 4 directions is clearly higher than the original Hugging Face / PyTorch baseline.
- `nllb en->zh` batch-16 throughput rose from `13.52` to `42.42 items/s`, the largest single gain.
- For `opus-mt-en-zh`, the best batch size dropped from `64` to `32` under CT2, meaning it no longer needs an extreme batch size to saturate throughput.

### 2. Single-request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `8.97` | `163.53` | `1039.32` | `3031.64` |
| `nllb-200-distilled-600m` | `en -> zh` | `5.83` | `259.52` | `2193.01` | `5611.21` |
| `opus-mt-zh-en` | `zh -> en` | `27.85` | `60.61` | `390.32` | `1061.35` |
| `opus-mt-en-zh` | `en -> zh` | `11.02` | `351.74` | `863.08` | `2459.49` |

Takeaways:

- Online query metrics improved substantially, especially `p95` and `items/s` at `batch_size=1`.
- Tail latency still rises with concurrency under CT2, but the degradation is much milder than in the baseline.
- `opus-mt-zh-en` remains the most stable local model for the online path; `nllb` has now moved into a genuinely usable range as well.

### 3. Did the run meet expectations?

Conclusion:

- **Yes, by a wide margin.**
- This CT2 run already meets the goal of "significantly better online performance"; no further urgent throughput/latency optimization is needed.

Supporting evidence:

- `items/s` at `concurrency=1` improved to `2.0x-3.1x` of baseline in all 4 directions.
- `p95` at `concurrency=1` dropped to `29%-44%` of baseline in all 4 directions.
- NLLB `batch_size=16` throughput improved by `2.04x` and `3.14x` in its two directions.

## Notes

- `peak_gpu_memory_gb` mostly reads `0.0` in this run. That does not mean "CT2 uses no GPU memory": the current script collects stats via `torch.cuda`, which cannot observe CT2's native CUDA allocations.
- To add a "GPU memory comparison" dimension later, add `nvidia-smi` sampling or NVML metric collection.
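The `nvidia-smi` sampling suggested above could be sketched as a background polling thread. This is a hypothetical helper, not part of the benchmark script: `parse_memory_mib` and `sample_gpu_memory` are names introduced here for illustration, and the approach assumes `nvidia-smi` is on `PATH`.

```python
import subprocess
import threading
import time


def parse_memory_mib(smi_output: str) -> list[int]:
    # Parse the output of:
    #   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    # One integer (MiB) per line, one line per GPU.
    return [int(line.strip()) for line in smi_output.strip().splitlines() if line.strip()]


def sample_gpu_memory(stop_event: threading.Event,
                      samples: list[int],
                      interval_s: float = 0.5) -> None:
    # Poll nvidia-smi until stop_event is set, recording the max used
    # memory (MiB) across GPUs at each sample point.
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples.append(max(parse_memory_mib(out)))
        time.sleep(interval_s)
```

Usage would be to start a daemon thread with `sample_gpu_memory` before the benchmark loop, set the event afterwards, and report `max(samples)` as the peak; unlike `torch.cuda` statistics, this sees CT2's native allocations. NVML bindings (e.g. `pynvml`) would avoid the subprocess overhead at the cost of an extra dependency.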