Local Translation Model Benchmark Report (CTranslate2)

Test script:

<code>benchmarks/translation/benchmark_translation_local_models.py</code>

CT2 results from this run:

Comparison baseline:

Baseline README: <code>../translation_local_models/README.md</code>
Baseline Markdown: <code>../translation_local_models/translation_local_models_extended_221846.md</code>
Baseline JSON: <code>../translation_local_models/translation_local_models_extended_221846.json</code>
Comparison analysis: <code>comparison_vs_hf_baseline.md</code>

Test date:

2026-03-18

Environment:

GPU: Tesla T4 16GB
Python env: .venv-translator
Torch / Transformers: 2.10.0+cu128 / 5.3.0
CTranslate2: 4.7.1
Dataset: <code>products_analyzed.csv</code>

Method

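The parameters for this run match the baseline so that the results can be compared directly:

suite=extended
Cache disabled: --disable-cache
batch_sweep: 256 items per batch size
concurrency_sweep: 32 requests per concurrency level
matrix: 32 requests per combination
concurrency_batch_size=1
batch_size * concurrency <= 128
Warmup: 1 batch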
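
The constraint batch_size * concurrency <= 128 bounds which matrix combinations are run. A minimal sketch of that filter follows. The sweep levels in it are hypothetical placeholders (the actual lists are defined in the benchmark script and are not reproduced in this report), and `matrix_cases` is an illustrative helper, not a function from the script.

```python
from itertools import product

# Hypothetical sweep levels for illustration only; the real lists
# live in the benchmark script.
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 64]

# Same limit as max_batch_concurrency_product in the reproduction command.
MAX_BATCH_CONCURRENCY_PRODUCT = 128


def matrix_cases(batch_sizes, concurrency_levels, max_product):
    """Return (batch_size, concurrency) pairs whose product is within max_product."""
    return [
        (b, c)
        for b, c in product(batch_sizes, concurrency_levels)
        if b * c <= max_product
    ]


cases = matrix_cases(BATCH_SIZES, CONCURRENCY_LEVELS, MAX_BATCH_CONCURRENCY_PRODUCT)
```

With these placeholder levels, (64, 2) is kept because 64 * 2 = 128, while (64, 4) is dropped.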

Reproduction command:

cd /data/saas-search
./.venv-translator/bin/python - <<'PY'
import json
from datetime import datetime
from pathlib import Path
from types import SimpleNamespace

from benchmarks.translation.benchmark_translation_local_models import (
    SCENARIOS,
    benchmark_extended_scenario,
    build_environment_info,
    render_markdown_report,
)

output_dir = Path("perf_reports/20260318/translation_local_models_ct2")
output_dir.mkdir(parents=True, exist_ok=True)

common = dict(
    csv_path="products_analyzed.csv",
    limit=0,
    output_dir=str(output_dir),
    single=True,
    scene="sku_name",
    batch_size=0,
    device_override="",
    torch_dtype_override="",
    max_new_tokens=0,
    num_beams=0,
    attn_implementation="",
    warmup_batches=1,
    disable_cache=True,
    suite="extended",
    batch_size_list="",
    concurrency_list="",
    serial_items_per_case=256,
    concurrency_requests_per_case=32,
    concurrency_batch_size=1,
    max_batch_concurrency_product=128,
)

report = {
    "generated_at": datetime.now().isoformat(timespec="seconds"),
    "suite": "extended",
    "environment": build_environment_info(),
    "scenarios": [],
}

for scenario in SCENARIOS:
    args = SimpleNamespace(
        **common,
        model=scenario["model"],
        source_lang=scenario["source_lang"],
        target_lang=scenario["target_lang"],
        column=scenario["column"],
    )
    result = benchmark_extended_scenario(args)
    result["scenario"]["name"] = scenario["name"]
    report["scenarios"].append(result)

stamp = datetime.now().strftime("%H%M%S")
(output_dir / f"translation_local_models_ct2_extended_{stamp}.json").write_text(
    json.dumps(report, ensure_ascii=False, indent=2),
    encoding="utf-8",
)
(output_dir / f"translation_local_models_ct2_extended_{stamp}.md").write_text(
    render_markdown_report(report),
    encoding="utf-8",
)
PY

Key Results

1. Single-stream batch sweep

| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
| --- | --- | --- | --- | --- | --- |
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `104.61` | `55.68` | `371.36` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `91.26` | `42.42` | `408.81` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `218.5` | `111.61` | `257.18` |
| `opus-mt-en-zh` | `en -> zh` | `32` | `145.12` | `102.05` | `396.16` |

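Interpretation:

Bulk throughput in all 4 directions is clearly higher than the original Hugging Face / PyTorch baseline.
nllb en->zh shows the largest gain: its batch 16 throughput rises from 13.52 to 42.42 items/s.
For opus-mt-en-zh, the best batch size drops from 64 to 32 in the CT2 version, which suggests it no longer needs an extremely large batch to saturate throughput.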

2. Single-item request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
| --- | --- | --- | --- | --- | --- |
| `nllb-200-distilled-600m` | `zh -> en` | `8.97` | `163.53` | `1039.32` | `3031.64` |
| `nllb-200-distilled-600m` | `en -> zh` | `5.83` | `259.52` | `2193.01` | `5611.21` |
| `opus-mt-zh-en` | `zh -> en` | `27.85` | `60.61` | `390.32` | `1061.35` |
| `opus-mt-en-zh` | `en -> zh` | `11.02` | `351.74` | `863.08` | `2459.49` |

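Interpretation:

Online query metrics improve substantially, especially p95 and items/s at batch_size=1.
Under CT2, higher concurrency still pushes up tail latency, but the degradation is much smaller than in the baseline.
opus-mt-zh-en remains the most stable local model for online scenarios; nllb has now also moved into a more usable range.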

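3. Were expectations met?

Conclusion:

Yes, and by a wide margin.
This CT2 version meets the goal of "significantly improved online performance"; no further urgent throughput/latency optimization is needed.

Basis:

In all 4 directions, items/s at concurrency=1 rises to 2.0x-3.1x of the baseline.
In all 4 directions, p95 at concurrency=1 falls to 29%-44% of the baseline.
For the two NLLB directions, batch_size=16 throughput reaches 2.04x and 3.14x of the baseline, respectively.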
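
The multipliers above are plain throughput ratios (CT2 value divided by baseline value). As a worked check, the sketch below recomputes the nllb en->zh batch 16 figure from the two numbers quoted in this report, 13.52 and 42.42 items/s. The `speedup` helper is illustrative and not part of the benchmark script.

```python
def speedup(baseline_items_per_s: float, new_items_per_s: float) -> float:
    """Throughput ratio: new throughput divided by baseline throughput."""
    if baseline_items_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return new_items_per_s / baseline_items_per_s


# nllb en->zh, batch_size=16: baseline 13.52 items/s, CT2 42.42 items/s.
nllb_en_zh_ratio = speedup(13.52, 42.42)
print(f"{nllb_en_zh_ratio:.2f}x")  # prints "3.14x"
```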

Notes

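In this run peak_gpu_memory_gb mostly reads 0.0. This does not mean that CT2 uses no GPU memory: the script collects this statistic through torch.cuda, which cannot observe CT2's native CUDA allocations.
If a GPU memory comparison is added later, consider adding nvidia-smi sampling or NVML metric collection.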
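
One possible shape for the nvidia-smi sampling is sketched below: a background thread polls `nvidia-smi --query-gpu=memory.used` and records the peak. This is a sketch under assumptions, not the script's implementation. `PeakMemorySampler`, `read_memory_used_mib`, and `parse_memory_used_mib` are hypothetical names, and the reading is device-wide, so it includes memory held by any other process on the GPU.

```python
import subprocess
import threading

# nvidia-smi query flags; with nounits, memory.used is reported in MiB,
# one line per GPU.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used_mib(output: str) -> list[int]:
    """Parse nvidia-smi output into one MiB value per GPU."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_memory_used_mib() -> list[int]:
    """Run nvidia-smi once and return used memory per GPU in MiB."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return parse_memory_used_mib(result.stdout)


class PeakMemorySampler:
    """Hypothetical helper: poll a reader in the background and keep the peak."""

    def __init__(self, gpu_index=0, interval_s=0.1, reader=read_memory_used_mib):
        self.gpu_index = gpu_index
        self.interval_s = interval_s
        self.reader = reader  # injectable so the class can be tested without a GPU
        self.peak_mib = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            used = self.reader()[self.gpu_index]
            self.peak_mib = max(self.peak_mib, used)
            self._stop.wait(self.interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False
```

The benchmark call would be wrapped in `with PeakMemorySampler() as sampler:` and `sampler.peak_mib / 1024` reported in GiB afterwards. Spawning nvidia-smi on every poll adds overhead and can miss short spikes; NVML bindings would avoid the subprocess cost.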
