# Local Translation Model Benchmark Report (CTranslate2)

Benchmark script:

- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)

CT2 results for this run:

- Markdown: [`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md)
- JSON: [`translation_local_models_ct2_extended_233253.json`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.json)

Baseline for comparison:

- Baseline README: [`../translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
- Baseline Markdown: [`../translation_local_models/translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- Baseline JSON: [`../translation_local_models/translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
- Comparison analysis: [`comparison_vs_hf_baseline.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/comparison_vs_hf_baseline.md)

Benchmark date:

- `2026-03-18`

Environment:

- GPU: `Tesla T4 16GB`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- CTranslate2: `4.7.1`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)

## Method

Parameters are identical to the baseline run, so the two reports can be compared directly:

- `suite=extended`
- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per step
- `concurrency_sweep`: `32` requests per step
- `matrix`: `32` requests per step
- `concurrency_batch_size=1`
- `batch_size * concurrency <= 128`
- Warmup: `1` batch

Reproduction command:

```bash
cd /data/saas-search
./.venv-translator/bin/python - <<'PY'
import json
from datetime import datetime
from pathlib import Path
from types import SimpleNamespace

from benchmarks.translation.benchmark_translation_local_models import (
    SCENARIOS,
    benchmark_extended_scenario,
    build_environment_info,
    render_markdown_report,
)

output_dir = Path("perf_reports/20260318/translation_local_models_ct2")
output_dir.mkdir(parents=True, exist_ok=True)

common = dict(
    csv_path="products_analyzed.csv",
    limit=0,
    output_dir=str(output_dir),
    single=True,
    scene="sku_name",
    batch_size=0,
    device_override="",
    torch_dtype_override="",
    max_new_tokens=0,
    num_beams=0,
    attn_implementation="",
    warmup_batches=1,
    disable_cache=True,
    suite="extended",
    batch_size_list="",
    concurrency_list="",
    serial_items_per_case=256,
    concurrency_requests_per_case=32,
    concurrency_batch_size=1,
    max_batch_concurrency_product=128,
)

report = {
    "generated_at": datetime.now().isoformat(timespec="seconds"),
    "suite": "extended",
    "environment": build_environment_info(),
    "scenarios": [],
}

for scenario in SCENARIOS:
    args = SimpleNamespace(
        **common,
        model=scenario["model"],
        source_lang=scenario["source_lang"],
        target_lang=scenario["target_lang"],
        column=scenario["column"],
    )
    result = benchmark_extended_scenario(args)
    result["scenario"]["name"] = scenario["name"]
    report["scenarios"].append(result)

stamp = datetime.now().strftime("%H%M%S")
(output_dir / f"translation_local_models_ct2_extended_{stamp}.json").write_text(
    json.dumps(report, ensure_ascii=False, indent=2),
    encoding="utf-8",
)
(output_dir / f"translation_local_models_ct2_extended_{stamp}.md").write_text(
    render_markdown_report(report),
    encoding="utf-8",
)
PY
```

## Key Results
### 1. Single-stream batch sweep

| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `104.61` | `55.68` | `371.36` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `91.26` | `42.42` | `408.81` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `218.5` | `111.61` | `257.18` |
| `opus-mt-en-zh` | `en -> zh` | `32` | `145.12` | `102.05` | `396.16` |

Takeaways:

- Bulk throughput in all 4 directions is clearly higher than the original Hugging Face / PyTorch baseline.
- `nllb en->zh` batch-16 throughput rose from `13.52` to `42.42 items/s`, the largest single gain.
- For `opus-mt-en-zh`, the best batch size dropped from `64` to `32` under CT2, meaning it no longer needs an extreme batch size to saturate throughput.

### 2. Single-request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `8.97` | `163.53` | `1039.32` | `3031.64` |
| `nllb-200-distilled-600m` | `en -> zh` | `5.83` | `259.52` | `2193.01` | `5611.21` |
| `opus-mt-zh-en` | `zh -> en` | `27.85` | `60.61` | `390.32` | `1061.35` |
| `opus-mt-en-zh` | `en -> zh` | `11.02` | `351.74` | `863.08` | `2459.49` |

Takeaways:

- Online query metrics improved substantially, especially `p95` and `items/s` at `batch_size=1`.
- Tail latency still rises with concurrency under CT2, but the degradation is much milder than in the baseline.
- `opus-mt-zh-en` remains the most stable local model for the online path; `nllb` has now moved into a genuinely usable range as well.

### 3. Did the run meet expectations?

Conclusion:

- **Yes, by a wide margin.**
- This CT2 run already meets the goal of "significantly better online performance"; no further urgent throughput/latency optimization is needed.

Supporting evidence:

- `items/s` at `concurrency=1` improved to `2.0x-3.1x` of baseline in all 4 directions.
- `p95` at `concurrency=1` dropped to `29%-44%` of baseline in all 4 directions.
- NLLB `batch_size=16` throughput improved by `2.04x` and `3.14x` in its two directions.

## Notes

- `peak_gpu_memory_gb` mostly reads `0.0` in this run. That does not mean "CT2 uses no GPU memory": the current script collects stats via `torch.cuda`, which cannot observe CT2's native CUDA allocations.
- To add a "GPU memory comparison" dimension later, add `nvidia-smi` sampling or NVML metric collection.
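The `nvidia-smi` sampling suggested above could be sketched as a background polling thread. This is a hypothetical helper, not part of the benchmark script: `parse_memory_mib` and `sample_gpu_memory` are names introduced here for illustration, and the approach assumes `nvidia-smi` is on `PATH`.

```python
import subprocess
import threading
import time


def parse_memory_mib(smi_output: str) -> list[int]:
    # Parse the output of:
    #   nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits
    # One integer (MiB) per line, one line per GPU.
    return [int(line.strip()) for line in smi_output.strip().splitlines() if line.strip()]


def sample_gpu_memory(stop_event: threading.Event,
                      samples: list[int],
                      interval_s: float = 0.5) -> None:
    # Poll nvidia-smi until stop_event is set, recording the max used
    # memory (MiB) across GPUs at each sample point.
    while not stop_event.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples.append(max(parse_memory_mib(out)))
        time.sleep(interval_s)
```

Usage would be to start a daemon thread with `sample_gpu_memory` before the benchmark loop, set the event afterwards, and report `max(samples)` as the peak; unlike `torch.cuda` statistics, this sees CT2's native allocations. NVML bindings (e.g. `pynvml`) would avoid the subprocess overhead at the cost of an extra dependency.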