Local Translation Model Benchmark Report

Test script: <code>scripts/benchmark_translation_local_models.py</code>

Test time: 2026-03-17

Environment:

Method:

  • opus-mt-zh-en and opus-mt-en-zh were benchmarked on the full dataset using their configured runtime settings from <code>config/config.yaml</code>.
  • nllb-200-distilled-600m could not complete GPU cold start in the current co-resident environment because GPU memory was already heavily occupied by other long-running services.
  • For nllb-200-distilled-600m, I therefore ran CPU baselines on a 128-row sample from the same CSV, using device=cpu, torch_dtype=float32, batch_size=4, and then estimated full-dataset runtime from measured throughput.
  • Quality was intentionally not evaluated; this report is performance-only.
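The measurement loop behind these numbers can be sketched as below. This is a minimal, self-contained sketch, not the actual script: `translate_batch` is a hypothetical stand-in for the real model call, and the bookkeeping mirrors the metrics reported in the tables (items/s, average per-item latency, batch p50/p95).

```python
import time
from statistics import quantiles

def benchmark(rows, translate_batch, batch_size=4):
    """Time each batch and derive the throughput/latency metrics used in
    this report. `translate_batch` is a hypothetical stand-in for the
    actual model call."""
    batch_ms = []
    done = 0
    start = time.perf_counter()
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        t0 = time.perf_counter()
        translate_batch(batch)
        batch_ms.append((time.perf_counter() - t0) * 1000.0)
        done += len(batch)
    total_s = time.perf_counter() - start
    pct = quantiles(batch_ms, n=100, method="inclusive")
    return {
        "items_per_s": done / total_s,
        "avg_item_ms": total_s * 1000.0 / done,
        "batch_p50_ms": pct[49],
        "batch_p95_ms": pct[94],
    }

# Dummy translator so the sketch runs without loading a model.
stats = benchmark(["hello"] * 16, lambda b: time.sleep(0.001 * len(b)))
```

The full-dataset runtime estimates for nllb below follow directly from `items_per_s` measured this way on the 128-row sample.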

Current GPU co-residency at benchmark time:

  • text-embeddings-router: about 1.3 GiB
  • clip_server: about 2.0 GiB
  • VLLM::EngineCore: about 7.2 GiB
  • api.translator_app process: about 2.8 GiB
  • Total occupied before nllb cold start: about 13.4 / 16 GiB
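A back-of-the-envelope budget shows why cold start fails under this co-residency. The weight-size arithmetic is exact for fp32; the CUDA context overhead figure is an assumption for illustration, and activation memory comes on top of it.

```python
# Rough VRAM budget for loading nllb-200-distilled-600M on the shared T4.
total_gib = 16.0
occupied_gib = 13.4                      # resident services listed above
free_gib = total_gib - occupied_gib      # ~2.6 GiB of headroom

params = 600e6
weights_fp32_gib = params * 4 / 2**30    # ~2.24 GiB of weights alone
cuda_context_gib = 0.6                   # assumed per-process CUDA overhead

needed_gib = weights_fp32_gib + cuda_context_gib
print(f"free ~ {free_gib:.1f} GiB, needed >= {needed_gib:.1f} GiB "
      "(before activations)")
```

Even before any activations, the fp32 weights plus process overhead exceed the available headroom, which is consistent with the OOM observed at cold start.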

Operational finding:

  • facebook/nllb-200-distilled-600M cannot be reliably loaded on the current shared T4 node together with the existing long-running services above.
  • This is not a model-quality issue; it is a deployment-capacity issue.

Summary

| Model | Direction | Device | Rows | Load (s) | Translate (s) | Items/s | Avg item (ms) | Batch p50 (ms) | Batch p95 (ms) | Peak GPU (GiB) | Success rate |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| opus-mt-zh-en | zh -> en | cuda | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 | 0.382 | 1.000000 |
| opus-mt-en-zh | en -> zh | cuda | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 | 0.379 | 0.999569 |
| nllb-200-distilled-600m | zh -> en | cpu | 128 | 4.4589 | 132.3088 | 0.97 | 1033.662 | 3853.39 | 6896.14 | 0.0 | 1.000000 |
| nllb-200-distilled-600m | en -> zh | cpu | 128 | 4.5039 | 317.8845 | 0.40 | 2483.473 | 6138.87 | 35134.11 | 0.0 | 1.000000 |

Detailed Findings

1. opus-mt-zh-en

  • Full dataset, title_cn -> en, scene=sku_name
  • Throughput: 37.32 items/s
  • Average per-item latency: 26.795 ms
  • Batch latency: p50 301.99 ms, p95 1835.81 ms, max 2181.61 ms
  • Input throughput: 1179.47 chars/s
  • Peak GPU allocated: 0.382 GiB
  • Peak GPU reserved: 0.473 GiB
  • Max RSS: 1355.21 MB
  • Success count: 18576/18576

Interpretation:

  • This was the fastest of the three new local models in this benchmark.
  • It is a strong candidate for large-scale zh -> en title translation on the current machine.

2. opus-mt-en-zh

  • Full dataset, title -> zh, scene=sku_name
  • Throughput: 18.81 items/s
  • Average per-item latency: 53.155 ms
  • Batch latency: p50 449.14 ms, p95 2012.12 ms, max 2210.03 ms
  • Input throughput: 2081.66 chars/s
  • Peak GPU allocated: 0.379 GiB
  • Peak GPU reserved: 0.473 GiB
  • Max RSS: 1376.72 MB
  • Success count: 18568/18576
  • Failure count: 8

Interpretation:

  • Roughly half the item throughput of opus-mt-zh-en.
  • Still practical on this T4 for offline bulk translation.
  • The 8 failed items are a runtime-stability signal worth keeping an eye on for production batch jobs, even though quality was not checked here.
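One practical mitigation for sporadic batch failures like these, sketched below with a hypothetical `translate_batch` stand-in: attempt the whole batch first, and on failure retry items one at a time so a single bad row does not fail its batchmates.

```python
def translate_with_fallback(batch, translate_batch):
    """Try the whole batch; on failure, retry items individually so one
    bad row doesn't take down the rest. Failures come back as None."""
    try:
        return list(zip(batch, translate_batch(batch)))
    except Exception:
        out = []
        for item in batch:
            try:
                out.append((item, translate_batch([item])[0]))
            except Exception:
                out.append((item, None))  # record the failure, keep going
        return out

# Toy translator that fails on one specific input, to exercise the fallback.
def toy(batch):
    if "bad" in batch:
        raise ValueError("model error")
    return [s.upper() for s in batch]

results = translate_with_fallback(["a", "bad", "c"], toy)
```

With this pattern, the 8 failures would cost only 8 rows rather than 8 whole batches.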

3. nllb-200-distilled-600m

GPU result in the current shared environment:

  • Cold start failed with CUDA OOM before benchmark could begin.
  • Root cause was insufficient free VRAM on the shared T4, not a script error.

CPU baseline, zh -> en:

  • Sample size: 128
  • Throughput: 0.97 items/s
  • Average per-item latency: 1033.662 ms
  • Batch latency: p50 3853.39 ms, p95 6896.14 ms, max 8039.91 ms
  • Max RSS: 3481.75 MB
  • Estimated full-dataset runtime at this throughput: about 19,150.52 s = 319.18 min = 5.32 h

CPU baseline, en -> zh:

  • Sample size: 128
  • Throughput: 0.40 items/s
  • Average per-item latency: 2483.473 ms
  • Batch latency: p50 6138.87 ms, p95 35134.11 ms, max 37388.36 ms
  • Max RSS: 3483.60 MB
  • Estimated full-dataset runtime at this throughput: about 46,440 s = 774 min = 12.9 h
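The full-dataset estimates above are simple proportional extrapolations from the measured sample throughput; the row count is taken from the full-dataset runs:

```python
# Reproduce the full-dataset runtime estimates from measured CPU throughput.
rows = 18576  # full dataset size, from the GPU runs above

def estimate(items_per_s):
    seconds = rows / items_per_s
    return seconds, seconds / 60, seconds / 3600

zh_en = estimate(0.97)  # ~19,150 s = ~319 min = ~5.3 h
en_zh = estimate(0.40)  # ~46,440 s = ~774 min = ~12.9 h
```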

Interpretation:

  • In the current node layout, nllb is not a good fit for shared-GPU online service.
  • CPU fallback is functionally available but far slower than the Marian models.
  • If nllb is still desired, it should be considered for isolated GPU deployment, dedicated batch nodes, or lower-frequency offline tasks.

Practical Ranking On This Machine

By usable real-world performance on the current node:

  1. opus-mt-zh-en
  2. opus-mt-en-zh
  3. nllb-200-distilled-600m

By deployment friendliness on the current shared T4:

  1. opus-mt-zh-en
  2. opus-mt-en-zh
  3. nllb-200-distilled-600m (it currently cannot cold-start on GPU alongside the existing resident services)