Local Translation Model Benchmark Report

Test script: `scripts/benchmark_translation_local_models.py`

Test time: 2026-03-17

Environment:

Method:

  • opus-mt-zh-en and opus-mt-en-zh were benchmarked on the full dataset using their configured production settings.
  • nllb-200-distilled-600m was benchmarked on a 500-row subset after optimization.
  • nllb-200-distilled-600m was also benchmarked with batch_size=1 on a 100-row subset to approximate online query translation latency.
  • This report only keeps the final optimized results and final deployment recommendation.
  • Quality was intentionally not evaluated; this is a performance-only report.
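The measurement loop above can be sketched as follows. This is a stand-in, not the real script (which lives in `scripts/benchmark_translation_local_models.py`): it times each batch, then derives throughput, per-item cost, and batch latency percentiles, using a dummy translator in place of a model.

```python
import time

def benchmark(translate_batch, rows, batch_size=16):
    """Time translate_batch over rows; report throughput and batch latency percentiles."""
    batch_ms = []
    start = time.perf_counter()
    for i in range(0, len(rows), batch_size):
        t0 = time.perf_counter()
        translate_batch(rows[i:i + batch_size])
        batch_ms.append((time.perf_counter() - t0) * 1000)
    total_s = time.perf_counter() - start
    batch_ms.sort()

    def pct(q):
        # Nearest-rank percentile over sorted batch latencies.
        return batch_ms[min(len(batch_ms) - 1, int(q * len(batch_ms)))]

    return {
        "translate_s": total_s,
        "items_per_s": len(rows) / total_s,
        "avg_item_ms": total_s * 1000 / len(rows),
        "batch_p50_ms": pct(0.50),
        "batch_p95_ms": pct(0.95),
    }

# Stand-in "translator" so the sketch runs without a model.
stats = benchmark(lambda batch: [s.upper() for s in batch], ["hello world"] * 100)
```

With `batch_size=1` the same loop yields the single-request numbers reported later, since each batch then contains exactly one request.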

Final Production-Like Config

For nllb-200-distilled-600m, the final recommended config on Tesla T4 is:

```yaml
nllb-200-distilled-600m:
  enabled: true
  backend: "local_nllb"
  model_id: "facebook/nllb-200-distilled-600M"
  model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
  device: "cuda"
  torch_dtype: "float16"
  batch_size: 16
  max_input_length: 256
  max_new_tokens: 64
  num_beams: 1
  attn_implementation: "sdpa"
```
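As a rough illustration of how this config fragment splits between model loading and generation, the sketch below maps it into two kwarg dicts. The helper name `load_kwargs` is hypothetical, not the project's actual code; the commented-out `transformers` call shows the assumed loading path (it needs a GPU and downloaded weights, so it is not executed here).

```python
# Inline copy of the YAML config above, for illustration only.
CONFIG = {
    "model_id": "facebook/nllb-200-distilled-600M",
    "device": "cuda",
    "torch_dtype": "float16",
    "batch_size": 16,
    "max_input_length": 256,
    "max_new_tokens": 64,
    "num_beams": 1,
    "attn_implementation": "sdpa",
}

def load_kwargs(cfg):
    """Hypothetical split: load-time kwargs vs. generate-time kwargs."""
    load = {
        "torch_dtype": cfg["torch_dtype"],
        "attn_implementation": cfg["attn_implementation"],
    }
    generate = {
        "max_new_tokens": cfg["max_new_tokens"],
        "num_beams": cfg["num_beams"],
    }
    return load, generate

model_kwargs, gen_kwargs = load_kwargs(CONFIG)

# Assumed loading path (requires GPU + weights; not run here):
# import torch
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained(CONFIG["model_id"], src_lang="zho_Hans")
# model = AutoModelForSeq2SeqLM.from_pretrained(
#     CONFIG["model_id"], torch_dtype=torch.float16, attn_implementation="sdpa"
# ).to(CONFIG["device"])
```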

What actually helped:

  • cuda + float16
  • batch_size=16
  • max_new_tokens=64
  • attn_implementation=sdpa

What did not become the final recommendation:

  • batch_size=32: throughput improves further, but tail latency degrades too much for a balanced default.

Final Results

| Model | Direction | Device | Rows | Load (s) | Translate (s) | Items/s | Avg item (ms) | Batch p50 (ms) | Batch p95 (ms) |
|---|---|---|---|---|---|---|---|---|---|
| opus-mt-zh-en | zh -> en | cuda | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
| opus-mt-en-zh | en -> zh | cuda | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
| nllb-200-distilled-600m | zh -> en | cuda | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
| nllb-200-distilled-600m | en -> zh | cuda | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
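The derived columns in this table are internally consistent: Items/s is Rows divided by Translate s, and Avg item ms is the inverse in milliseconds. A quick check on the opus-mt-zh-en row:

```python
# Cross-check derived columns for the opus-mt-zh-en row of the table above.
rows, translate_s = 18576, 497.7513

items_per_s = rows / translate_s          # matches the reported 37.32
avg_item_ms = translate_s * 1000 / rows   # matches the reported 26.795
```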

Single-Request Latency

To model online search query translation, we reran NLLB with batch_size=1. In this mode, batch latency is request latency.

| Model | Direction | Rows | Load (s) | Translate (s) | Avg req (ms) | Req p50 (ms) | Req p95 (ms) | Req max (ms) | Items/s |
|---|---|---|---|---|---|---|---|---|---|
| nllb-200-distilled-600m | zh -> en | 100 | 6.8390 | 32.1909 | 321.909 | 292.54 | 624.12 | 819.67 | 3.11 |
| nllb-200-distilled-600m | en -> zh | 100 | 6.8249 | 54.2470 | 542.470 | 481.61 | 1171.71 | 1751.85 | 1.84 |
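Because each batch holds exactly one request in this mode, Avg req ms and Items/s fall straight out of Translate s, as the zh -> en row illustrates:

```python
# Single-request mode: batch latency == request latency (zh -> en row above).
rows, translate_s = 100, 32.1909

avg_req_ms = translate_s * 1000 / rows   # matches the reported 321.909
items_per_s = rows / translate_s         # matches the reported 3.11
```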

Command used:

```bash
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --single \
  --model nllb-200-distilled-600m \
  --source-lang zh \
  --target-lang en \
  --column title_cn \
  --scene sku_name \
  --batch-size 1 \
  --limit 100
```

Takeaways for online use:

  • Latency measured at batch_size=1 can be treated as single-request latency for the current service path.
  • zh -> en is materially faster than en -> zh on this machine.
  • NLLB is usable for online query translation, but it is not a low-latency model by search-serving standards.

NLLB Resource Reality

The common online claim that this model uses only about 1.25 GB in float16 is best understood as a rough estimate of weight size alone, not of end-to-end runtime memory.

Actual runtime on this machine:

  • loaded on cuda:0
  • actual parameter dtype verified as torch.float16
  • steady GPU memory after load: about 2.6 GiB
  • benchmark peak GPU memory: about 2.8-3.0 GiB

The difference comes from:

  • CUDA context
  • allocator reserved memory
  • runtime activations and temporary tensors
  • batch size
  • input length and generation length
  • framework overhead
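The gap is easy to account for with back-of-the-envelope arithmetic. Assuming roughly 615M parameters for the distilled model (an approximation; the exact count is not in this report), float16 weights alone come to about 1.23 GB, which matches the "1.25GB" claim, while the remaining ~1.4 GiB of observed steady-state memory is the overhead listed above:

```python
# Weight-size estimate vs. observed runtime footprint on this machine.
params = 615_000_000            # approximate parameter count (assumption)
weight_gb = params * 2 / 1e9    # float16 = 2 bytes/param -> ~1.23 GB

observed_gib = 2.6                                  # steady GPU memory after load
overhead_gib = observed_gib - weight_gb * 1e9 / 2**30   # ~1.45 GiB of runtime overhead
```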

Final Takeaways

  1. opus-mt-zh-en remains the fastest model on this machine.
  2. opus-mt-en-zh is slower but still very practical for bulk translation.
  3. nllb-200-distilled-600m is now fully usable on T4 after optimization.
  4. nllb-200-distilled-600m is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.