# Local Translation Model Benchmark Report
Test script: `scripts/benchmark_translation_local_models.py`
Test time: 2026-03-17

Environment:

- GPU: Tesla T4 16GB
- Driver / CUDA: 570.158.01 / 12.8
- Python env: `.venv-translator`
- Dataset: `products_analyzed.csv`
Method:
- `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured production settings.
- `nllb-200-distilled-600m` was benchmarked on a 500-row subset after optimization.
- `nllb-200-distilled-600m` was also benchmarked with `batch_size=1` on a 100-row subset to approximate online query translation latency.
- This report keeps only the final optimized results and the final deployment recommendation.
- Quality was intentionally not evaluated; this is a performance-only report.
## Final Production-Like Config
For `nllb-200-distilled-600m`, the final recommended config on Tesla T4 is:

```yaml
nllb-200-distilled-600m:
  enabled: true
  backend: "local_nllb"
  model_id: "facebook/nllb-200-distilled-600M"
  model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
  device: "cuda"
  torch_dtype: "float16"
  batch_size: 16
  max_input_length: 256
  max_new_tokens: 64
  num_beams: 1
  attn_implementation: "sdpa"
```
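These settings typically map onto Hugging Face `from_pretrained` / `tokenizer` / `generate` keyword arguments. A minimal sketch of that mapping, assuming a config dict shaped like the YAML above (`cfg`, `load_kwargs`, and the other names here are illustrative, not the benchmark script's actual code):

```python
def load_kwargs(cfg: dict) -> dict:
    # kwargs for AutoModelForSeq2SeqLM.from_pretrained(...)
    return {
        "torch_dtype": cfg["torch_dtype"],                  # "float16" on T4
        "attn_implementation": cfg["attn_implementation"],  # "sdpa"
    }

def tokenize_kwargs(cfg: dict) -> dict:
    # kwargs for tokenizer(batch, ...): truncate inputs to max_input_length
    return {"truncation": True, "max_length": cfg["max_input_length"]}

def generate_kwargs(cfg: dict) -> dict:
    # kwargs for model.generate(...): short outputs, greedy decoding
    return {"max_new_tokens": cfg["max_new_tokens"], "num_beams": cfg["num_beams"]}

cfg = {
    "torch_dtype": "float16",
    "attn_implementation": "sdpa",
    "max_input_length": 256,
    "max_new_tokens": 64,
    "num_beams": 1,
}
print(generate_kwargs(cfg))  # {'max_new_tokens': 64, 'num_beams': 1}
```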
What actually helped:

- `cuda` + `float16`
- `batch_size=16`
- `max_new_tokens=64`
- `attn_implementation=sdpa`

What did not become the final recommendation:

- `batch_size=32`: throughput can improve further, but tail latency degrades too much for a balanced default.
## Final Results
| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
|---|---|---|---|---|---|---|---|---|---|
| opus-mt-zh-en | zh -> en | cuda | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
| opus-mt-en-zh | en -> zh | cuda | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
| nllb-200-distilled-600m | zh -> en | cuda | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
| nllb-200-distilled-600m | en -> zh | cuda | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
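The derived columns are internally consistent: Items/s is Rows divided by Translate s, and Avg item ms is its reciprocal in milliseconds. A quick sanity check in plain Python (numbers copied from the table):

```python
# (rows, translate_s) per configuration, copied from the results table
runs = {
    "opus-mt-zh-en": (18576, 497.7513),
    "opus-mt-en-zh": (18576, 987.3994),
    "nllb zh->en":   (500, 25.9577),
    "nllb en->zh":   (500, 42.0405),
}

for name, (n, total_s) in runs.items():
    items_per_s = n / total_s            # should match the Items/s column
    avg_item_ms = 1000 * total_s / n     # should match the Avg item ms column
    print(f"{name}: {items_per_s:.2f} items/s, {avg_item_ms:.3f} ms/item")
```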
## Single-Request Latency
To model online search query translation, we reran NLLB with `batch_size=1`. In this mode, batch latency equals request latency.
| Model | Direction | Rows | Load s | Translate s | Avg req ms | Req p50 ms | Req p95 ms | Req max ms | Items/s |
|---|---|---|---|---|---|---|---|---|---|
| nllb-200-distilled-600m | zh -> en | 100 | 6.8390 | 32.1909 | 321.909 | 292.54 | 624.12 | 819.67 | 3.11 |
| nllb-200-distilled-600m | en -> zh | 100 | 6.8249 | 54.2470 | 542.470 | 481.61 | 1171.71 | 1751.85 | 1.84 |
Command used:

```bash
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --single \
  --model nllb-200-distilled-600m \
  --source-lang zh \
  --target-lang en \
  --column title_cn \
  --scene sku_name \
  --batch-size 1 \
  --limit 100
```
Takeaways for online use:

- `batch_size=1` can be treated as single-request latency for the current service path.
- `zh -> en` is materially faster than `en -> zh` on this machine.
- NLLB is usable for online query translation, but it is not a low-latency model by search-serving standards.
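For reference, the Req p50/p95 columns are plain percentiles over the per-request timings. A minimal sketch using linear interpolation (one common convention; whether the benchmark script uses exactly this method is an assumption), run on made-up latencies rather than the real measurements:

```python
def pctl(samples, q):
    """Percentile via linear interpolation between closest ranks
    (an assumed convention, not verified against the script)."""
    s = sorted(samples)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q       # fractional rank of the q-quantile
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Illustrative per-request latencies in ms (not the real data)
lat_ms = [250, 280, 292, 300, 310, 330, 480, 620, 700, 820]
print(f"p50={pctl(lat_ms, 0.50):.1f} ms, p95={pctl(lat_ms, 0.95):.1f} ms")
```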
## NLLB Resource Reality
The common online claim that this model uses only about 1.25 GB in float16 is best understood as a rough estimate of the weight size alone, not of end-to-end runtime memory.
Actual runtime on this machine:

- loaded on `cuda:0`
- actual parameter dtype verified as `torch.float16`
- steady GPU memory after load: about 2.6 GiB
- benchmark peak GPU memory: about 2.8-3.0 GiB
The difference comes from:
- CUDA context
- allocator reserved memory
- runtime activations and temporary tensors
- batch size
- input length and generation length
- framework overhead
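A back-of-the-envelope check, using only the parameter count implied by the model name: ~600M parameters at 2 bytes each in float16 is ~1.12 GiB, so the gap to the observed ~2.6 GiB steady state is roughly the overhead listed above:

```python
params = 600e6             # ~600M parameters, per the model name
bytes_per_param = 2        # float16
weights_gib = params * bytes_per_param / 2**30
print(f"float16 weights: ~{weights_gib:.2f} GiB")  # ~1.12 GiB

observed_steady_gib = 2.6  # measured steady GPU memory after load
overhead_gib = observed_steady_gib - weights_gib
print(f"non-weight overhead after load: ~{overhead_gib:.2f} GiB")
```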
## Final Takeaways

- `opus-mt-zh-en` remains the fastest model on this machine.
- `opus-mt-en-zh` is slower but still very practical for bulk translation.
- `nllb-200-distilled-600m` is now fully usable on T4 after optimization.
- `nllb` is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.