# Local Translation Model Benchmark Report
Test script: `scripts/benchmark_translation_local_models.py`
Test time: 2026-03-17

Environment:

- GPU: Tesla T4 16GB
- Driver / CUDA: 570.158.01 / 12.8
- Python env: `.venv-translator`
- Dataset: `products_analyzed.csv`
Method:
- `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured production settings.
- `nllb-200-distilled-600m` was benchmarked on a 500-row subset after optimization.
- `nllb-200-distilled-600m` was also benchmarked with `batch_size=1` on a 100-row subset to approximate online query translation latency.
- This report keeps only the final optimized results and the final deployment recommendation.
- Quality was intentionally not evaluated; this is a performance-only report.
## Final Production-Like Config
For `nllb-200-distilled-600m`, the final recommended config on Tesla T4 is:

```yaml
nllb-200-distilled-600m:
  enabled: true
  backend: "local_nllb"
  model_id: "facebook/nllb-200-distilled-600M"
  model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
  device: "cuda"
  torch_dtype: "float16"
  batch_size: 16
  max_input_length: 256
  max_new_tokens: 64
  num_beams: 1
  attn_implementation: "sdpa"
```
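These settings typically map onto Hugging Face `from_pretrained` / `tokenizer` / `generate` keyword arguments. A minimal sketch of that mapping, assuming a config dict shaped like the YAML above (`cfg`, `load_kwargs`, and the other names here are illustrative, not the benchmark script's actual code):

```python
def load_kwargs(cfg: dict) -> dict:
    # kwargs for AutoModelForSeq2SeqLM.from_pretrained(...)
    return {
        "torch_dtype": cfg["torch_dtype"],                  # "float16" on T4
        "attn_implementation": cfg["attn_implementation"],  # "sdpa"
    }

def tokenize_kwargs(cfg: dict) -> dict:
    # kwargs for tokenizer(batch, ...): truncate inputs to max_input_length
    return {"truncation": True, "max_length": cfg["max_input_length"]}

def generate_kwargs(cfg: dict) -> dict:
    # kwargs for model.generate(...): short outputs, greedy decoding
    return {"max_new_tokens": cfg["max_new_tokens"], "num_beams": cfg["num_beams"]}

cfg = {
    "torch_dtype": "float16",
    "attn_implementation": "sdpa",
    "max_input_length": 256,
    "max_new_tokens": 64,
    "num_beams": 1,
}
print(generate_kwargs(cfg))  # {'max_new_tokens': 64, 'num_beams': 1}
```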
What actually helped:

- `cuda` + `float16`
- `batch_size=16`
- `max_new_tokens=64`
- `attn_implementation=sdpa`

What did not become the final recommendation:

- `batch_size=32`: throughput can improve further, but tail latency degrades too much for a balanced default.
## Final Results
| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
|---|---|---|---|---|---|---|---|---|---|
| opus-mt-zh-en | zh -> en | cuda | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
| opus-mt-en-zh | en -> zh | cuda | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
| nllb-200-distilled-600m | zh -> en | cuda | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
| nllb-200-distilled-600m | en -> zh | cuda | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
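The derived columns are internally consistent: Items/s is Rows divided by Translate s, and Avg item ms is its reciprocal in milliseconds. A quick sanity check in plain Python (numbers copied from the table):

```python
# (rows, translate_s) per configuration, copied from the results table
runs = {
    "opus-mt-zh-en": (18576, 497.7513),
    "opus-mt-en-zh": (18576, 987.3994),
    "nllb zh->en":   (500, 25.9577),
    "nllb en->zh":   (500, 42.0405),
}

for name, (n, total_s) in runs.items():
    items_per_s = n / total_s            # should match the Items/s column
    avg_item_ms = 1000 * total_s / n     # should match the Avg item ms column
    print(f"{name}: {items_per_s:.2f} items/s, {avg_item_ms:.3f} ms/item")
```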
## Single-Request Latency
To model online search query translation, we reran NLLB with `batch_size=1`. In this mode, batch latency equals request latency.
| Model | Direction | Rows | Load s | Translate s | Avg req ms | Req p50 ms | Req p95 ms | Req max ms | Items/s |
|---|---|---|---|---|---|---|---|---|---|
| nllb-200-distilled-600m | zh -> en | 100 | 6.8390 | 32.1909 | 321.909 | 292.54 | 624.12 | 819.67 | 3.11 |
| nllb-200-distilled-600m | en -> zh | 100 | 6.8249 | 54.2470 | 542.470 | 481.61 | 1171.71 | 1751.85 | 1.84 |
Command used:

```bash
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --single \
  --model nllb-200-distilled-600m \
  --source-lang zh \
  --target-lang en \
  --column title_cn \
  --scene sku_name \
  --batch-size 1 \
  --limit 100
```
Takeaways for online use:

- `batch_size=1` can be treated as single-request latency for the current service path.
- `zh -> en` is materially faster than `en -> zh` on this machine.
- NLLB is usable for online query translation, but it is not a low-latency model by search-serving standards.
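For reference, the Req p50/p95 columns are plain percentiles over the per-request timings. A minimal sketch using linear interpolation (one common convention; whether the benchmark script uses exactly this method is an assumption), run on made-up latencies rather than the real measurements:

```python
def pctl(samples, q):
    """Percentile via linear interpolation between closest ranks
    (an assumed convention, not verified against the script)."""
    s = sorted(samples)
    if len(s) == 1:
        return s[0]
    pos = (len(s) - 1) * q       # fractional rank of the q-quantile
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

# Illustrative per-request latencies in ms (not the real data)
lat_ms = [250, 280, 292, 300, 310, 330, 480, 620, 700, 820]
print(f"p50={pctl(lat_ms, 0.50):.1f} ms, p95={pctl(lat_ms, 0.95):.1f} ms")
```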
## NLLB Resource Reality
The common online claim that this model uses only about 1.25 GB in float16 is best understood as a rough estimate of the weight size alone, not of end-to-end runtime memory.
Actual runtime on this machine:

- loaded on `cuda:0`
- actual parameter dtype verified as `torch.float16`
- steady GPU memory after load: about 2.6 GiB
- benchmark peak GPU memory: about 2.8-3.0 GiB
The difference comes from:
- CUDA context
- allocator reserved memory
- runtime activations and temporary tensors
- batch size
- input length and generation length
- framework overhead
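A back-of-the-envelope check, using only the parameter count implied by the model name: ~600M parameters at 2 bytes each in float16 is ~1.12 GiB, so the gap to the observed ~2.6 GiB steady state is roughly the overhead listed above:

```python
params = 600e6             # ~600M parameters, per the model name
bytes_per_param = 2        # float16
weights_gib = params * bytes_per_param / 2**30
print(f"float16 weights: ~{weights_gib:.2f} GiB")  # ~1.12 GiB

observed_steady_gib = 2.6  # measured steady GPU memory after load
overhead_gib = observed_steady_gib - weights_gib
print(f"non-weight overhead after load: ~{overhead_gib:.2f} GiB")
```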
## Final Takeaways

- `opus-mt-zh-en` remains the fastest model on this machine.
- `opus-mt-en-zh` is slower but still very practical for bulk translation.
- `nllb-200-distilled-600m` is now fully usable on T4 after optimization.
- `nllb` is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.