# Local Translation Model Benchmark Report

Test script: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
Test time: `2026-03-17`

Environment:

- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)

Method:

- `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured production settings.
- `nllb-200-distilled-600m` was benchmarked on a `500`-row subset after optimization.
- `nllb-200-distilled-600m` was also benchmarked with `batch_size=1` on a `100`-row subset to approximate online query translation latency.
- This report keeps only the final optimized results and the final deployment recommendation.
- Quality was intentionally not evaluated; this is a performance-only report.

## Final Production-Like Config

For `nllb-200-distilled-600m`, the final recommended config on `Tesla T4` is:

```yaml
nllb-200-distilled-600m:
  enabled: true
  backend: "local_nllb"
  model_id: "facebook/nllb-200-distilled-600M"
  model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
  device: "cuda"
  torch_dtype: "float16"
  batch_size: 16
  max_input_length: 256
  max_new_tokens: 64
  num_beams: 1
  attn_implementation: "sdpa"
```

What actually helped:

- `cuda + float16`
- `batch_size=16`
- `max_new_tokens=64`
- `attn_implementation=sdpa`

What did not make the final recommendation:

- `batch_size=32`: throughput improves further, but tail latency degrades too much for a balanced default.
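For orientation, here is a minimal sketch of how the config above maps onto the `transformers` API. The `cfg` dict, helper names, and the FLORES-200 language codes (`zho_Hans`, `eng_Latn`) are illustrative assumptions, not the actual service code; only the `from_pretrained` / `generate` keyword arguments are real library parameters.

```python
# Sketch only: wiring the YAML config above into transformers.
# Helper names and language codes are assumptions, not the service code.

def generation_kwargs(cfg):
    """Per-batch decoding arguments derived from the config."""
    return {
        "max_new_tokens": cfg["max_new_tokens"],  # caps decode length (64)
        "num_beams": cfg["num_beams"],            # 1 => greedy decoding
    }

def load_model(cfg, src_lang="zho_Hans"):
    """Load the NLLB checkpoint with fp16 weights and SDPA attention."""
    import torch  # deferred so the pure helpers above need no GPU deps
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(cfg["model_dir"], src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(
        cfg["model_dir"],
        torch_dtype=torch.float16,                       # halves weight memory
        attn_implementation=cfg["attn_implementation"],  # "sdpa" fused kernels
    ).to(cfg["device"])
    return tokenizer, model

def translate_batch(tokenizer, model, cfg, texts, target_lang="eng_Latn"):
    """Translate one batch of source strings."""
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=cfg["max_input_length"],
    ).to(model.device)
    out = model.generate(
        **inputs,
        # NLLB needs the target language forced as the first decoded token
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        **generation_kwargs(cfg),
    )
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```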
## Final Results

| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
|---|---|---|---:|---:|---:|---:|---:|---:|---:|
| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |

## Single-Request Latency

To model online search query translation, we reran NLLB with `batch_size=1`. In this mode, batch latency is request latency.

| Model | Direction | Rows | Load s | Translate s | Avg req ms | Req p50 ms | Req p95 ms | Req max ms | Items/s |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | 100 | 6.8390 | 32.1909 | 321.909 | 292.54 | 624.12 | 819.67 | 3.11 |
| `nllb-200-distilled-600m` | `en -> zh` | 100 | 6.8249 | 54.2470 | 542.470 | 481.61 | 1171.71 | 1751.85 | 1.84 |

Command used:

```bash
./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
  --single \
  --model nllb-200-distilled-600m \
  --source-lang zh \
  --target-lang en \
  --column title_cn \
  --scene sku_name \
  --batch-size 1 \
  --limit 100
```

Takeaways for online use:

- `batch_size=1` can be treated as single-request latency for the current service path.
- `zh -> en` is materially faster than `en -> zh` on this machine.
- NLLB is usable for online query translation, but it is not a low-latency model by search-serving standards.

## NLLB Resource Reality

The common online claim that this model uses only about `1.25GB` in `float16` is best understood as a rough weight-size estimate, not end-to-end runtime memory.
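A back-of-the-envelope check makes the weight-size reading concrete. The parameter count below is an assumption (roughly the size implied by the "600M" checkpoint name), not something measured in this benchmark:

```python
# fp16 weight footprint vs. observed runtime memory.
params = 615_000_000   # ASSUMPTION: approx. parameter count of the 600M checkpoint
bytes_per_param = 2    # float16

weight_gb = params * bytes_per_param / 1e9    # decimal gigabytes
weight_gib = params * bytes_per_param / 2**30 # binary gibibytes

print(round(weight_gb, 2), round(weight_gib, 2))  # ~1.23 GB, ~1.15 GiB
```

Weights alone land near the commonly quoted `1.25GB`; the gap up to the steady `2.6 GiB` measured after load is everything else listed below.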
Actual runtime on this machine:

- loaded on `cuda:0`
- actual parameter dtype verified as `torch.float16`
- steady GPU memory after load: about `2.6 GiB`
- benchmark peak GPU memory: about `2.8-3.0 GiB`

The difference comes from:

- the CUDA context
- allocator-reserved memory
- runtime activations and temporary tensors
- batch size
- input length and generation length
- framework overhead

## Final Takeaways

1. `opus-mt-zh-en` remains the fastest model on this machine.
2. `opus-mt-en-zh` is slower but still very practical for bulk translation.
3. `nllb-200-distilled-600m` is now fully usable on T4 after optimization.
4. `nllb` is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.
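The speed ranking in the takeaways can be read straight off the tables; as a sanity check, the derived `Items/s` column follows directly from `Rows` and `Translate s`:

```python
# Recompute Items/s from the "Rows" and "Translate s" columns above.
results = {
    "opus-mt-zh-en (zh->en)": (18576, 497.7513),  # table: 37.32
    "opus-mt-en-zh (en->zh)": (18576, 987.3994),  # table: 18.81
    "nllb-600m (zh->en)":     (500, 25.9577),     # table: 19.26
    "nllb-600m (en->zh)":     (500, 42.0405),     # table: 11.89
}
for name, (rows, secs) in results.items():
    print(name, round(rows / secs, 2))
```

All four recomputed values match the tables, and the ordering confirms takeaways 1 and 4.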