- `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured production settings.
- `nllb-200-distilled-600m` was benchmarked on a `500`-row subset after optimization.
- This report keeps only the final optimized results and the final deployment recommendation.
- Quality was intentionally not evaluated; this is a performance-only report.
## Final Production-Like Config
For `nllb-200-distilled-600m`, the final recommended config on `Tesla T4` is:
```yaml
nllb-200-distilled-600m:
  enabled: true
  backend: "local_nllb"
  model_id: "facebook/nllb-200-distilled-600M"
  model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
  device: "cuda"
  torch_dtype: "float16"
  batch_size: 16
  max_input_length: 256
  max_new_tokens: 64
  num_beams: 1
  attn_implementation: "sdpa"
```
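The YAML keys map directly onto load-time and generate-time keyword arguments in a Hugging Face-style pipeline. A minimal sketch of that split, assuming the config is parsed into a plain dict (the `split_config` helper is hypothetical, not from the benchmark code; the actual model load is shown only in comments since it needs the weights and a GPU):

```python
# Config values from the YAML above, flattened into a dict.
CONFIG = {
    "model_id": "facebook/nllb-200-distilled-600M",
    "torch_dtype": "float16",
    "batch_size": 16,
    "max_input_length": 256,
    "max_new_tokens": 64,
    "num_beams": 1,
    "attn_implementation": "sdpa",
}

def split_config(cfg):
    """Split the flat config into load-time and generate-time kwargs."""
    load = {
        "torch_dtype": cfg["torch_dtype"],            # as torch.float16 with torch installed
        "attn_implementation": cfg["attn_implementation"],
    }
    gen = {
        "max_new_tokens": cfg["max_new_tokens"],
        "num_beams": cfg["num_beams"],
    }
    return load, gen

# With transformers installed, usage would look roughly like:
#   model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_id"], **load).to("cuda")
#   out = model.generate(**inputs, **gen)
```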
What actually helped:
- `cuda + float16`
- `batch_size=16`
- `max_new_tokens=64`
- `attn_implementation=sdpa`
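The `batch_size=16` setting amounts to slicing the input corpus into fixed-size chunks before each generation call. A minimal batching helper (the function name is illustrative, not from the benchmark code):

```python
def batches(items, batch_size=16):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# The 500-row subset at batch_size=16 yields 31 full batches plus one batch of 4.
```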
What did not make the final recommendation:
- `batch_size=32`: throughput improves further, but tail latency (batch p95) degrades too much for a balanced default.
## Final Results
| Model | Direction | Device | Rows | Load (s) | Translate (s) | Items/s | Avg item (ms) | Batch p50 (ms) | Batch p95 (ms) |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
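The derived columns are simple functions of the row count and total wall-clock time. A small sketch that reproduces them, checked here against the `opus-mt-zh-en` row of the table:

```python
def derived_metrics(rows, translate_s):
    """Items/s and average per-item latency (ms) from total wall-clock time."""
    items_per_s = rows / translate_s
    avg_item_ms = 1000.0 * translate_s / rows
    return items_per_s, avg_item_ms

items_per_s, avg_item_ms = derived_metrics(18576, 497.7513)
# matches the table row: ~37.32 items/s, ~26.795 ms per item
```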
## NLLB Resource Reality
The common online claim that this model needs only about `1.25GB` in `float16` is best understood as a rough estimate of the weights alone, not of end-to-end runtime memory.
Actual runtime on this machine:
- loaded on `cuda:0`
- actual parameter dtype verified as `torch.float16`
- steady GPU memory after load: about `2.6 GiB`
- benchmark peak GPU memory: about `2.8-3.0 GiB`
The difference comes from:
- CUDA context
- allocator reserved memory
- runtime activations and temporary tensors
- batch size
- input length and generation length
- framework overhead
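The weights-only figure is easy to reproduce: at `float16`, every parameter takes 2 bytes. A back-of-envelope check (the ~615M parameter count is an assumption read off the model name, not a measured value) lands near the quoted `1.25GB` in decimal GB, well below the ~2.6 GiB observed once the overheads above are included:

```python
def fp16_weight_size(num_params):
    """Weights-only footprint at 2 bytes/parameter: (decimal GB, binary GiB)."""
    size_bytes = num_params * 2
    return size_bytes / 1e9, size_bytes / 2**30

gb, gib = fp16_weight_size(615_000_000)  # assumed parameter count
# ~1.23 GB decimal / ~1.15 GiB binary: close to the quoted 1.25GB,
# far from the ~2.6 GiB steady runtime memory measured on this machine
```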
## Final Takeaways
1. `opus-mt-zh-en` remains the fastest model on this machine.
2. `opus-mt-en-zh` is slower but still very practical for bulk translation.
3. `nllb-200-distilled-600m` is now fully usable on T4 after optimization.
4. `nllb` is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.