# Local Translation Model Benchmark Report
- Test script: <code>scripts/benchmark_translation_local_models.py</code>
- Test time: 2026-03-17
- Environment:
  - GPU: Tesla T4 16GB
  - Driver / CUDA: 570.158.01 / 12.8
  - Python env: .venv-translator
- Dataset: <code>products_analyzed.csv</code>
- Rows in dataset: 18,576
## Method

- <code>opus-mt-zh-en</code> and <code>opus-mt-en-zh</code> were benchmarked on the full dataset using their configured runtime settings from <code>config/config.yaml</code>.
- <code>nllb-200-distilled-600m</code> could not complete GPU cold start in the current co-resident environment because GPU memory was already heavily occupied by other long-running services.
- For <code>nllb-200-distilled-600m</code>, I therefore ran CPU baselines on a 128-row sample from the same CSV, using <code>device=cpu</code>, <code>torch_dtype=float32</code>, and <code>batch_size=4</code>, and then estimated full-dataset runtime from measured throughput.
- Quality was intentionally not evaluated; this report is performance-only.
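The full-dataset estimate in the last step is plain proportional scaling from sample throughput. A minimal sketch of the arithmetic (the function name is mine, not from the benchmark script):

```python
def estimate_full_runtime_s(total_rows: int, measured_items_per_s: float) -> float:
    """Scale a measured sample throughput up to the full dataset."""
    return total_rows / measured_items_per_s

# Example: nllb-200-distilled-600m CPU baseline, zh -> en, at 0.97 items/s
print(round(estimate_full_runtime_s(18_576, 0.97), 2))  # -> 19150.52 (seconds, ~5.3 h)
```

This reproduces the 19,150.52 s estimate reported for the zh -> en CPU baseline below.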
Current GPU co-residency at benchmark time:

- <code>text-embeddings-router</code>: about 1.3 GiB
- <code>clip_server</code>: about 2.0 GiB
- <code>VLLM::EngineCore</code>: about 7.2 GiB
- <code>api.translator_app</code> process: about 2.8 GiB
- Total occupied before <code>nllb</code> cold start: about 13.4 / 16 GiB
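A preflight headroom check along these lines would have flagged the problem before the cold-start attempt. This is a sketch using the snapshot figures above; the required-VRAM number is an assumption of mine (600M fp32 parameters are roughly 2.4 GB of weights alone, before activations), not a measured value from this report:

```python
def cuda_headroom_gib(total_gib: float, occupied_gib: float) -> float:
    """Free VRAM left for a new model's cold start."""
    return total_gib - occupied_gib

# Figures from the co-residency snapshot above.
headroom = cuda_headroom_gib(16.0, 13.4)
print(f"{headroom:.1f} GiB free")  # -> 2.6 GiB free

# Assumed fp32 working set for nllb-200-distilled-600m -- an assumption, not measured.
ASSUMED_REQUIRED_GIB = 3.0
print(headroom >= ASSUMED_REQUIRED_GIB)  # -> False
```

On a live node the same decision could be driven by <code>torch.cuda.mem_get_info()</code>, which returns free and total device memory in bytes.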
Operational finding:

- <code>facebook/nllb-200-distilled-600M</code> cannot be reliably loaded on the current shared T4 node together with the existing long-running services above.
- This is not a model-quality issue; it is a deployment-capacity issue.
## Summary
| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Peak GPU GiB | Success |
|---|---|---|---|---|---|---|---|---|---|---|---|
| opus-mt-zh-en | zh -> en | cuda | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 | 0.382 | 1.000000 |
| opus-mt-en-zh | en -> zh | cuda | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 | 0.379 | 0.999569 |
| nllb-200-distilled-600m | zh -> en | cpu | 128 | 4.4589 | 132.3088 | 0.97 | 1033.662 | 3853.39 | 6896.14 | 0.0 | 1.000000 |
| nllb-200-distilled-600m | en -> zh | cpu | 128 | 4.5039 | 317.8845 | 0.40 | 2483.473 | 6138.87 | 35134.11 | 0.0 | 1.000000 |
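As a sanity check, the Items/s column is consistent with Rows / Translate s (assuming throughput was computed over all attempted rows, which the benchmark script does not state explicitly):

```python
# (label, rows, translate_s, reported_items_per_s) taken from the table above
runs = [
    ("opus-mt-zh-en",      18_576, 497.7513, 37.32),
    ("opus-mt-en-zh",      18_576, 987.3994, 18.81),
    ("nllb zh->en (cpu)",     128, 132.3088,  0.97),
    ("nllb en->zh (cpu)",     128, 317.8845,  0.40),
]
for label, rows, secs, reported in runs:
    derived = rows / secs
    print(f"{label}: {derived:.2f} items/s (reported {reported})")
```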
## Detailed Findings
### 1. opus-mt-zh-en

- Full dataset, <code>title_cn -> en</code>, <code>scene=sku_name</code>
- Throughput: 37.32 items/s
- Average per-item latency: 26.795 ms
- Batch latency: p50 301.99 ms, p95 1835.81 ms, max 2181.61 ms
- Input throughput: 1179.47 chars/s
- Peak GPU allocated: 0.382 GiB
- Peak GPU reserved: 0.473 GiB
- Max RSS: 1355.21 MB
- Success count: 18576/18576
Interpretation:

- This was the fastest of the three local models in this benchmark.
- It is a strong candidate for large-scale zh -> en title translation on the current machine.
### 2. opus-mt-en-zh

- Full dataset, <code>title -> zh</code>, <code>scene=sku_name</code>
- Throughput: 18.81 items/s
- Average per-item latency: 53.155 ms
- Batch latency: p50 449.14 ms, p95 2012.12 ms, max 2210.03 ms
- Input throughput: 2081.66 chars/s
- Peak GPU allocated: 0.379 GiB
- Peak GPU reserved: 0.473 GiB
- Max RSS: 1376.72 MB
- Success count: 18568/18576
- Failure count: 8
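The Success ratio for this run in the summary table follows directly from these counts:

```python
successes, total = 18_568, 18_576
print(f"{successes / total:.6f}")  # -> 0.999569, matching the summary table
```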
Interpretation:

- Roughly half the item throughput of <code>opus-mt-zh-en</code>.
- Still practical on this T4 for offline bulk translation.
- The 8 failed items are a runtime-stability signal worth monitoring in production batch jobs, even though quality was not checked here.
### 3. nllb-200-distilled-600m

GPU result in the current shared environment:

- Cold start failed with CUDA OOM before the benchmark could begin.
- Root cause was insufficient free VRAM on the shared T4, not a script error.

CPU baseline, zh -> en:

- Sample size: 128
- Throughput: 0.97 items/s
- Average per-item latency: 1033.662 ms
- Batch latency: p50 3853.39 ms, p95 6896.14 ms, max 8039.91 ms
- Max RSS: 3481.75 MB
- Estimated full-dataset runtime at this throughput: about 19,150.52 s = 319.18 min = 5.32 h
CPU baseline, en -> zh:

- Sample size: 128
- Throughput: 0.40 items/s
- Average per-item latency: 2483.473 ms
- Batch latency: p50 6138.87 ms, p95 35134.11 ms, max 37388.36 ms
- Max RSS: 3483.60 MB
- Estimated full-dataset runtime at this throughput: about 46,440 s = 774 min = 12.9 h
Interpretation:

- In the current node layout, <code>nllb</code> is not a good fit for shared-GPU online service.
- CPU fallback is functionally available but far slower than the Marian models.
- If <code>nllb</code> is still desired, it should be considered for isolated GPU deployment, dedicated batch nodes, or lower-frequency offline tasks.
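If the CPU fallback is kept as an explicit option, the baseline settings could be pinned in <code>config/config.yaml</code> along these lines. The key layout here is a hypothetical sketch, since this report does not show the file's actual schema; only the three values come from the benchmark itself:

```yaml
# Hypothetical layout -- key names are illustrative, not the file's real schema.
translation_models:
  nllb-200-distilled-600m:
    device: cpu           # shared T4 lacks VRAM headroom for a GPU cold start
    torch_dtype: float32  # dtype used for the CPU baseline
    batch_size: 4         # batch size used for the CPU baseline
```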
## Practical Ranking On This Machine

By usable real-world performance on the current node:

1. <code>opus-mt-zh-en</code>
2. <code>opus-mt-en-zh</code>
3. <code>nllb-200-distilled-600m</code>

By deployment friendliness on the current shared T4:

1. <code>opus-mt-zh-en</code>
2. <code>opus-mt-en-zh</code>
3. <code>nllb-200-distilled-600m</code> (it currently cannot cold-start on GPU alongside the existing resident services)