# Local Translation Model Benchmark Report
- Test script: <code>scripts/benchmark_translation_local_models.py</code>
- Test time: 2026-03-17
- Environment:
  - GPU: Tesla T4 16GB
  - Driver / CUDA: 570.158.01 / 12.8
  - Python env: .venv-translator
- Dataset: <code>products_analyzed.csv</code>
- Rows in dataset: 18,576
## Method

- <code>opus-mt-zh-en</code> and <code>opus-mt-en-zh</code> were benchmarked on the full dataset using their configured runtime settings from <code>config/config.yaml</code>.
- <code>nllb-200-distilled-600m</code> could not complete GPU cold start in the current co-resident environment because GPU memory was already heavily occupied by other long-running services.
- For <code>nllb-200-distilled-600m</code>, I therefore ran CPU baselines on a 128-row sample from the same CSV, using <code>device=cpu</code>, <code>torch_dtype=float32</code>, and <code>batch_size=4</code>, and then estimated full-dataset runtime from measured throughput.
- Quality was intentionally not evaluated; this report is performance-only.
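The full-dataset estimate in the last step is plain proportional scaling from sample throughput. A minimal sketch of the arithmetic (the function name is mine, not from the benchmark script):

```python
def estimate_full_runtime_s(total_rows: int, measured_items_per_s: float) -> float:
    """Scale a measured sample throughput up to the full dataset."""
    return total_rows / measured_items_per_s

# Example: nllb-200-distilled-600m CPU baseline, zh -> en, at 0.97 items/s
print(round(estimate_full_runtime_s(18_576, 0.97), 2))  # -> 19150.52 (seconds, ~5.3 h)
```

This reproduces the 19,150.52 s estimate reported for the zh -> en CPU baseline below.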
Current GPU co-residency at benchmark time:

- <code>text-embeddings-router</code>: about 1.3 GiB
- <code>clip_server</code>: about 2.0 GiB
- <code>VLLM::EngineCore</code>: about 7.2 GiB
- <code>api.translator_app</code> process: about 2.8 GiB
- Total occupied before <code>nllb</code> cold start: about 13.4 / 16 GiB
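A preflight headroom check along these lines would have flagged the problem before the cold-start attempt. This is a sketch using the snapshot figures above; the required-VRAM number is an assumption of mine (600M fp32 parameters are roughly 2.4 GB of weights alone, before activations), not a measured value from this report:

```python
def cuda_headroom_gib(total_gib: float, occupied_gib: float) -> float:
    """Free VRAM left for a new model's cold start."""
    return total_gib - occupied_gib

# Figures from the co-residency snapshot above.
headroom = cuda_headroom_gib(16.0, 13.4)
print(f"{headroom:.1f} GiB free")  # -> 2.6 GiB free

# Assumed fp32 working set for nllb-200-distilled-600m -- an assumption, not measured.
ASSUMED_REQUIRED_GIB = 3.0
print(headroom >= ASSUMED_REQUIRED_GIB)  # -> False
```

On a live node the same decision could be driven by <code>torch.cuda.mem_get_info()</code>, which returns free and total device memory in bytes.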
Operational finding:

- <code>facebook/nllb-200-distilled-600M</code> cannot be reliably loaded on the current shared T4 node together with the existing long-running services above.
- This is not a model-quality issue; it is a deployment-capacity issue.
## Summary
| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Peak GPU GiB | Success |
|---|---|---|---|---|---|---|---|---|---|---|---|
| opus-mt-zh-en | zh -> en | cuda | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 | 0.382 | 1.000000 |
| opus-mt-en-zh | en -> zh | cuda | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 | 0.379 | 0.999569 |
| nllb-200-distilled-600m | zh -> en | cpu | 128 | 4.4589 | 132.3088 | 0.97 | 1033.662 | 3853.39 | 6896.14 | 0.0 | 1.000000 |
| nllb-200-distilled-600m | en -> zh | cpu | 128 | 4.5039 | 317.8845 | 0.40 | 2483.473 | 6138.87 | 35134.11 | 0.0 | 1.000000 |
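As a sanity check, the Items/s column is consistent with Rows / Translate s (assuming throughput was computed over all attempted rows, which the benchmark script does not state explicitly):

```python
# (label, rows, translate_s, reported_items_per_s) taken from the table above
runs = [
    ("opus-mt-zh-en",      18_576, 497.7513, 37.32),
    ("opus-mt-en-zh",      18_576, 987.3994, 18.81),
    ("nllb zh->en (cpu)",     128, 132.3088,  0.97),
    ("nllb en->zh (cpu)",     128, 317.8845,  0.40),
]
for label, rows, secs, reported in runs:
    derived = rows / secs
    print(f"{label}: {derived:.2f} items/s (reported {reported})")
```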
## Detailed Findings
### 1. opus-mt-zh-en

- Full dataset, <code>title_cn -> en</code>, <code>scene=sku_name</code>
- Throughput: 37.32 items/s
- Average per-item latency: 26.795 ms
- Batch latency: p50 301.99 ms, p95 1835.81 ms, max 2181.61 ms
- Input throughput: 1179.47 chars/s
- Peak GPU allocated: 0.382 GiB
- Peak GPU reserved: 0.473 GiB
- Max RSS: 1355.21 MB
- Success count: 18576/18576
Interpretation:

- This was the fastest of the three local models in this benchmark.
- It is a strong candidate for large-scale zh -> en title translation on the current machine.
### 2. opus-mt-en-zh

- Full dataset, <code>title -> zh</code>, <code>scene=sku_name</code>
- Throughput: 18.81 items/s
- Average per-item latency: 53.155 ms
- Batch latency: p50 449.14 ms, p95 2012.12 ms, max 2210.03 ms
- Input throughput: 2081.66 chars/s
- Peak GPU allocated: 0.379 GiB
- Peak GPU reserved: 0.473 GiB
- Max RSS: 1376.72 MB
- Success count: 18568/18576
- Failure count: 8
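The Success ratio for this run in the summary table follows directly from these counts:

```python
successes, total = 18_568, 18_576
print(f"{successes / total:.6f}")  # -> 0.999569, matching the summary table
```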
Interpretation:

- Roughly half the item throughput of <code>opus-mt-zh-en</code>.
- Still practical on this T4 for offline bulk translation.
- The 8 failed items are a runtime-stability signal worth monitoring in production batch jobs, even though quality was not checked here.
### 3. nllb-200-distilled-600m

GPU result in the current shared environment:

- Cold start failed with CUDA OOM before the benchmark could begin.
- Root cause was insufficient free VRAM on the shared T4, not a script error.

CPU baseline, zh -> en:

- Sample size: 128
- Throughput: 0.97 items/s
- Average per-item latency: 1033.662 ms
- Batch latency: p50 3853.39 ms, p95 6896.14 ms, max 8039.91 ms
- Max RSS: 3481.75 MB
- Estimated full-dataset runtime at this throughput: about 19,150.52 s = 319.18 min = 5.32 h
CPU baseline, en -> zh:

- Sample size: 128
- Throughput: 0.40 items/s
- Average per-item latency: 2483.473 ms
- Batch latency: p50 6138.87 ms, p95 35134.11 ms, max 37388.36 ms
- Max RSS: 3483.60 MB
- Estimated full-dataset runtime at this throughput: about 46,440 s = 774 min = 12.9 h
Interpretation:

- In the current node layout, <code>nllb</code> is not a good fit for shared-GPU online service.
- CPU fallback is functionally available but far slower than the Marian models.
- If <code>nllb</code> is still desired, it should be considered for isolated GPU deployment, dedicated batch nodes, or lower-frequency offline tasks.
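If the CPU fallback is kept as an explicit option, the baseline settings could be pinned in <code>config/config.yaml</code> along these lines. The key layout here is a hypothetical sketch, since this report does not show the file's actual schema; only the three values come from the benchmark itself:

```yaml
# Hypothetical layout -- key names are illustrative, not the file's real schema.
translation_models:
  nllb-200-distilled-600m:
    device: cpu           # shared T4 lacks VRAM headroom for a GPU cold start
    torch_dtype: float32  # dtype used for the CPU baseline
    batch_size: 4         # batch size used for the CPU baseline
```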
## Practical Ranking On This Machine

By usable real-world performance on the current node:

1. <code>opus-mt-zh-en</code>
2. <code>opus-mt-en-zh</code>
3. <code>nllb-200-distilled-600m</code>

By deployment friendliness on the current shared T4:

1. <code>opus-mt-zh-en</code>
2. <code>opus-mt-en-zh</code>
3. <code>nllb-200-distilled-600m</code> (it currently cannot cold-start on GPU alongside the existing resident services)