# Local Translation Model Benchmark Report

Test script: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)

Test time: `2026-03-17`

Environment:

- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
- Rows in dataset: `18,576`

Method:

- `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured runtime settings from [`config/config.yaml`](/data/saas-search/config/config.yaml).
- `nllb-200-distilled-600m` could not complete GPU cold start in the current co-resident environment because GPU memory was already heavily occupied by other long-running services.
- For `nllb-200-distilled-600m`, I therefore ran CPU baselines on a `128`-row sample from the same CSV, using `device=cpu`, `torch_dtype=float32`, and `batch_size=4`, and then estimated full-dataset runtime from the measured throughput.
- Quality was intentionally not evaluated; this report is performance-only.

GPU co-residency at benchmark time:

- `text-embeddings-router`: about `1.3 GiB`
- `clip_server`: about `2.0 GiB`
- `VLLM::EngineCore`: about `7.2 GiB`
- `api.translator_app` process: about `2.8 GiB`
- Total occupied before `nllb` cold start: about `13.4 / 16 GiB`

Operational finding:

- `facebook/nllb-200-distilled-600M` cannot be reliably loaded on the current shared T4 node alongside the existing long-running services listed above.
- This is a deployment-capacity issue, not a model-quality issue.
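This class of cold-start failure can be caught before loading with a simple capacity pre-check against the resident footprints. The sketch below is illustrative and not part of the benchmark script; the `required_gib` figure for `nllb-200-distilled-600m` (roughly `2.4 GiB` of fp32 weights plus safety headroom) is an assumed estimate, not a measured value.

```python
# Hypothetical pre-flight check: will a model fit on the shared GPU?
# The function name and headroom default are illustration-only assumptions;
# the resident footprints are the ones observed in this report.

def can_cold_start(total_gib: float, resident_gib: list[float],
                   required_gib: float, headroom_gib: float = 1.0) -> bool:
    """Return True if the model plus safety headroom fits in free VRAM."""
    free_gib = total_gib - sum(resident_gib)
    return free_gib >= required_gib + headroom_gib

# Observed co-residency on the T4 at benchmark time (GiB):
# text-embeddings-router, clip_server, vLLM engine, translator app
residents = [1.3, 2.0, 7.2, 2.8]

# ~2.4 GiB of fp32 weights for a 600M-param model is an assumed estimate
print(can_cold_start(16.0, residents, required_gib=2.4))  # False: only ~2.7 GiB free
```

On a live node, `torch.cuda.mem_get_info()` returns the actual `(free, total)` byte counts and can replace the hard-coded figures.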
## Summary

| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Peak GPU GiB | Success |
|---|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 | 0.382 | 1.000000 |
| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 | 0.379 | 0.999569 |
| `nllb-200-distilled-600m` | `zh -> en` | `cpu` | 128 | 4.4589 | 132.3088 | 0.97 | 1033.662 | 3853.39 | 6896.14 | 0.0 | 1.000000 |
| `nllb-200-distilled-600m` | `en -> zh` | `cpu` | 128 | 4.5039 | 317.8845 | 0.40 | 2483.473 | 6138.87 | 35134.11 | 0.0 | 1.000000 |

## Detailed Findings

### 1. `opus-mt-zh-en`

- Full dataset, `title_cn -> en`, scene=`sku_name`
- Throughput: `37.32 items/s`
- Average per-item latency: `26.795 ms`
- Batch latency: `p50 301.99 ms`, `p95 1835.81 ms`, `max 2181.61 ms`
- Input throughput: `1179.47 chars/s`
- Peak GPU allocated: `0.382 GiB`
- Peak GPU reserved: `0.473 GiB`
- Max RSS: `1355.21 MB`
- Success count: `18576/18576`

Interpretation:

- This was the fastest of the three new local models in this benchmark.
- It is a strong candidate for large-scale `zh -> en` title translation on the current machine.

### 2. `opus-mt-en-zh`

- Full dataset, `title -> zh`, scene=`sku_name`
- Throughput: `18.81 items/s`
- Average per-item latency: `53.155 ms`
- Batch latency: `p50 449.14 ms`, `p95 2012.12 ms`, `max 2210.03 ms`
- Input throughput: `2081.66 chars/s`
- Peak GPU allocated: `0.379 GiB`
- Peak GPU reserved: `0.473 GiB`
- Max RSS: `1376.72 MB`
- Success count: `18568/18576`
- Failure count: `8`

Interpretation:

- Roughly half the item throughput of `opus-mt-zh-en`.
- Still practical on this T4 for offline bulk translation.
- The `8` failed items are a runtime-stability signal worth watching in production batch jobs, even though quality was not checked here.

### 3. `nllb-200-distilled-600m`

GPU result in the current shared environment:

- Cold start failed with CUDA OOM before the benchmark could begin.
- Root cause was insufficient free VRAM on the shared T4, not a script error.

CPU baseline, `zh -> en`:

- Sample size: `128`
- Throughput: `0.97 items/s`
- Average per-item latency: `1033.662 ms`
- Batch latency: `p50 3853.39 ms`, `p95 6896.14 ms`, `max 8039.91 ms`
- Max RSS: `3481.75 MB`
- Estimated full-dataset runtime at this throughput: about `19,150.52 s` = `319.18 min` = `5.32 h`

CPU baseline, `en -> zh`:

- Sample size: `128`
- Throughput: `0.40 items/s`
- Average per-item latency: `2483.473 ms`
- Batch latency: `p50 6138.87 ms`, `p95 35134.11 ms`, `max 37388.36 ms`
- Max RSS: `3483.60 MB`
- Estimated full-dataset runtime at this throughput: about `46,440 s` = `774 min` = `12.9 h`

Interpretation:

- In the current node layout, `nllb` is not a good fit for shared-GPU online serving.
- CPU fallback is functionally available but far slower than the Marian models.
- If `nllb` is still desired, it should be considered for isolated GPU deployment, dedicated batch nodes, or lower-frequency offline tasks.

## Practical Ranking On This Machine

By usable real-world performance on the current node:

1. `opus-mt-zh-en`
2. `opus-mt-en-zh`
3. `nllb-200-distilled-600m`

By deployment friendliness on the current shared T4:

1. `opus-mt-zh-en`
2. `opus-mt-en-zh`
3. `nllb-200-distilled-600m` (currently cannot cold-start on GPU alongside the existing resident services)
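As a sanity check, the full-dataset estimates quoted for the `nllb` CPU baselines follow directly from the measured throughput with a one-line extrapolation. This sketch reproduces the arithmetic using the figures reported above; it is not the benchmark script's actual code.

```python
# Extrapolate full-dataset runtime from a measured sample throughput.
# Throughput figures are the nllb-200-distilled-600m CPU baselines above.

TOTAL_ROWS = 18_576  # rows in products_analyzed.csv

def estimate_runtime_s(total_rows: int, items_per_s: float) -> float:
    """Seconds to process total_rows at a fixed measured throughput."""
    return total_rows / items_per_s

for direction, throughput in [("zh -> en", 0.97), ("en -> zh", 0.40)]:
    total_s = estimate_runtime_s(TOTAL_ROWS, throughput)
    print(f"{direction}: ~{total_s:,.2f} s "
          f"= {total_s / 60:.2f} min = {total_s / 3600:.2f} h")
```

Running this reproduces the `~19,150.52 s` (`5.32 h`) and `~46,440 s` (`12.9 h`) estimates in the report; note the extrapolation assumes throughput stays constant at scale, which long-tail inputs (see the `en -> zh` batch `p95` of `35134.11 ms`) may violate.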