# Local Translation Model Benchmark Report

Test script: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)

Test time: `2026-03-17`

Environment:

- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
- Rows in dataset: `18,576`

Method:

- `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured runtime settings from [`config/config.yaml`](/data/saas-search/config/config.yaml).
- `nllb-200-distilled-600m` could not complete GPU cold start in the current co-resident environment because GPU memory was already heavily occupied by other long-running services.
- For `nllb-200-distilled-600m`, I therefore ran CPU baselines on a `128`-row sample from the same CSV, using `device=cpu`, `torch_dtype=float32`, and `batch_size=4`, and then estimated full-dataset runtime from the measured throughput.
- Quality was intentionally not evaluated; this report is performance-only.

GPU co-residency at benchmark time:

- `text-embeddings-router`: about `1.3 GiB`
- `clip_server`: about `2.0 GiB`
- `VLLM::EngineCore`: about `7.2 GiB`
- `api.translator_app` process: about `2.8 GiB`
- Total occupied before `nllb` cold start: about `13.4 / 16 GiB`

Operational finding:

- `facebook/nllb-200-distilled-600M` cannot be reliably loaded on the current shared T4 node alongside the existing long-running services listed above.
- This is a deployment-capacity issue, not a model-quality issue.
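This class of cold-start failure can be caught before loading with a simple capacity pre-check against the resident footprints. The sketch below is illustrative and not part of the benchmark script; the `required_gib` figure for `nllb-200-distilled-600m` (roughly `2.4 GiB` of fp32 weights plus safety headroom) is an assumed estimate, not a measured value.

```python
# Hypothetical pre-flight check: will a model fit on the shared GPU?
# The function name and headroom default are illustration-only assumptions;
# the resident footprints are the ones observed in this report.

def can_cold_start(total_gib: float, resident_gib: list[float],
                   required_gib: float, headroom_gib: float = 1.0) -> bool:
    """Return True if the model plus safety headroom fits in free VRAM."""
    free_gib = total_gib - sum(resident_gib)
    return free_gib >= required_gib + headroom_gib

# Observed co-residency on the T4 at benchmark time (GiB):
# text-embeddings-router, clip_server, vLLM engine, translator app
residents = [1.3, 2.0, 7.2, 2.8]

# ~2.4 GiB of fp32 weights for a 600M-param model is an assumed estimate
print(can_cold_start(16.0, residents, required_gib=2.4))  # False: only ~2.7 GiB free
```

On a live node, `torch.cuda.mem_get_info()` returns the actual `(free, total)` byte counts and can replace the hard-coded figures.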
## Summary

| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms | Peak GPU GiB | Success |
|---|---|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 | 0.382 | 1.000000 |
| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 | 0.379 | 0.999569 |
| `nllb-200-distilled-600m` | `zh -> en` | `cpu` | 128 | 4.4589 | 132.3088 | 0.97 | 1033.662 | 3853.39 | 6896.14 | 0.0 | 1.000000 |
| `nllb-200-distilled-600m` | `en -> zh` | `cpu` | 128 | 4.5039 | 317.8845 | 0.40 | 2483.473 | 6138.87 | 35134.11 | 0.0 | 1.000000 |

## Detailed Findings

### 1. `opus-mt-zh-en`

- Full dataset, `title_cn -> en`, scene=`sku_name`
- Throughput: `37.32 items/s`
- Average per-item latency: `26.795 ms`
- Batch latency: `p50 301.99 ms`, `p95 1835.81 ms`, `max 2181.61 ms`
- Input throughput: `1179.47 chars/s`
- Peak GPU allocated: `0.382 GiB`
- Peak GPU reserved: `0.473 GiB`
- Max RSS: `1355.21 MB`
- Success count: `18576/18576`

Interpretation:

- This was the fastest of the three new local models in this benchmark.
- It is a strong candidate for large-scale `zh -> en` title translation on the current machine.

### 2. `opus-mt-en-zh`

- Full dataset, `title -> zh`, scene=`sku_name`
- Throughput: `18.81 items/s`
- Average per-item latency: `53.155 ms`
- Batch latency: `p50 449.14 ms`, `p95 2012.12 ms`, `max 2210.03 ms`
- Input throughput: `2081.66 chars/s`
- Peak GPU allocated: `0.379 GiB`
- Peak GPU reserved: `0.473 GiB`
- Max RSS: `1376.72 MB`
- Success count: `18568/18576`
- Failure count: `8`

Interpretation:

- Roughly half the item throughput of `opus-mt-zh-en`.
- Still practical on this T4 for offline bulk translation.
- The `8` failed items are a runtime-stability signal worth watching in production batch jobs, even though quality was not checked here.

### 3. `nllb-200-distilled-600m`

GPU result in the current shared environment:

- Cold start failed with CUDA OOM before the benchmark could begin.
- Root cause was insufficient free VRAM on the shared T4, not a script error.

CPU baseline, `zh -> en`:

- Sample size: `128`
- Throughput: `0.97 items/s`
- Average per-item latency: `1033.662 ms`
- Batch latency: `p50 3853.39 ms`, `p95 6896.14 ms`, `max 8039.91 ms`
- Max RSS: `3481.75 MB`
- Estimated full-dataset runtime at this throughput: about `19,150.52 s` = `319.18 min` = `5.32 h`

CPU baseline, `en -> zh`:

- Sample size: `128`
- Throughput: `0.40 items/s`
- Average per-item latency: `2483.473 ms`
- Batch latency: `p50 6138.87 ms`, `p95 35134.11 ms`, `max 37388.36 ms`
- Max RSS: `3483.60 MB`
- Estimated full-dataset runtime at this throughput: about `46,440 s` = `774 min` = `12.9 h`

Interpretation:

- In the current node layout, `nllb` is not a good fit for shared-GPU online serving.
- CPU fallback is functionally available but far slower than the Marian models.
- If `nllb` is still desired, it should be considered for isolated GPU deployment, dedicated batch nodes, or lower-frequency offline tasks.

## Practical Ranking On This Machine

By usable real-world performance on the current node:

1. `opus-mt-zh-en`
2. `opus-mt-en-zh`
3. `nllb-200-distilled-600m`

By deployment friendliness on the current shared T4:

1. `opus-mt-zh-en`
2. `opus-mt-en-zh`
3. `nllb-200-distilled-600m` (currently cannot cold-start on GPU alongside the existing resident services)
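As a sanity check, the full-dataset estimates quoted for the `nllb` CPU baselines follow directly from the measured throughput with a one-line extrapolation. This sketch reproduces the arithmetic using the figures reported above; it is not the benchmark script's actual code.

```python
# Extrapolate full-dataset runtime from a measured sample throughput.
# Throughput figures are the nllb-200-distilled-600m CPU baselines above.

TOTAL_ROWS = 18_576  # rows in products_analyzed.csv

def estimate_runtime_s(total_rows: int, items_per_s: float) -> float:
    """Seconds to process total_rows at a fixed measured throughput."""
    return total_rows / items_per_s

for direction, throughput in [("zh -> en", 0.97), ("en -> zh", 0.40)]:
    total_s = estimate_runtime_s(TOTAL_ROWS, throughput)
    print(f"{direction}: ~{total_s:,.2f} s "
          f"= {total_s / 60:.2f} min = {total_s / 3600:.2f} h")
```

Running this reproduces the `~19,150.52 s` (`5.32 h`) and `~46,440 s` (`12.9 h`) estimates in the report; note the extrapolation assumes throughput stays constant at scale, which long-tail inputs (see the `en -> zh` batch `p95` of `35134.11 ms`) may violate.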