# Local Translation Model Benchmark Report
Test script: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)

Test time: `2026-03-17`

Environment:
- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
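
As a quick sanity check, the GPU and CUDA details above can be confirmed from inside `.venv-translator`; a minimal sketch, assuming PyTorch is available in that environment:

```python
import torch

# Minimal environment check run from .venv-translator.
print(torch.cuda.get_device_name(0))        # expected: "Tesla T4"
print(torch.version.cuda)                   # CUDA version PyTorch was built with
                                            # (the driver-reported version is 12.8)
props = torch.cuda.get_device_properties(0)
print(f"{props.total_memory / 1024**3:.1f} GiB total GPU memory")
```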
Method:
- `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured production settings.
- `nllb-200-distilled-600m` was benchmarked on a `500`-row subset after optimization.
- `nllb-200-distilled-600m` was also benchmarked with `batch_size=1` on a `100`-row subset to approximate online query translation latency.
- This report only keeps the final optimized results and final deployment recommendation.
- Quality was intentionally not evaluated; this is a performance-only report.
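
The batch-level latency percentiles in the tables below can be derived by timing each translation batch; a minimal sketch of that kind of measurement, with illustrative helper names rather than the actual script's API:

```python
import time
import statistics

def benchmark_batches(translate_batch, batches):
    """Time each batch and derive the summary metrics used in this report.
    Sketch only; the real logic lives in benchmark_translation_local_models.py."""
    latencies_ms = []
    t0 = time.perf_counter()
    for batch in batches:
        start = time.perf_counter()
        translate_batch(batch)                        # one model call per batch
        latencies_ms.append((time.perf_counter() - start) * 1000)
    translate_s = time.perf_counter() - t0
    n_items = sum(len(batch) for batch in batches)
    cuts = statistics.quantiles(latencies_ms, n=100)  # cut points p1..p99
    return {
        "translate_s": translate_s,
        "items_per_s": n_items / translate_s,
        "avg_item_ms": translate_s * 1000 / n_items,
        "batch_p50_ms": statistics.median(latencies_ms),
        "batch_p95_ms": cuts[94],
    }
```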
## Final Production-Like Config
For `nllb-200-distilled-600m`, the final recommended config on `Tesla T4` is:
```yaml
nllb-200-distilled-600m:
  enabled: true
  backend: "local_nllb"
  model_id: "facebook/nllb-200-distilled-600M"
  model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
  device: "cuda"
  torch_dtype: "float16"
  batch_size: 16
  max_input_length: 256
  max_new_tokens: 64
  num_beams: 1
  attn_implementation: "sdpa"
```
What actually helped:
- `cuda + float16`
- `batch_size=16`
- `max_new_tokens=64`
- `attn_implementation=sdpa`

What did not become the final recommendation:
- `batch_size=32`: throughput can improve further, but tail latency degrades too much for a balanced default.
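
For reference, the config above maps onto the Hugging Face `transformers` API roughly as in the sketch below; this is a minimal illustration, not the actual `local_nllb` backend code, and `zho_Hans` / `eng_Latn` are the NLLB codes for Simplified Chinese and English.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_DIR = "./models/translation/facebook/nllb-200-distilled-600M"

# float16 weights on CUDA with SDPA attention, per the recommended config.
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, src_lang="zho_Hans")
model = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda").eval()

def translate_zh_to_en(texts):
    # One batch (up to batch_size=16 in production), truncated to max_input_length=256.
    inputs = tokenizer(
        texts, return_tensors="pt", padding=True, truncation=True, max_length=256
    ).to("cuda")
    with torch.inference_mode():
        out = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
            max_new_tokens=64,  # per the config above
            num_beams=1,        # greedy decoding
        )
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```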
## Final Results
| Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
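
Reading the table: `Items/s = Rows / Translate s` and `Avg item ms = 1000 / Items/s`; for example, `opus-mt-zh-en` gives 18,576 / 497.75 ≈ 37.3 items/s, i.e. roughly 26.8 ms per item. `Batch p50/p95` are per-batch latencies, not per-item latencies.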
## Single-Request Latency
To model online search query translation, we reran NLLB with `batch_size=1`. In this mode, batch latency is request latency.

| Model | Direction | Rows | Load s | Translate s | Avg req ms | Req p50 ms | Req p95 ms | Req max ms | Items/s |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | 100 | 6.8390 | 32.1909 | 321.909 | 292.54 | 624.12 | 819.67 | 3.11 |
| `nllb-200-distilled-600m` | `en -> zh` | 100 | 6.8249 | 54.2470 | 542.470 | 481.61 | 1171.71 | 1751.85 | 1.84 |

Command used:
```bash
./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
--single \
--model nllb-200-distilled-600m \
--source-lang zh \
--target-lang en \
--column title_cn \
--scene sku_name \
--batch-size 1 \
--limit 100
```
Takeaways for online use:
- Latency at `batch_size=1` can be treated as single-request latency for the current service path.
- `zh -> en` is materially faster than `en -> zh` on this machine.
- NLLB is usable for online query translation, but it is not a low-latency model by search-serving standards.
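
As an illustration of that batch-of-one path, a per-query wrapper could look like the sketch below (reusing the hypothetical `translate_zh_to_en` from the config section); per the table above, expect roughly 290 ms p50 and 620 ms p95 for `zh -> en` on this T4.

```python
import time

def translate_query(query: str):
    # batch_size=1: the batch latency is the request latency.
    start = time.perf_counter()
    translation = translate_zh_to_en([query])[0]
    latency_ms = (time.perf_counter() - start) * 1000
    return translation, latency_ms

text, ms = translate_query("无线蓝牙耳机 降噪")  # hypothetical search query
print(f"{text!r} took {ms:.0f} ms")
```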
## NLLB Resource Reality
The common online claim that this model needs only about `1.25GB` in `float16` is best read as a rough estimate of the weight size, not of end-to-end runtime memory.

Actual runtime on this machine:
- loaded on `cuda:0`
- actual parameter dtype verified as `torch.float16`
- steady GPU memory after load: about `2.6 GiB`
- benchmark peak GPU memory: about `2.8-3.0 GiB`

The difference comes from:
- CUDA context
- allocator reserved memory
- runtime activations and temporary tensors
- batch size
- input length and generation length
- framework overhead
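
One way to reproduce this kind of measurement is via PyTorch's allocator counters; a minimal sketch (the exact instrumentation behind the numbers above may differ, and `nvidia-smi` will report somewhat more because the CUDA context itself is not tracked here):

```python
import torch

GiB = 1024 ** 3

torch.cuda.reset_peak_memory_stats()
# ... load the model here (see the config sketch above) ...
torch.cuda.synchronize()
print(f"after load: {torch.cuda.memory_allocated() / GiB:.2f} GiB allocated, "
      f"{torch.cuda.memory_reserved() / GiB:.2f} GiB reserved by the allocator")
# next(model.parameters()).dtype should report torch.float16 here.

# ... run the benchmark batches here ...
torch.cuda.synchronize()
print(f"peak during benchmark: {torch.cuda.max_memory_allocated() / GiB:.2f} GiB allocated")
```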
## Final Takeaways
1. `opus-mt-zh-en` remains the fastest model on this machine.
2. `opus-mt-en-zh` is slower but still very practical for bulk translation.
3. `nllb-200-distilled-600m` is now fully usable on T4 after optimization.
4. `nllb` is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.