- `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured production settings.
- `nllb-200-distilled-600m` was benchmarked on a `500`-row subset after optimization.
- This report keeps only the final optimized results and the final deployment recommendation.
- Quality was intentionally not evaluated; this is a performance-only report.
## Final Production-Like Config
For `nllb-200-distilled-600m`, the final recommended config on `Tesla T4` is:
```yaml
nllb-200-distilled-600m:
  enabled: true
  backend: "local_nllb"
  model_id: "facebook/nllb-200-distilled-600M"
  model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
  device: "cuda"
  torch_dtype: "float16"
  batch_size: 16
  max_input_length: 256
  max_new_tokens: 64
  num_beams: 1
  attn_implementation: "sdpa"
```
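The YAML keys map directly onto load-time and generate-time keyword arguments in a Hugging Face-style pipeline. A minimal sketch of that split, assuming the config is parsed into a plain dict (the `split_config` helper is hypothetical, not from the benchmark code; the actual model load is shown only in comments since it needs the weights and a GPU):

```python
# Config values from the YAML above, flattened into a dict.
CONFIG = {
    "model_id": "facebook/nllb-200-distilled-600M",
    "torch_dtype": "float16",
    "batch_size": 16,
    "max_input_length": 256,
    "max_new_tokens": 64,
    "num_beams": 1,
    "attn_implementation": "sdpa",
}

def split_config(cfg):
    """Split the flat config into load-time and generate-time kwargs."""
    load = {
        "torch_dtype": cfg["torch_dtype"],            # as torch.float16 with torch installed
        "attn_implementation": cfg["attn_implementation"],
    }
    gen = {
        "max_new_tokens": cfg["max_new_tokens"],
        "num_beams": cfg["num_beams"],
    }
    return load, gen

# With transformers installed, usage would look roughly like:
#   model = AutoModelForSeq2SeqLM.from_pretrained(CONFIG["model_id"], **load).to("cuda")
#   out = model.generate(**inputs, **gen)
```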
What actually helped:
- `cuda + float16`
- `batch_size=16`
- `max_new_tokens=64`
- `attn_implementation=sdpa`
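The `batch_size=16` setting amounts to slicing the input corpus into fixed-size chunks before each generation call. A minimal batching helper (the function name is illustrative, not from the benchmark code):

```python
def batches(items, batch_size=16):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# The 500-row subset at batch_size=16 yields 31 full batches plus one batch of 4.
```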
What did not make the final recommendation:
- `batch_size=32`: throughput improves further, but tail latency (batch p95) degrades too much for a balanced default.
## Final Results
| Model | Direction | Device | Rows | Load (s) | Translate (s) | Items/s | Avg item (ms) | Batch p50 (ms) | Batch p95 (ms) |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
| `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
| `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
| `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
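The derived columns are simple functions of the row count and total wall-clock time. A small sketch that reproduces them, checked here against the `opus-mt-zh-en` row of the table:

```python
def derived_metrics(rows, translate_s):
    """Items/s and average per-item latency (ms) from total wall-clock time."""
    items_per_s = rows / translate_s
    avg_item_ms = 1000.0 * translate_s / rows
    return items_per_s, avg_item_ms

items_per_s, avg_item_ms = derived_metrics(18576, 497.7513)
# matches the table row: ~37.32 items/s, ~26.795 ms per item
```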
## NLLB Resource Reality
The common online claim that this model needs only about `1.25GB` in `float16` is best understood as a rough estimate of the weights alone, not of end-to-end runtime memory.
Actual runtime on this machine:
- loaded on `cuda:0`
- actual parameter dtype verified as `torch.float16`
- steady GPU memory after load: about `2.6 GiB`
- benchmark peak GPU memory: about `2.8-3.0 GiB`
The difference comes from:
- CUDA context
- allocator reserved memory
- runtime activations and temporary tensors
- batch size
- input length and generation length
- framework overhead
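The weights-only figure is easy to reproduce: at `float16`, every parameter takes 2 bytes. A back-of-envelope check (the ~615M parameter count is an assumption read off the model name, not a measured value) lands near the quoted `1.25GB` in decimal GB, well below the ~2.6 GiB observed once the overheads above are included:

```python
def fp16_weight_size(num_params):
    """Weights-only footprint at 2 bytes/parameter: (decimal GB, binary GiB)."""
    size_bytes = num_params * 2
    return size_bytes / 1e9, size_bytes / 2**30

gb, gib = fp16_weight_size(615_000_000)  # assumed parameter count
# ~1.23 GB decimal / ~1.15 GiB binary: close to the quoted 1.25GB,
# far from the ~2.6 GiB steady runtime memory measured on this machine
```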
## Final Takeaways
1. `opus-mt-zh-en` remains the fastest model on this machine.
2. `opus-mt-en-zh` is slower but still very practical for bulk translation.
3. `nllb-200-distilled-600m` is now fully usable on T4 after optimization.
4. `nllb` is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.