  # Local Translation Model Benchmark Report
  
  Test script: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  
  Test time: `2026-03-17`
  
  Environment:
  - GPU: `Tesla T4 16GB`
  - Driver / CUDA: `570.158.01 / 12.8`
  - Python env: `.venv-translator`
  - Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
  
  Method:
  - `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured production settings.
  - `nllb-200-distilled-600m` was benchmarked on a `500`-row subset after optimization.
  - `nllb-200-distilled-600m` was also benchmarked with `batch_size=1` on a `100`-row subset to approximate online query translation latency.
  - This report keeps only the final optimized results and the final deployment recommendation.
  - Quality was intentionally not evaluated; this is a performance-only report.
  
  ## Final Production-Like Config
  
  For `nllb-200-distilled-600m`, the final recommended config on `Tesla T4` is:
  
  ```yaml
  nllb-200-distilled-600m:
    enabled: true
    backend: "local_nllb"
    model_id: "facebook/nllb-200-distilled-600M"
    model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
    device: "cuda"
    torch_dtype: "float16"
    batch_size: 16
    max_input_length: 256
    max_new_tokens: 64
    num_beams: 1
    attn_implementation: "sdpa"
  ```
  
  What actually helped:
  - `cuda + float16`
  - `batch_size=16`
  - `max_new_tokens=64`
  - `attn_implementation=sdpa`
  
  What was tried but not adopted:
  - `batch_size=32`: throughput improves further, but tail latency degrades too much for a balanced default.
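
As a sketch of how these settings would map onto a Hugging Face `transformers` call — the split between load-time and generate-time kwargs is an assumption about how the benchmark script consumes the config, not its actual wiring:

```python
def split_runtime_kwargs(cfg: dict) -> tuple[dict, dict]:
    """Split the YAML block into `from_pretrained()` kwargs and
    `generate()` kwargs. The mapping is an assumption about how a
    benchmark script would consume this config."""
    load_kwargs = {
        "torch_dtype": cfg["torch_dtype"],                  # "float16" halves weight memory on T4
        "attn_implementation": cfg["attn_implementation"],  # "sdpa" uses fused attention kernels
    }
    gen_kwargs = {
        "max_new_tokens": cfg["max_new_tokens"],  # caps decode length per item
        "num_beams": cfg["num_beams"],            # 1 == greedy decoding, no beam search cost
    }
    return load_kwargs, gen_kwargs

cfg = {
    "torch_dtype": "float16",
    "attn_implementation": "sdpa",
    "max_new_tokens": 64,
    "num_beams": 1,
}
load_kwargs, gen_kwargs = split_runtime_kwargs(cfg)
print(load_kwargs, gen_kwargs)
```

`batch_size` and `max_input_length` are not in either dict: they govern how inputs are chunked and tokenized before `generate()` is called.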
  
  ## Final Results
  
  | Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
  |---|---|---:|---:|---:|---:|---:|---:|---:|---:|
  | `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
  | `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
  | `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
  | `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
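
The derived columns follow directly from `Rows` and `Translate s`; a quick sanity check using the `opus-mt-zh-en` row:

```python
def derived_metrics(rows: int, translate_s: float) -> tuple[float, float]:
    """Recompute Items/s and Avg item ms from the measured wall time."""
    items_per_s = rows / translate_s
    avg_item_ms = translate_s / rows * 1000.0
    return items_per_s, avg_item_ms

# opus-mt-zh-en row: 18,576 rows translated in 497.7513 s
ips, avg_ms = derived_metrics(18_576, 497.7513)
print(round(ips, 2), round(avg_ms, 3))  # 37.32 26.795
```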
  
  ## Single-Request Latency
  
  To model online search query translation, we reran NLLB with `batch_size=1`. In this mode, batch latency is request latency.
  
  | Model | Direction | Rows | Load s | Translate s | Avg req ms | Req p50 ms | Req p95 ms | Req max ms | Items/s |
  |---|---|---:|---:|---:|---:|---:|---:|---:|---:|
  | `nllb-200-distilled-600m` | `zh -> en` | 100 | 6.8390 | 32.1909 | 321.909 | 292.54 | 624.12 | 819.67 | 3.11 |
  | `nllb-200-distilled-600m` | `en -> zh` | 100 | 6.8249 | 54.2470 | 542.470 | 481.61 | 1171.71 | 1751.85 | 1.84 |
  
  Command used:
  
  ```bash
  ./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
    --single \
    --model nllb-200-distilled-600m \
    --source-lang zh \
    --target-lang en \
    --column title_cn \
    --scene sku_name \
    --batch-size 1 \
    --limit 100
  ```
  
  Takeaways for online use:
  - With `batch_size=1`, per-batch latency is effectively single-request latency for the current service path.
  - `zh -> en` is materially faster than `en -> zh` on this machine.
  - NLLB is usable for online query translation, but it is not a low-latency model by search-serving standards.
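
The p50/p95 columns can be reproduced from raw per-request latencies; a minimal sketch assuming the common nearest-rank percentile definition (the benchmark script may interpolate instead, and the sample values here are hypothetical):

```python
import math

def percentile(samples: list[float], q: float) -> float:
    """Nearest-rank percentile over raw latency samples."""
    xs = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(xs)))  # 1-based nearest rank
    return xs[rank - 1]

# ten hypothetical per-request latencies in ms
lat_ms = [210.0, 250.0, 290.0, 295.0, 310.0, 330.0, 480.0, 520.0, 640.0, 820.0]
print(percentile(lat_ms, 50), percentile(lat_ms, 95))  # 310.0 820.0
```

Note how a single slow request dominates p95 long before it moves the average — which is why the report tracks tail latency separately from `Avg req ms`.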
  
  ## NLLB Resource Reality
  
  The common online claim that this model needs only about `1.25GB` in `float16` is best read as a rough estimate of the weight size alone, not of end-to-end runtime memory.
  
  Actual runtime on this machine:
  - loaded on `cuda:0`
  - actual parameter dtype verified as `torch.float16`
  - steady GPU memory after load: about `2.6 GiB`
  - benchmark peak GPU memory: about `2.8-3.0 GiB`
  
  The difference comes from:
  - CUDA context
  - allocator reserved memory
  - runtime activations and temporary tensors
  - batch size
  - input length and generation length
  - framework overhead
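
A back-of-the-envelope check on the weight-size claim, assuming roughly `0.6B` parameters as the model name suggests (the exact count differs slightly):

```python
def fp16_weights_gb(num_params: float) -> float:
    """float16 stores 2 bytes per parameter; this covers weights only,
    none of the runtime overheads listed above."""
    return num_params * 2 / 1e9

# ~0.6B parameters (assumption based on the model name)
print(round(fp16_weights_gb(600e6), 2))  # 1.2
```

That lands near the quoted `1.25GB`, which is why the figure should be read as weight size; the observed `2.6 GiB` steady state is those same weights plus the runtime overheads listed above.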
  
  ## Final Takeaways
  
  1. `opus-mt-zh-en` remains the fastest model on this machine.
  2. `opus-mt-en-zh` is slower but still very practical for bulk translation.
  3. `nllb-200-distilled-600m` is now fully usable on T4 after optimization.
  4. `nllb` is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.