  # Local Translation Model Benchmark Report
  
  Test script: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
  
  Test time: `2026-03-17`
  
  Environment:
  - GPU: `Tesla T4 16GB`
  - Driver / CUDA: `570.158.01 / 12.8`
  - Python env: `.venv-translator`
  - Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
  
  Method:
  - `opus-mt-zh-en` and `opus-mt-en-zh` were benchmarked on the full dataset using their configured production settings.
  - `nllb-200-distilled-600m` was benchmarked on a `500`-row subset after optimization.
  - `nllb-200-distilled-600m` was also benchmarked with `batch_size=1` on a `100`-row subset to approximate online query translation latency.
  - This report only keeps the final optimized results and final deployment recommendation.
  - Quality was intentionally not evaluated; this is a performance-only report.
  
  ## Final Production-Like Config
  
  For `nllb-200-distilled-600m`, the final recommended config on `Tesla T4` is:
  
  ```yaml
  nllb-200-distilled-600m:
    enabled: true
    backend: "local_nllb"
    model_id: "facebook/nllb-200-distilled-600M"
    model_dir: "./models/translation/facebook/nllb-200-distilled-600M"
    device: "cuda"
    torch_dtype: "float16"
    batch_size: 16
    max_input_length: 256
    max_new_tokens: 64
    num_beams: 1
    attn_implementation: "sdpa"
  ```
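As a rough sketch of how these keys typically map onto Hugging Face `transformers` calls (the split below follows common convention; the service's actual loader is not shown in this report):

```python
# Sketch: split the config above into load-time, tokenize-time, and
# generate-time arguments, following common transformers conventions.
# This mirrors the YAML keys; it is not the service's actual loader.
config = {
    "model_id": "facebook/nllb-200-distilled-600M",
    "device": "cuda",
    "torch_dtype": "float16",
    "batch_size": 16,
    "max_input_length": 256,
    "max_new_tokens": 64,
    "num_beams": 1,
    "attn_implementation": "sdpa",
}

# AutoModelForSeq2SeqLM.from_pretrained(config["model_id"], **load_kwargs).to(config["device"])
load_kwargs = {
    "torch_dtype": config["torch_dtype"],
    "attn_implementation": config["attn_implementation"],
}

# tokenizer(batch, return_tensors="pt", padding=True, **tokenize_kwargs)
tokenize_kwargs = {
    "truncation": True,
    "max_length": config["max_input_length"],
}

# model.generate(**inputs, **generate_kwargs)
generate_kwargs = {
    "max_new_tokens": config["max_new_tokens"],
    "num_beams": config["num_beams"],
}

print(load_kwargs, tokenize_kwargs, generate_kwargs)
```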
  
  What actually helped:
  - `cuda + float16`
  - `batch_size=16`
  - `max_new_tokens=64`
  - `attn_implementation=sdpa`
  
What was tried but not adopted:
- `batch_size=32`
  Throughput improves further, but per-batch tail latency degrades too much for a balanced default.
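To illustrate the shape of that trade-off (toy numbers, not measurements from this benchmark): if each batch pays a fixed cost plus a per-item cost, doubling the batch raises throughput modestly while batch latency grows much faster.

```python
# Toy latency model, illustration only -- the constants are invented,
# not measured. Each batch pays a fixed overhead plus a per-item cost.
def batch_latency_ms(batch_size: int,
                     fixed_ms: float = 200.0,
                     per_item_ms: float = 40.0) -> float:
    return fixed_ms + per_item_ms * batch_size

for bs in (16, 32):
    lat_ms = batch_latency_ms(bs)
    items_per_s = bs / (lat_ms / 1000.0)
    print(f"batch_size={bs}: latency={lat_ms:.0f} ms, "
          f"throughput={items_per_s:.1f} items/s")
```

Under this toy model, moving from `16` to `32` buys about 13% more throughput at the cost of about 76% more per-batch latency.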
  
  ## Final Results
  
  | Model | Direction | Device | Rows | Load s | Translate s | Items/s | Avg item ms | Batch p50 ms | Batch p95 ms |
  |---|---|---:|---:|---:|---:|---:|---:|---:|---:|
  | `opus-mt-zh-en` | `zh -> en` | `cuda` | 18,576 | 3.1435 | 497.7513 | 37.32 | 26.795 | 301.99 | 1835.81 |
  | `opus-mt-en-zh` | `en -> zh` | `cuda` | 18,576 | 3.1867 | 987.3994 | 18.81 | 53.155 | 449.14 | 2012.12 |
  | `nllb-200-distilled-600m` | `zh -> en` | `cuda` | 500 | 7.3397 | 25.9577 | 19.26 | 51.915 | 832.64 | 1263.01 |
  | `nllb-200-distilled-600m` | `en -> zh` | `cuda` | 500 | 7.4152 | 42.0405 | 11.89 | 84.081 | 1093.87 | 2107.44 |
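The derived columns follow directly from `Rows` and `Translate s`; a quick consistency check against the two NLLB rows above:

```python
# Recompute Items/s and Avg item ms from the table's Rows and
# Translate s columns for the two nllb-200-distilled-600m rows.
rows, translate_s = 500, 25.9577                       # zh -> en
assert round(rows / translate_s, 2) == 19.26           # Items/s
assert round(translate_s / rows * 1000, 3) == 51.915   # Avg item ms

rows, translate_s = 500, 42.0405                       # en -> zh
assert round(rows / translate_s, 2) == 11.89           # Items/s
assert round(translate_s / rows * 1000, 3) == 84.081   # Avg item ms
```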
  
  ## Single-Request Latency
  
  To model online search query translation, we reran NLLB with `batch_size=1`. In this mode, batch latency is request latency.
  
  | Model | Direction | Rows | Load s | Translate s | Avg req ms | Req p50 ms | Req p95 ms | Req max ms | Items/s |
  |---|---|---:|---:|---:|---:|---:|---:|---:|---:|
  | `nllb-200-distilled-600m` | `zh -> en` | 100 | 6.8390 | 32.1909 | 321.909 | 292.54 | 624.12 | 819.67 | 3.11 |
  | `nllb-200-distilled-600m` | `en -> zh` | 100 | 6.8249 | 54.2470 | 542.470 | 481.61 | 1171.71 | 1751.85 | 1.84 |
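With `batch_size=1`, `Avg req ms` is simply total translate time divided by request count, and `Items/s` is its inverse:

```python
# Derive Avg req ms and Items/s from Rows and Translate s for the
# single-request (batch_size=1) runs above.
def avg_req_ms(translate_s: float, rows: int) -> float:
    return translate_s / rows * 1000.0

assert round(avg_req_ms(32.1909, 100), 3) == 321.909   # zh -> en
assert round(100 / 32.1909, 2) == 3.11                 # Items/s, zh -> en
assert round(avg_req_ms(54.2470, 100), 3) == 542.470   # en -> zh
assert round(100 / 54.2470, 2) == 1.84                 # Items/s, en -> zh
```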
  
  Command used:
  
  ```bash
  ./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
    --single \
    --model nllb-200-distilled-600m \
    --source-lang zh \
    --target-lang en \
    --column title_cn \
    --scene sku_name \
    --batch-size 1 \
    --limit 100
  ```
  
  Takeaways for online use:
  - `batch_size=1` can be treated as single-request latency for the current service path.
  - `zh -> en` is materially faster than `en -> zh` on this machine.
  - NLLB is usable for online query translation, but it is not a low-latency model by search-serving standards.
  
  ## NLLB Resource Reality
  
The common online claim that this model needs only about `1.25GB` in `float16` is best understood as the approximate size of the weights alone, not end-to-end runtime memory.
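The weight-size arithmetic behind that figure (assuming roughly 600M parameters, per the model name, at 2 bytes per `float16` parameter):

```python
# Back-of-envelope weight size for nllb-200-distilled-600M in float16.
# 600M parameters is read off the model name; the exact count differs slightly.
params = 600e6
bytes_per_param = 2  # float16
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.1f} GB of weights alone")
```

Everything beyond that in the steady and peak figures below is runtime overhead.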
  
  Actual runtime on this machine:
  - loaded on `cuda:0`
  - actual parameter dtype verified as `torch.float16`
  - steady GPU memory after load: about `2.6 GiB`
  - benchmark peak GPU memory: about `2.8-3.0 GiB`
  
  The difference comes from:
  - CUDA context
  - allocator reserved memory
  - runtime activations and temporary tensors
  - batch size
  - input length and generation length
  - framework overhead
  
  ## Final Takeaways
  
  1. `opus-mt-zh-en` remains the fastest model on this machine.
  2. `opus-mt-en-zh` is slower but still very practical for bulk translation.
  3. `nllb-200-distilled-600m` is now fully usable on T4 after optimization.
  4. `nllb` is still slower than the two Marian models, but it is the better choice when broad multilingual coverage matters more than peak throughput.