  # NLLB T4 Long-Text Reassessment
  
  Date: 2026-03-19
  Model: `nllb-200-distilled-600m`
  Backend: `CTranslate2 + float16`
  Direction: `zh -> en`
  
  ## Why This Reassessment Exists
  
  Earlier notes mixed two different ideas:
  
  - `batch_size=64` was the highest-throughput point in the original product-title sweeps.
- `batch_size=16` was only a more conservative default candidate, chosen to balance throughput with tail latency for online use.
  
That distinction was not carried forward clearly, so this reassessment re-measures the current long-text segmented workload directly instead of mechanically reusing the product-title conclusion.
  
  ## Current Long-Text Workload Observed in Logs
  
  The clearest apples-to-apples evidence came from repeated uncached requests of the same long Chinese input:
  
  - input length: about `3944` to `3966` chars
  - segmented into `60` pieces
  - target language: `en`
  - source language: `zh`
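
The service's actual segmenter is not visible in the logs; purely to illustrate how a `~3950` char input can end up as roughly `60` pieces, here is a minimal sketch that splits on Chinese sentence punctuation and folds short sentences into bounded pieces (the 80-char cap is an arbitrary assumption, not the service's real setting):

```python
import re

def segment_zh(text: str, max_chars: int = 80) -> list[str]:
    """Split Chinese text on sentence-ending punctuation, then fold
    runs of short sentences into pieces of at most max_chars."""
    # Lookbehind split keeps each delimiter attached to its sentence.
    sentences = [s for s in re.split(r"(?<=[。！？；])", text) if s]
    pieces: list[str] = []
    buf = ""
    for s in sentences:
        if buf and len(buf) + len(s) > max_chars:
            pieces.append(buf)
            buf = s
        else:
            buf += s
    if buf:
        pieces.append(buf)
    return pieces

sample = "这是第一句。这是第二句！第三句呢？" * 10
print(len(segment_zh(sample)))
```

A ~3950-char input with ~66 chars per sentence group would land near 60 pieces under a cap in this range.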
  
  ### Log-Derived Comparison
  
  `batch_size=16` samples from [`logs/translator-2026-03-19.log`](/data/saas-search/logs/translator-2026-03-19.log):
  
  - `reqid=181f00ae` -> `1586.87 ms`
  - `reqid=d6c1213f` -> `1732.95 ms`
  - `reqid=26f8acd1` -> `4745.32 ms`
  
  `batch_size=64` samples from the same log:
  
  - `reqid=28262f1e` -> `752.96 ms`
  - `reqid=737fc848` -> `815.66 ms`
  - `reqid=8d05fa20` -> `835.25 ms`
  - `reqid=e29d2629` -> `3927.87 ms`
  - `reqid=c2b1df14` -> `4049.31 ms`
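
The medians quoted in the summary come straight from these samples; the median is used deliberately, so the multi-second contention outliers do not skew the comparison. A minimal reproduction:

```python
from statistics import median

# End-to-end latencies (ms) copied from the log samples above.
bs16 = [1586.87, 1732.95, 4745.32]
bs64 = [752.96, 815.66, 835.25, 3927.87, 4049.31]

print(median(bs16))  # 1732.95
print(median(bs64))  # 835.25
```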
  
  ### Summary
  
  For this `~3950 char / 60 segment` workload:
  
  - `batch_size=16`
    - median end-to-end latency: `1732.95 ms`
    - median `segmentation_summary -> response`: `1672 ms`
  - `batch_size=64`
    - median end-to-end latency: `835.25 ms`
    - median `segmentation_summary -> response`: `782 ms`
  
This means the steady-state inference portion (`segmentation_summary -> response`) roughly halved, from `1672 ms` to `782 ms`, after moving from `16` to `64`.
  
  ## Important Environment Finding
  
This machine was not in an isolated benchmark state during the re-check:
  
  - the single T4 was shared with translator, embedding, CN-CLIP, and reranker processes
  - `nvidia-smi` showed about `15157 / 16384 MiB` in use during the re-check
  
  That explains the multi-second outliers in both the `16` and `64` groups. The outliers mainly appeared before the segmentation summary log, so they should be treated as shared-GPU contention noise, not pure model execution time.
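
A pre-run headroom check makes this contention visible before benchmarking. The `nvidia-smi` query flags below are standard; the helper itself is just a sketch:

```python
import subprocess

def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse a 'used, total' CSV line (MiB) from nvidia-smi."""
    used, total = (int(v) for v in line.strip().split(","))
    return used, total

def gpu_memory_mib(gpu_id: int = 0) -> tuple[int, int]:
    """Return (used, total) MiB for one GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits", f"--id={gpu_id}"],
        text=True,
    )
    return parse_memory_line(out)

# The re-check above saw roughly this reading on the shared T4:
used, total = parse_memory_line("15157, 16384")
print(f"{used}/{total} MiB used, {total - used} MiB free")
```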
  
  ## Current Config Drift
  
  During this review, the live config had already been moved again to `batch_size=256`.
  
  That larger value is not yet backed by the same quality of evidence:
  
  - for `60` segments, `256` cannot improve on `64` in any meaningful way because both already fit the whole request into one inference batch
  - for much larger requests such as `11847` chars and `180` segments, `256` may help, but we do not yet have a clean isolated comparison against `64`
  - on a shared T4, larger batches also reduce memory headroom and make benchmarking less stable
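
The first bullet is plain ceiling arithmetic over the segment count; a quick check:

```python
from math import ceil

def batches_needed(num_segments: int, batch_size: int) -> int:
    # With ct2_batch_type=examples each segment is one example, so a
    # request needs ceil(segments / batch_size) inference batches.
    return ceil(num_segments / batch_size)

# 60-segment request: 64 and 256 both fit it in a single batch.
print(batches_needed(60, 64), batches_needed(60, 256))    # 1 1
# 180-segment request: 256 would save two batch launches over 64.
print(batches_needed(180, 64), batches_needed(180, 256))  # 3 1
```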
  
  ## Recommendation
  
  For the current shared-T4 deployment, keep the general NLLB default at:
  
  - `batch_size=64`
  - `ct2_inter_threads=4`
  - `ct2_max_queued_batches=32`
  - `ct2_batch_type=examples`
  - `max_new_tokens=64`
  - `ct2_decoding_length_mode=source`
  - `ct2_decoding_length_extra=8`
  - `ct2_decoding_length_min=32`
  
  Treat `batch_size=128` or `256` as workload-specific experiments, not as the default baseline.
  
  ## Best Practices Going Forward
  
  - Benchmark long-text segmented translation separately from product-title translation.
  - Use uncached repeated requests with the same long sample when checking single-request latency.
  - Split latency analysis into:
    - `request -> segmentation summary`
    - `segmentation summary -> response`
  - Do not treat shared-GPU results as a clean config ranking.
  - Before promoting a larger batch like `128` or `256` to default, re-run in a translator-only GPU window.
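
The recommended two-stage split is easy to compute once per-request timestamps are available; the timestamp values and field names in this sketch are illustrative, not taken from the actual log format:

```python
from datetime import datetime

def split_latency(request_ts: str, seg_summary_ts: str,
                  response_ts: str) -> dict[str, float]:
    """Break end-to-end latency into the two stages recommended
    above, from ISO-8601 timestamp strings."""
    t0, t1, t2 = (datetime.fromisoformat(t)
                  for t in (request_ts, seg_summary_ts, response_ts))
    return {
        "request_to_segmentation_ms": (t1 - t0).total_seconds() * 1000.0,
        "segmentation_to_response_ms": (t2 - t1).total_seconds() * 1000.0,
    }

# Illustrative request whose inference stage matches the 782 ms median:
print(split_latency("2026-03-19T10:00:00.000",
                    "2026-03-19T10:00:00.053",
                    "2026-03-19T10:00:00.835"))
```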