NLLB T4 Long-Text Reassessment
Date: 2026-03-19
Model: nllb-200-distilled-600m
Backend: CTranslate2 + float16
Direction: zh -> en
Why This Reassessment Exists
Earlier notes mixed two different ideas:
- `batch_size=64` was the highest-throughput point in the original product-title sweeps.
- `batch_size=16` was only a more conservative default candidate for balancing throughput with tail latency in online use.
That distinction was not carried forward clearly enough. We re-checked the current long-text segmented workload instead of reusing the product-title conclusion mechanically.
Current Long-Text Workload Observed in Logs
The clearest apples-to-apples evidence came from repeated uncached requests of the same long Chinese input:
- input length: about 3944 to 3966 chars
- segmented into 60 pieces
- target language: en
- source language: zh
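The note does not show the segmenter itself. As a rough illustration of how a ~3950-char input can end up as ~60 pieces, here is a minimal sketch; the `segment` helper, the punctuation-based splitting, and the `max_chars` cap are all assumptions, not the service's actual implementation.

```python
import re

def segment(text: str, max_chars: int = 80) -> list[str]:
    """Split long Chinese text into translation segments (hypothetical sketch).

    Splits after sentence-final punctuation, then greedily packs sentences
    into pieces of at most max_chars. A single sentence longer than
    max_chars becomes its own piece (no hard mid-sentence split here).
    """
    # Lookbehind split keeps the delimiter attached to its sentence.
    sentences = [s for s in re.split(r"(?<=[。！？；])", text) if s]
    pieces, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            pieces.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        pieces.append(current)
    return pieces
```

At roughly 66 chars per piece, an input in the 3944-3966 char range would land near the 60 pieces observed in the logs.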
Log-Derived Comparison
`batch_size=16` samples from `logs/translator-2026-03-19.log`:
- reqid=181f00ae -> 1586.87 ms
- reqid=d6c1213f -> 1732.95 ms
- reqid=26f8acd1 -> 4745.32 ms
`batch_size=64` samples from the same log:
- reqid=28262f1e -> 752.96 ms
- reqid=737fc848 -> 815.66 ms
- reqid=8d05fa20 -> 835.25 ms
- reqid=e29d2629 -> 3927.87 ms
- reqid=c2b1df14 -> 4049.31 ms
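Rather than eyeballing the medians, they can be computed directly from the sampled end-to-end latencies above:

```python
from statistics import median

# End-to-end latencies (ms) taken from logs/translator-2026-03-19.log.
samples = {
    16: [1586.87, 1732.95, 4745.32],
    64: [752.96, 815.66, 835.25, 3927.87, 4049.31],
}

for batch_size, latencies in samples.items():
    # median() is robust to the multi-second contention outliers in each group.
    print(f"batch_size={batch_size}: median {median(latencies):.2f} ms")
```

This reproduces the 1732.95 ms and 835.25 ms medians cited in the summary below.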
Summary
For this ~3950 char / 60 segment workload:
`batch_size=16`:
- median end-to-end latency: 1732.95 ms
- median `segmentation_summary -> response`: 1672 ms

`batch_size=64`:
- median end-to-end latency: 835.25 ms
- median `segmentation_summary -> response`: 782 ms
This means moving from 16 to 64 roughly halved the steady-state inference portion: the median segmentation-summary-to-response time dropped from 1672 ms to 782 ms.
Important Environment Finding
This machine was not in an isolated benchmark state while re-checking:
- the single T4 was shared with translator, embedding, CN-CLIP, and reranker processes
- `nvidia-smi` showed about 15157 / 16384 MiB in use during the re-check
That explains the multi-second outliers in both the 16 and 64 groups. The outliers mainly appeared before the segmentation summary log, so they should be treated as shared-GPU contention noise, not pure model execution time.
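One simple way to separate that contention noise from steady-state samples is a median-based cutoff. The 2x factor below is an arbitrary choice for this sketch, not a tuned threshold:

```python
from statistics import median

def split_outliers(latencies_ms: list[float], factor: float = 2.0):
    """Flag samples above factor * median as likely shared-GPU contention.

    The factor is an arbitrary cutoff for illustration only.
    """
    m = median(latencies_ms)
    steady = [x for x in latencies_ms if x <= factor * m]
    outliers = [x for x in latencies_ms if x > factor * m]
    return steady, outliers

# The batch_size=64 samples from the log above:
steady, noisy = split_outliers([752.96, 815.66, 835.25, 3927.87, 4049.31])
```

Applied to the `batch_size=64` group, this flags the two ~4 s samples and keeps the three sub-second ones, matching the interpretation above.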
Current Config Drift
During this review, the live config had already been moved again to `batch_size=256`.
That larger value is not yet backed by the same quality of evidence:
- for 60 segments, 256 cannot improve on 64 in any meaningful way, because both already fit the whole request into one inference batch
- for much larger requests such as 11847 chars and 180 segments, 256 may help, but we do not yet have a clean isolated comparison against 64
- on a shared T4, larger batches also reduce memory headroom and make benchmarking less stable
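The first two points above reduce to simple batch-count arithmetic, assuming each batch holds up to `batch_size` segments (i.e. example-based batching):

```python
import math

def num_batches(segments: int, batch_size: int) -> int:
    # With example-based batching, each batch holds up to batch_size segments.
    return math.ceil(segments / batch_size)

# 60-segment request: both settings already produce a single inference batch,
# so raising 64 to 256 changes nothing.
assert num_batches(60, 64) == 1
assert num_batches(60, 256) == 1

# 180-segment request: 64 needs three sequential batches, 256 needs one.
assert num_batches(180, 64) == 3
assert num_batches(180, 256) == 1
```

Whether the single 180-segment batch at 256 is actually faster than three batches at 64 on a shared T4 is exactly what still needs an isolated measurement.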
Recommendation
For the current shared-T4 deployment, keep the general NLLB default at:
- `batch_size=64`
- `ct2_inter_threads=4`
- `ct2_max_queued_batches=32`
- `ct2_batch_type=examples`
- `max_new_tokens=64`
- `ct2_decoding_length_mode=source`
- `ct2_decoding_length_extra=8`
- `ct2_decoding_length_min=32`
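The three `ct2_decoding_length_*` settings are the only non-obvious ones. A minimal sketch of the mapping we assume the service applies (the `decoding_length` helper is hypothetical; the real wiring into CTranslate2's decoding-length parameter is not shown in this note):

```python
def decoding_length(source_len: int, extra: int = 8, minimum: int = 32) -> int:
    """Assumed semantics of ct2_decoding_length_mode=source.

    Budget the target length from the source token count plus
    ct2_decoding_length_extra, never going below ct2_decoding_length_min.
    """
    return max(source_len + extra, minimum)

# Short segments still get the 32-token floor; longer segments scale
# with their source length instead of a fixed global cap.
assert decoding_length(10) == 32
assert decoding_length(50) == 58
```

Source-proportional decoding budgets matter for segmented long text: a fixed global cap either truncates long segments or wastes decode steps on short ones.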
Treat `batch_size=128` or `256` as workload-specific experiments, not as the default baseline.
Best Practices Going Forward
- Benchmark long-text segmented translation separately from product-title translation.
- Use uncached repeated requests with the same long sample when checking single-request latency.
- Split latency analysis into:
  - `request -> segmentation summary`
  - `segmentation summary -> response`
- Do not treat shared-GPU results as a clean config ranking.
- Before promoting a larger batch like 128 or 256 to default, re-run in a translator-only GPU window.
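The two-phase latency split recommended above needs only three timestamps per request. A small sketch (the function and argument names are hypothetical; the sample values plug in the `batch_size=16` medians from this note, which imply ~60.95 ms spent before the segmentation summary):

```python
def phase_latencies(t_request_ms: float,
                    t_seg_summary_ms: float,
                    t_response_ms: float) -> tuple[float, float]:
    """Split end-to-end latency into the two phases used in this note.

    Phase 1 (request -> segmentation summary) absorbs queueing and
    shared-GPU contention; phase 2 (segmentation summary -> response)
    approximates steady-state inference.
    """
    return (t_seg_summary_ms - t_request_ms,
            t_response_ms - t_seg_summary_ms)

# batch_size=16 medians: 1732.95 ms end to end, 1672 ms of it in phase 2.
pre_ms, infer_ms = phase_latencies(0.0, 60.95, 1732.95)
```

Ranking configs on phase 2 alone avoids letting contention spikes in phase 1 dominate the comparison.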