# NLLB T4 Long-Text Reassessment

- Date: 2026-03-19
- Model: `nllb-200-distilled-600m`
- Backend: `CTranslate2 + float16`
- Direction: `zh -> en`

## Why This Reassessment Exists

Earlier notes mixed two different ideas:

- `batch_size=64` was the highest-throughput point in the original product-title sweeps.
- `batch_size=16` was only a more conservative default candidate when trying to balance throughput with tail latency for online use.

That distinction was not carried forward clearly enough. We re-checked the current long-text segmented workload instead of mechanically reusing the product-title conclusion.

## Current Long-Text Workload Observed in Logs

The clearest apples-to-apples evidence came from repeated uncached requests of the same long Chinese input:

- input length: about `3944` to `3966` chars
- segmented into `60` pieces
- target language: `en`
- source language: `zh`

### Log-Derived Comparison

`batch_size=16` samples from [`logs/translator-2026-03-19.log`](/data/saas-search/logs/translator-2026-03-19.log):

- `reqid=181f00ae` -> `1586.87 ms`
- `reqid=d6c1213f` -> `1732.95 ms`
- `reqid=26f8acd1` -> `4745.32 ms`

`batch_size=64` samples from the same log:

- `reqid=28262f1e` -> `752.96 ms`
- `reqid=737fc848` -> `815.66 ms`
- `reqid=8d05fa20` -> `835.25 ms`
- `reqid=e29d2629` -> `3927.87 ms`
- `reqid=c2b1df14` -> `4049.31 ms`

### Summary

For this `~3950 char / 60 segment` workload:

- `batch_size=16`
  - median end-to-end latency: `1732.95 ms`
  - median `segmentation_summary -> response`: `1672 ms`
- `batch_size=64`
  - median end-to-end latency: `835.25 ms`
  - median `segmentation_summary -> response`: `782 ms`

This means the steady-state inference portion was cut by about half after moving from `16` to `64`.
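As a sanity check, the medians above can be reproduced directly from the per-request samples in the log excerpts (a minimal sketch; the timings are copied verbatim from the comparison above):

```python
import statistics

# End-to-end latencies (ms) copied from the log-derived comparison above.
latency_ms = {
    16: [1586.87, 1732.95, 4745.32],
    64: [752.96, 815.66, 835.25, 3927.87, 4049.31],
}

for batch_size, samples in latency_ms.items():
    median = statistics.median(samples)
    print(f"batch_size={batch_size}: median={median:.2f} ms over {len(samples)} samples")
# -> batch_size=16: median=1732.95 ms over 3 samples
# -> batch_size=64: median=835.25 ms over 5 samples
```

Note that with only 3 and 5 samples respectively, the median conveniently ignores the multi-second outliers, which is exactly why it is the right summary statistic for this contended-GPU setting.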
## Important Environment Finding

This machine was not in an isolated benchmark state during the re-check:

- the single T4 was shared with translator, embedding, CN-CLIP, and reranker processes
- `nvidia-smi` showed about `15157 / 16384 MiB` in use during the re-check

That explains the multi-second outliers in both the `16` and `64` groups. The outliers mainly appeared before the segmentation summary log, so they should be treated as shared-GPU contention noise, not pure model execution time.

## Current Config Drift

During this review, the live config had already been moved again to `batch_size=256`. That larger value is not yet backed by the same quality of evidence:

- for `60` segments, `256` cannot improve on `64` in any meaningful way, because both already fit the whole request into one inference batch
- for much larger requests such as `11847` chars and `180` segments, `256` may help, but we do not yet have a clean isolated comparison against `64`
- on a shared T4, larger batches also reduce memory headroom and make benchmarking less stable

## Recommendation

For the current shared-T4 deployment, keep the general NLLB default at:

- `batch_size=64`
- `ct2_inter_threads=4`
- `ct2_max_queued_batches=32`
- `ct2_batch_type=examples`
- `max_new_tokens=64`
- `ct2_decoding_length_mode=source`
- `ct2_decoding_length_extra=8`
- `ct2_decoding_length_min=32`

Treat `batch_size=128` or `256` as workload-specific experiments, not as the default baseline.

## Best Practices Going Forward

- Benchmark long-text segmented translation separately from product-title translation.
- Use uncached repeated requests with the same long sample when checking single-request latency.
- Split latency analysis into:
  - `request -> segmentation summary`
  - `segmentation summary -> response`
- Do not treat shared-GPU results as a clean config ranking.
- Before promoting a larger batch like `128` or `256` to default, re-run in a translator-only GPU window.
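To make the `ct2_decoding_length_*` settings in the recommendation concrete, here is a minimal sketch of how `mode=source` plausibly combines with `extra` and `min`. The function name is hypothetical; these keys are application-level config, not raw CTranslate2 arguments, so the exact mapping in the service may differ:

```python
def effective_max_decoding_length(
    source_tokens: int,
    extra: int = 8,      # ct2_decoding_length_extra
    minimum: int = 32,   # ct2_decoding_length_min
) -> int:
    """mode=source: the decoding budget tracks the source length,
    padded by `extra` tokens and floored at `minimum`."""
    return max(minimum, source_tokens + extra)

# A short segment still gets the floor; a long one tracks its source length.
print(effective_max_decoding_length(10))   # -> 32
print(effective_max_decoding_length(100))  # -> 108
```

For zh -> en, output token counts are usually close to (or somewhat above) the source token count, so a source-tracking budget avoids both truncation on long segments and wasted decoding steps on short ones.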