
NLLB T4 Long-Text Reassessment

Date: 2026-03-19
Model: nllb-200-distilled-600m
Backend: CTranslate2 + float16
Direction: zh -> en

Why This Reassessment Exists

Earlier notes mixed two different ideas:

  • batch_size=64 was the highest-throughput point in the original product-title sweeps.
  • batch_size=16 was only a more conservative default candidate when trying to balance throughput with tail latency for online use.

That distinction was not carried forward clearly, so we re-checked the current long-text segmented workload instead of mechanically reusing the product-title conclusion.

Current Long-Text Workload Observed in Logs

The clearest apples-to-apples evidence came from repeated uncached requests of the same long Chinese input:

  • input length: about 3944 to 3966 chars
  • segmented into 60 pieces
  • target language: en
  • source language: zh

Log-Derived Comparison

batch_size=16 samples from `logs/translator-2026-03-19.log`:

  • reqid=181f00ae -> 1586.87 ms
  • reqid=d6c1213f -> 1732.95 ms
  • reqid=26f8acd1 -> 4745.32 ms

batch_size=64 samples from the same log:

  • reqid=28262f1e -> 752.96 ms
  • reqid=737fc848 -> 815.66 ms
  • reqid=8d05fa20 -> 835.25 ms
  • reqid=e29d2629 -> 3927.87 ms
  • reqid=c2b1df14 -> 4049.31 ms

Summary

For this ~3950 char / 60 segment workload:

  • batch_size=16
    • median end-to-end latency: 1732.95 ms
    • median segmentation_summary -> response: 1672 ms
  • batch_size=64
    • median end-to-end latency: 835.25 ms
    • median segmentation_summary -> response: 782 ms

This means the steady-state inference portion (segmentation summary -> response) dropped from 1672 ms to 782 ms, roughly half, after moving from 16 to 64.
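As a sanity check, the medians above can be recomputed directly from the raw samples listed in the log-derived comparison:

```python
import statistics

# End-to-end latencies (ms) copied from the log samples above.
bs16 = [1586.87, 1732.95, 4745.32]
bs64 = [752.96, 815.66, 835.25, 3927.87, 4049.31]

print(statistics.median(bs16))  # 1732.95
print(statistics.median(bs64))  # 835.25
```

With only 3 and 5 samples the median is the right summary here, since each group contains at least one multi-second contention outlier that would distort a mean.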

Important Environment Finding

This machine was not in an isolated benchmark state during the re-check:

  • the single T4 was shared with translator, embedding, CN-CLIP, and reranker processes
  • nvidia-smi showed about 15157 / 16384 MiB in use during the re-check

That explains the multi-second outliers in both the 16 and 64 groups. The outliers mostly appeared before the segmentation summary was logged, so they should be treated as shared-GPU contention noise rather than pure model execution time.
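For future re-checks it is worth recording memory headroom alongside each run. A minimal helper that parses the CSV form of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` (the sample line mirrors the ~15157 / 16384 MiB reading above):

```python
def memory_headroom_mib(csv_line: str) -> int:
    """Parse a 'used, total' MiB line from nvidia-smi CSV output
    and return the remaining free MiB."""
    used, total = (int(x.strip()) for x in csv_line.split(","))
    return total - used

# Sample reading from the shared-GPU re-check window:
print(memory_headroom_mib("15157, 16384"))  # 1227
```

About 1.2 GiB of headroom on a shared T4 is tight, which is another reason the larger-batch numbers from this window should not be trusted as a clean ranking.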

Current Config Drift

During this review, the live config had already been moved again to batch_size=256.

That larger value is not yet backed by the same quality of evidence:

  • for 60 segments, 256 cannot improve on 64 in any meaningful way because both already fit the whole request into one inference batch
  • for much larger requests such as 11847 chars and 180 segments, 256 may help, but we do not yet have a clean isolated comparison against 64
  • on a shared T4, larger batches also reduce memory headroom and make benchmarking less stable
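The "one inference batch" point is plain ceiling arithmetic over the segment count, which is why 256 cannot beat 64 on this workload:

```python
import math

def num_batches(segments: int, batch_size: int) -> int:
    """Number of inference batches needed for a segmented request."""
    return math.ceil(segments / batch_size)

# 60-segment request: batch_size 64 and 256 both need a single batch.
print(num_batches(60, 64), num_batches(60, 256))    # 1 1

# 180-segment request: 256 would collapse three batches into one,
# which is where it *might* help, pending an isolated comparison.
print(num_batches(180, 64), num_batches(180, 256))  # 3 1
```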

Recommendation

For the current shared-T4 deployment, keep the general NLLB default at:

  • batch_size=64
  • ct2_inter_threads=4
  • ct2_max_queued_batches=32
  • ct2_batch_type=examples
  • max_new_tokens=64
  • ct2_decoding_length_mode=source
  • ct2_decoding_length_extra=8
  • ct2_decoding_length_min=32
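The three decoding-length settings presumably combine into a per-segment max decoding length for CTranslate2. A sketch of the likely mapping; the function name and exact semantics are assumptions inferred from the config keys, not confirmed against the service code:

```python
def max_decoding_length(source_tokens: int,
                        mode: str = "source",
                        extra: int = 8,
                        minimum: int = 32) -> int:
    """Assumed semantics of ct2_decoding_length_*: with mode='source',
    cap decoding at the source token count plus a small margin,
    never going below the configured minimum."""
    if mode == "source":
        return max(minimum, source_tokens + extra)
    raise ValueError(f"unsupported mode: {mode}")

print(max_decoding_length(10))  # 32 (the floor applies for short segments)
print(max_decoding_length(50))  # 58
```

Under this reading, zh -> en segments get a source-proportional budget instead of a flat cap, which suits the short ~66-char segments this workload produces.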

Treat batch_size=128 or 256 as workload-specific experiments, not as the default baseline.

Best Practices Going Forward

  • Benchmark long-text segmented translation separately from product-title translation.
  • Use uncached repeated requests with the same long sample when checking single-request latency.
  • Split latency analysis into:
    • request -> segmentation summary
    • segmentation summary -> response
  • Do not treat shared-GPU results as a clean config ranking.
  • Before promoting a larger batch like 128 or 256 to default, re-run in a translator-only GPU window.
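The two-phase latency split above can be automated with a small log parser. The log line format here is hypothetical and would need to be adjusted to the real translator log:

```python
import re

# Hypothetical log format; adapt the pattern to the actual translator log.
LINE = re.compile(
    r"reqid=(?P<reqid>\w+) .*request_ms=(?P<req>\d+) "
    r"seg_summary_ms=(?P<seg>\d+) response_ms=(?P<resp>\d+)"
)

def split_phases(line: str):
    """Split one request's timeline into the two phases used above:
    request -> segmentation summary, and segmentation summary -> response."""
    m = LINE.search(line)
    if m is None:
        return None
    req, seg, resp = (int(m.group(k)) for k in ("req", "seg", "resp"))
    return {"reqid": m.group("reqid"),
            "request_to_seg_ms": seg - req,
            "seg_to_response_ms": resp - seg}

print(split_phases("reqid=28262f1e request_ms=0 seg_summary_ms=53 response_ms=835"))
```

Computing per-phase medians from this output makes it obvious whether an outlier came from pre-inference contention (first phase) or from decoding itself (second phase).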