  # NLLB T4 Long-Text Reassessment
  
  Date: 2026-03-19
  Model: `nllb-200-distilled-600m`
  Backend: `CTranslate2 + float16`
  Direction: `zh -> en`
  
  ## Why This Reassessment Exists
  
  Earlier notes mixed two different ideas:
  
  - `batch_size=64` was the highest-throughput point in the original product-title sweeps.
- `batch_size=16` was only a more conservative default candidate, chosen to balance throughput with tail latency for online use.
  
That distinction was not carried forward clearly, so this reassessment re-measures the current long-text segmented workload directly instead of mechanically reusing the product-title conclusion.
  
  ## Current Long-Text Workload Observed in Logs
  
  The clearest apples-to-apples evidence came from repeated uncached requests of the same long Chinese input:
  
  - input length: about `3944` to `3966` chars
  - segmented into `60` pieces
  - target language: `en`
  - source language: `zh`
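
The service's actual segmenter is not visible in the logs; purely to illustrate how a `~3950` char input can end up as roughly `60` pieces, here is a minimal sketch that splits on Chinese sentence punctuation and folds short sentences into bounded pieces (the 80-char cap is an arbitrary assumption, not the service's real setting):

```python
import re

def segment_zh(text: str, max_chars: int = 80) -> list[str]:
    """Split Chinese text on sentence-ending punctuation, then fold
    runs of short sentences into pieces of at most max_chars."""
    # Lookbehind split keeps each delimiter attached to its sentence.
    sentences = [s for s in re.split(r"(?<=[。！？；])", text) if s]
    pieces: list[str] = []
    buf = ""
    for s in sentences:
        if buf and len(buf) + len(s) > max_chars:
            pieces.append(buf)
            buf = s
        else:
            buf += s
    if buf:
        pieces.append(buf)
    return pieces

sample = "这是第一句。这是第二句！第三句呢？" * 10
print(len(segment_zh(sample)))
```

A ~3950-char input with ~66 chars per sentence group would land near 60 pieces under a cap in this range.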
  
  ### Log-Derived Comparison
  
  `batch_size=16` samples from [`logs/translator-2026-03-19.log`](/data/saas-search/logs/translator-2026-03-19.log):
  
  - `reqid=181f00ae` -> `1586.87 ms`
  - `reqid=d6c1213f` -> `1732.95 ms`
  - `reqid=26f8acd1` -> `4745.32 ms`
  
  `batch_size=64` samples from the same log:
  
  - `reqid=28262f1e` -> `752.96 ms`
  - `reqid=737fc848` -> `815.66 ms`
  - `reqid=8d05fa20` -> `835.25 ms`
  - `reqid=e29d2629` -> `3927.87 ms`
  - `reqid=c2b1df14` -> `4049.31 ms`
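
The medians quoted in the summary come straight from these samples; the median is used deliberately, so the multi-second contention outliers do not skew the comparison. A minimal reproduction:

```python
from statistics import median

# End-to-end latencies (ms) copied from the log samples above.
bs16 = [1586.87, 1732.95, 4745.32]
bs64 = [752.96, 815.66, 835.25, 3927.87, 4049.31]

print(median(bs16))  # 1732.95
print(median(bs64))  # 835.25
```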
  
  ### Summary
  
  For this `~3950 char / 60 segment` workload:
  
  - `batch_size=16`
    - median end-to-end latency: `1732.95 ms`
    - median `segmentation_summary -> response`: `1672 ms`
  - `batch_size=64`
    - median end-to-end latency: `835.25 ms`
    - median `segmentation_summary -> response`: `782 ms`
  
This means the steady-state inference portion (`segmentation_summary -> response`) roughly halved, from `1672 ms` to `782 ms`, after moving from `16` to `64`.
  
  ## Important Environment Finding
  
This machine was not in an isolated benchmark state during the re-check:
  
  - the single T4 was shared with translator, embedding, CN-CLIP, and reranker processes
  - `nvidia-smi` showed about `15157 / 16384 MiB` in use during the re-check
  
  That explains the multi-second outliers in both the `16` and `64` groups. The outliers mainly appeared before the segmentation summary log, so they should be treated as shared-GPU contention noise, not pure model execution time.
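
A pre-run headroom check makes this contention visible before benchmarking. The `nvidia-smi` query flags below are standard; the helper itself is just a sketch:

```python
import subprocess

def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse a 'used, total' CSV line (MiB) from nvidia-smi."""
    used, total = (int(v) for v in line.strip().split(","))
    return used, total

def gpu_memory_mib(gpu_id: int = 0) -> tuple[int, int]:
    """Return (used, total) MiB for one GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits", f"--id={gpu_id}"],
        text=True,
    )
    return parse_memory_line(out)

# The re-check above saw roughly this reading on the shared T4:
used, total = parse_memory_line("15157, 16384")
print(f"{used}/{total} MiB used, {total - used} MiB free")
```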
  
  ## Current Config Drift
  
  During this review, the live config had already been moved again to `batch_size=256`.
  
  That larger value is not yet backed by the same quality of evidence:
  
  - for `60` segments, `256` cannot improve on `64` in any meaningful way because both already fit the whole request into one inference batch
  - for much larger requests such as `11847` chars and `180` segments, `256` may help, but we do not yet have a clean isolated comparison against `64`
  - on a shared T4, larger batches also reduce memory headroom and make benchmarking less stable
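
The first bullet is plain ceiling arithmetic over the segment count; a quick check:

```python
from math import ceil

def batches_needed(num_segments: int, batch_size: int) -> int:
    # With ct2_batch_type=examples each segment is one example, so a
    # request needs ceil(segments / batch_size) inference batches.
    return ceil(num_segments / batch_size)

# 60-segment request: 64 and 256 both fit it in a single batch.
print(batches_needed(60, 64), batches_needed(60, 256))    # 1 1
# 180-segment request: 256 would save two batch launches over 64.
print(batches_needed(180, 64), batches_needed(180, 256))  # 3 1
```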
  
  ## Recommendation
  
  For the current shared-T4 deployment, keep the general NLLB default at:
  
  - `batch_size=64`
  - `ct2_inter_threads=4`
  - `ct2_max_queued_batches=32`
  - `ct2_batch_type=examples`
  - `max_new_tokens=64`
  - `ct2_decoding_length_mode=source`
  - `ct2_decoding_length_extra=8`
  - `ct2_decoding_length_min=32`
  
  Treat `batch_size=128` or `256` as workload-specific experiments, not as the default baseline.
  
  ## Best Practices Going Forward
  
  - Benchmark long-text segmented translation separately from product-title translation.
  - Use uncached repeated requests with the same long sample when checking single-request latency.
  - Split latency analysis into:
    - `request -> segmentation summary`
    - `segmentation summary -> response`
  - Do not treat shared-GPU results as a clean config ranking.
  - Before promoting a larger batch like `128` or `256` to default, re-run in a translator-only GPU window.
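
The recommended two-stage split is easy to compute once per-request timestamps are available; the timestamp values and field names in this sketch are illustrative, not taken from the actual log format:

```python
from datetime import datetime

def split_latency(request_ts: str, seg_summary_ts: str,
                  response_ts: str) -> dict[str, float]:
    """Break end-to-end latency into the two stages recommended
    above, from ISO-8601 timestamp strings."""
    t0, t1, t2 = (datetime.fromisoformat(t)
                  for t in (request_ts, seg_summary_ts, response_ts))
    return {
        "request_to_segmentation_ms": (t1 - t0).total_seconds() * 1000.0,
        "segmentation_to_response_ms": (t2 - t1).total_seconds() * 1000.0,
    }

# Illustrative request whose inference stage matches the 782 ms median:
print(split_latency("2026-03-19T10:00:00.000",
                    "2026-03-19T10:00:00.053",
                    "2026-03-19T10:00:00.835"))
```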