
NLLB T4 Long-Text Reassessment

Date: 2026-03-19
Model: nllb-200-distilled-600m
Backend: CTranslate2 + float16
Direction: zh -> en

Why This Reassessment Exists

Earlier notes mixed two different ideas:

  • batch_size=64 was the highest-throughput point in the original product-title sweeps.
  • batch_size=16 was only a more conservative default candidate when trying to balance throughput with tail latency for online use.

That distinction was not carried forward clearly, so we re-checked the current long-text segmented workload instead of mechanically reusing the product-title conclusion.

Current Long-Text Workload Observed in Logs

The clearest apples-to-apples evidence came from repeated uncached requests of the same long Chinese input:

  • input length: about 3944 to 3966 chars
  • segmented into 60 pieces
  • target language: en
  • source language: zh

Log-Derived Comparison

batch_size=16 samples from `logs/translator-2026-03-19.log`:

  • reqid=181f00ae -> 1586.87 ms
  • reqid=d6c1213f -> 1732.95 ms
  • reqid=26f8acd1 -> 4745.32 ms

batch_size=64 samples from the same log:

  • reqid=28262f1e -> 752.96 ms
  • reqid=737fc848 -> 815.66 ms
  • reqid=8d05fa20 -> 835.25 ms
  • reqid=e29d2629 -> 3927.87 ms
  • reqid=c2b1df14 -> 4049.31 ms

Summary

For this ~3950 char / 60 segment workload:

  • batch_size=16
    • median end-to-end latency: 1732.95 ms
    • median segmentation_summary -> response: 1672 ms
  • batch_size=64
    • median end-to-end latency: 835.25 ms
    • median segmentation_summary -> response: 782 ms

This means the steady-state inference portion (segmentation summary -> response) dropped from 1672 ms to 782 ms, roughly half, after moving from 16 to 64.
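As a sanity check, the medians above can be recomputed directly from the raw samples listed in the log-derived comparison:

```python
import statistics

# End-to-end latencies (ms) copied from the log samples above.
bs16 = [1586.87, 1732.95, 4745.32]
bs64 = [752.96, 815.66, 835.25, 3927.87, 4049.31]

print(statistics.median(bs16))  # 1732.95
print(statistics.median(bs64))  # 835.25
```

With only 3 and 5 samples the median is the right summary here, since each group contains at least one multi-second contention outlier that would distort a mean.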

Important Environment Finding

This machine was not in an isolated benchmark state during the re-check:

  • the single T4 was shared with translator, embedding, CN-CLIP, and reranker processes
  • nvidia-smi showed about 15157 / 16384 MiB in use during the re-check

That explains the multi-second outliers in both the 16 and 64 groups. The outliers mostly appeared before the segmentation summary was logged, so they should be treated as shared-GPU contention noise rather than pure model execution time.
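For future re-checks it is worth recording memory headroom alongside each run. A minimal helper that parses the CSV form of `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` (the sample line mirrors the ~15157 / 16384 MiB reading above):

```python
def memory_headroom_mib(csv_line: str) -> int:
    """Parse a 'used, total' MiB line from nvidia-smi CSV output
    and return the remaining free MiB."""
    used, total = (int(x.strip()) for x in csv_line.split(","))
    return total - used

# Sample reading from the shared-GPU re-check window:
print(memory_headroom_mib("15157, 16384"))  # 1227
```

About 1.2 GiB of headroom on a shared T4 is tight, which is another reason the larger-batch numbers from this window should not be trusted as a clean ranking.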

Current Config Drift

During this review, the live config had already been moved again to batch_size=256.

That larger value is not yet backed by the same quality of evidence:

  • for 60 segments, 256 cannot improve on 64 in any meaningful way because both already fit the whole request into one inference batch
  • for much larger requests such as 11847 chars and 180 segments, 256 may help, but we do not yet have a clean isolated comparison against 64
  • on a shared T4, larger batches also reduce memory headroom and make benchmarking less stable
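The "one inference batch" point is plain ceiling arithmetic over the segment count, which is why 256 cannot beat 64 on this workload:

```python
import math

def num_batches(segments: int, batch_size: int) -> int:
    """Number of inference batches needed for a segmented request."""
    return math.ceil(segments / batch_size)

# 60-segment request: batch_size 64 and 256 both need a single batch.
print(num_batches(60, 64), num_batches(60, 256))    # 1 1

# 180-segment request: 256 would collapse three batches into one,
# which is where it *might* help, pending an isolated comparison.
print(num_batches(180, 64), num_batches(180, 256))  # 3 1
```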

Recommendation

For the current shared-T4 deployment, keep the general NLLB default at:

  • batch_size=64
  • ct2_inter_threads=4
  • ct2_max_queued_batches=32
  • ct2_batch_type=examples
  • max_new_tokens=64
  • ct2_decoding_length_mode=source
  • ct2_decoding_length_extra=8
  • ct2_decoding_length_min=32
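The three decoding-length settings presumably combine into a per-segment max decoding length for CTranslate2. A sketch of the likely mapping; the function name and exact semantics are assumptions inferred from the config keys, not confirmed against the service code:

```python
def max_decoding_length(source_tokens: int,
                        mode: str = "source",
                        extra: int = 8,
                        minimum: int = 32) -> int:
    """Assumed semantics of ct2_decoding_length_*: with mode='source',
    cap decoding at the source token count plus a small margin,
    never going below the configured minimum."""
    if mode == "source":
        return max(minimum, source_tokens + extra)
    raise ValueError(f"unsupported mode: {mode}")

print(max_decoding_length(10))  # 32 (the floor applies for short segments)
print(max_decoding_length(50))  # 58
```

Under this reading, zh -> en segments get a source-proportional budget instead of a flat cap, which suits the short ~66-char segments this workload produces.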

Treat batch_size=128 or 256 as workload-specific experiments, not as the default baseline.

Best Practices Going Forward

  • Benchmark long-text segmented translation separately from product-title translation.
  • Use uncached repeated requests with the same long sample when checking single-request latency.
  • Split latency analysis into:
    • request -> segmentation summary
    • segmentation summary -> response
  • Do not treat shared-GPU results as a clean config ranking.
  • Before promoting a larger batch like 128 or 256 to default, re-run in a translator-only GPU window.
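The two-phase latency split above can be automated with a small log parser. The log line format here is hypothetical and would need to be adjusted to the real translator log:

```python
import re

# Hypothetical log format; adapt the pattern to the actual translator log.
LINE = re.compile(
    r"reqid=(?P<reqid>\w+) .*request_ms=(?P<req>\d+) "
    r"seg_summary_ms=(?P<seg>\d+) response_ms=(?P<resp>\d+)"
)

def split_phases(line: str):
    """Split one request's timeline into the two phases used above:
    request -> segmentation summary, and segmentation summary -> response."""
    m = LINE.search(line)
    if m is None:
        return None
    req, seg, resp = (int(m.group(k)) for k in ("req", "seg", "resp"))
    return {"reqid": m.group("reqid"),
            "request_to_seg_ms": seg - req,
            "seg_to_response_ms": resp - seg}

print(split_phases("reqid=28262f1e request_ms=0 seg_summary_ms=53 response_ms=835"))
```

Computing per-phase medians from this output makes it obvious whether an outlier came from pre-inference contention (first phase) or from decoding itself (second phase).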