ai-saas / saas-search

18 Mar, 2026

3 commits

Implemented CTranslate2 for the three local translation models and
switched the existing local_nllb / local_marian factories over to it.
The new runtime lives in local_ctranslate2.py, including HF->CT2
auto-conversion, float16 compute type mapping, Marian direction
handling, and NLLB target-prefix decoding. The service wiring is in
service.py (line 113), and the three model configs now point at explicit
ctranslate2-float16 dirs in config.yaml (line 133).
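
The NLLB target-prefix decoding mentioned above follows the pattern from the CTranslate2 documentation: the target language code is forced as the first decoded token and then stripped from the hypothesis. The following is a minimal sketch of that pattern, not the code in local_ctranslate2.py. The function names, the default language codes, and the device/compute_type choices are assumptions for illustration.

```python
from typing import List


def strip_target_prefix(hypothesis: List[str], tgt_lang: str) -> List[str]:
    """Drop the forced language token that NLLB emits as the first output token."""
    if hypothesis and hypothesis[0] == tgt_lang:
        return hypothesis[1:]
    return hypothesis


def translate_nllb(texts: List[str], model_dir: str, hf_name: str,
                   src_lang: str = "eng_Latn", tgt_lang: str = "zho_Hans") -> List[str]:
    """Hypothetical helper: translate a batch with a converted NLLB model."""
    # Imported lazily so strip_target_prefix stays usable without the GPU stack.
    import ctranslate2
    import transformers

    translator = ctranslate2.Translator(model_dir, device="cuda", compute_type="float16")
    tokenizer = transformers.AutoTokenizer.from_pretrained(hf_name, src_lang=src_lang)

    # CTranslate2 consumes token strings rather than token ids.
    sources = [tokenizer.convert_ids_to_tokens(tokenizer.encode(t)) for t in texts]
    results = translator.translate_batch(
        sources, target_prefix=[[tgt_lang]] * len(sources)
    )

    outputs = []
    for result in results:
        tokens = strip_target_prefix(result.hypotheses[0], tgt_lang)
        outputs.append(tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens)))
    return outputs
```

Marian models are direction-specific (one model per language pair), so they would not need the target prefix at all; only the NLLB path does.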

I also updated the setup path so this is usable end-to-end:
ctranslate2>=4.7.0 was added to requirements_translator_service.txt and
requirements.txt, the download script now supports pre-conversion in
download_translation_models.py (line 27), and the docs/config examples
were refreshed in translation/README.md. I installed ctranslate2 into
.venv-translator, pre-converted all three models, and the CT2 artifacts
are now already on disk:

models/translation/facebook/nllb-200-distilled-600M/ctranslate2-float16
models/translation/Helsinki-NLP/opus-mt-zh-en/ctranslate2-float16
models/translation/Helsinki-NLP/opus-mt-en-zh/ctranslate2-float16
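
The directory layout above suggests a convert-if-missing step. The following is a hedged sketch of how such HF->CT2 auto-conversion could look using the TransformersConverter class from ctranslate2. The helper names ct2_dir and ensure_converted are hypothetical, and passing the Hugging Face repo id straight to the converter is an assumption; the real download script may convert from a local snapshot instead.

```python
from pathlib import Path


def ct2_dir(models_root: str, hf_repo: str, compute_type: str = "float16") -> Path:
    """Mirror the layout listed above: <root>/<org>/<model>/ctranslate2-<type>."""
    return Path(models_root) / hf_repo / f"ctranslate2-{compute_type}"


def ensure_converted(models_root: str, hf_repo: str, compute_type: str = "float16") -> Path:
    """Return the CT2 model directory, converting the HF model first if needed."""
    out_dir = ct2_dir(models_root, hf_repo, compute_type)
    # A converted CTranslate2 model directory contains model.bin.
    if (out_dir / "model.bin").exists():
        return out_dir

    # Imported lazily: conversion needs ctranslate2 plus transformers/torch.
    from ctranslate2.converters import TransformersConverter

    converter = TransformersConverter(hf_repo)
    converter.convert(str(out_dir), quantization=compute_type, force=True)
    return out_dir


print(ct2_dir("models/translation", "facebook/nllb-200-distilled-600M").as_posix())
```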

Verification was solid. python3 -m compileall passed, direct
TranslationService smoke tests ran successfully in .venv-translator, and
the focused NLLB benchmark on the local GPU showed a clear win:

batch_size=16: HF 0.347s/batch, 46.1 items/s vs CT2 0.130s/batch, 123.0 items/s
batch_size=1: HF 0.396s/request vs CT2 0.126s/request

One caveat: translation quality on some very short phrases, especially
opus-mt-en-zh, still looks a bit rough in smoke tests, so I’d run your
real quality set before fully cutting over. If you want, I can take the
next step and update the benchmark script/report so you have a fresh
full CT2 performance report for all three models.
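
The NLLB benchmark figures quoted above can be cross-checked with a little arithmetic. This sketch only reuses the four latencies reported in the message; because those latencies are already rounded, the derived throughput can differ from the reported items/s in the last digit.

```python
BATCH = 16

hf_batch_s, ct2_batch_s = 0.347, 0.130  # seconds per batch at batch_size=16
hf_req_s, ct2_req_s = 0.396, 0.126      # seconds per request at batch_size=1


def items_per_second(batch_size: int, seconds_per_batch: float) -> float:
    """Throughput implied by a per-batch latency."""
    return batch_size / seconds_per_batch


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


hf_tput = items_per_second(BATCH, hf_batch_s)
ct2_tput = items_per_second(BATCH, ct2_batch_s)

print(f"HF  batch_size=16: {hf_tput:.1f} items/s")
print(f"CT2 batch_size=16: {ct2_tput:.1f} items/s")
print(f"speedup batch_size=16: {speedup(hf_batch_s, ct2_batch_s):.2f}x")
print(f"speedup batch_size=1:  {speedup(hf_req_s, ct2_req_s):.2f}x")
```

On these numbers CT2 is roughly 2.7x faster per batch at batch_size=16 and roughly 3.1x faster per request at batch_size=1.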

2026-03-18 23:15:46 +0800

2a6d9d76 Updated the benchmark script and docs so that “single request / single-stream batch / concurrency / ...

batch × concurrency matrix” are presented completely separately.

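The changes are in these places:

scripts/benchmark_translation_local_models.py: added --suite extended, which covers batch_size=1,4,8,16,32,64, concurrency=1,2,4,8,16,64, and a combination matrix restricted to batch_size * concurrency <= 128. Single-scenario mode now loads only the target model, so load_seconds is cleaner, and --disable-cache is also supported.
translation/README.md: split the performance section into three parts (batch_sweep, concurrency_sweep, and batch x concurrency matrix) and added the parameters, reproduction commands, and summary tables for this retest.
perf_reports/20260318/translation_local_models/README.md: added a summary of this supplementary round. The full results are in translation_local_models_extended_221846.md and translation_local_models_extended_221846.json.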
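
The combination matrix described for --suite extended can be enumerated as follows. This is a sketch of the stated constraint only, not the script's code: build_matrix is a hypothetical name, and the real script may additionally drop cells that the batch and concurrency sweeps already cover.

```python
from itertools import product

BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCIES = [1, 2, 4, 8, 16, 64]
MAX_IN_FLIGHT = 128  # cap on batch_size * concurrency


def build_matrix(batch_sizes, concurrencies, cap):
    """Keep only (batch_size, concurrency) pairs whose product stays within the cap."""
    return [(b, c) for b, c in product(batch_sizes, concurrencies) if b * c <= cap]


matrix = build_matrix(BATCH_SIZES, CONCURRENCIES, MAX_IN_FLIGHT)
for batch_size, concurrency in matrix:
    print(f"batch_size={batch_size:<3} concurrency={concurrency}")
```

Under this constraint alone, 25 of the 36 possible pairs remain; for example (64, 2) is kept while (64, 4) is excluded.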
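
The core conclusions of this supplementary round are clear:

For online single requests, look at concurrency_sweep, i.e. the table with batch_size fixed at 1.
For offline batch throughput, look at batch_sweep: the highest raw throughput in all 4 directions occurs at batch_size=64, but batch_size=16 still looks like the more balanced default.
The current local seq2seq backend holds a per-model lock, so raising client concurrency barely increases throughput and mainly turns queueing time into a higher p95. Concurrency is therefore more of a “latency budget” question than a way to “scale throughput”.
In this round, the fastest model for online single requests was opus-mt-zh-en; the slowest, and the one whose latency grew most under concurrency, was nllb-200-distilled-600m en->zh.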
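
The point about the per-model lock can be illustrated with a toy queueing model: if only one request runs at a time, a wave of N concurrent requests finishes one after another, so throughput stays flat while tail latency grows with N. This is a deliberately simplified sketch, not a measurement; SERVICE_S is a hypothetical service time and all function names are illustrative.

```python
import math
from typing import List


def serialized_latencies(concurrency: int, service_s: float) -> List[float]:
    """Latency of each request when `concurrency` requests arrive together
    and a single lock lets only one of them run at a time."""
    return [service_s * (i + 1) for i in range(concurrency)]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput(concurrency: int, service_s: float) -> float:
    """Requests completed per second for one wave of concurrent requests."""
    return concurrency / (concurrency * service_s)


SERVICE_S = 0.1  # hypothetical per-request service time, not a measured value

for c in (1, 4, 16):
    latencies = serialized_latencies(c, SERVICE_S)
    print(
        f"concurrency={c:<2} "
        f"throughput={throughput(c, SERVICE_S):.1f} req/s "
        f"p95={percentile(latencies, 95):.2f}s"
    )
```

In this model throughput stays at 10 requests/s for every concurrency level, while p95 rises from 0.1 s to 0.4 s to 1.6 s, which is the "latency budget" behaviour described above.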

2026-03-18 22:49:15 +0800

cd4ce66d trans logs

tangwang
2026-03-18 20:32:37 +0800

17 Mar, 2026

5 commits

1d6727ac trans

tangwang
2026-03-17 22:06:54 +0800

3eff49b7 trans nllb-200-distilled-600M performance improvement

tangwang
2026-03-17 21:29:18 +0800

00471f80 trans

tangwang
2026-03-17 20:13:32 +0800

0fd2f875 translate

tangwang
2026-03-17 19:21:34 +0800

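5e4dc8e4 Refactored the translation architecture as “one translation service + ...

multiple independent translation capabilities”. The business side no
longer treats translation as a provider choice: QueryParser and indexer
both call through the translator service client on port 6006, and the
actual capability selection, enable switches, and model + scene routing
are all consolidated on the server side and in the new translation/
directory.

The core changes are in config/services_config.py,
providers/translation.py, api/translator_app.py, config/config.yaml, and
the new translation/service.py. Configuration moved from the old
services.translation.provider/providers to service_url + default_model +
default_scene + capabilities, and each capability can be switched on
independently via enabled. The server side gained unified backend
management with lazy loading. The real implementations now live in
translation/backends/qwen_mt.py, translation/backends/llm.py, and
translation/backends/deepl.py, while the old query/qwen_mt_translate.py,
query/llm_translate.py, and query/deepl_provider.py keep only
compatibility exports. On the API side, /translate now supports scene as
the standard parameter, context remains available as a compatibility
alias, and the health check returns the default model, default scene,
and enabled capabilities.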
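
The configuration shape, lazy backend loading, and scene/context aliasing described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in translation/service.py: the class and function names, the capability names (taken from the backend file names), the scene names, and the URL are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Capability:
    enabled: bool = True


@dataclass
class TranslationConfig:
    service_url: str
    default_model: str
    default_scene: str
    capabilities: Dict[str, Capability] = field(default_factory=dict)


class BackendRegistry:
    """Builds each backend on first use and caches it (lazy loading)."""

    def __init__(self, config: TranslationConfig,
                 factories: Dict[str, Callable[[], Any]]):
        self._config = config
        self._factories = factories
        self._loaded: Dict[str, Any] = {}

    def get(self, model: Optional[str] = None) -> Any:
        name = model or self._config.default_model
        capability = self._config.capabilities.get(name)
        if capability is None or not capability.enabled:
            raise ValueError(f"translation capability not enabled: {name}")
        if name not in self._loaded:
            self._loaded[name] = self._factories[name]()
        return self._loaded[name]


def resolve_scene(payload: dict, config: TranslationConfig) -> str:
    """Prefer `scene`; accept `context` as the compatibility alias."""
    return payload.get("scene") or payload.get("context") or config.default_scene


# Hypothetical values for illustration only.
config = TranslationConfig(
    service_url="http://localhost:6006",
    default_model="qwen_mt",
    default_scene="general",
    capabilities={"qwen_mt": Capability(), "deepl": Capability(enabled=False)},
)
registry = BackendRegistry(config, {"qwen_mt": lambda: object(), "deepl": lambda: object()})

backend = registry.get()  # built on first call
assert registry.get() is backend  # reused afterwards
print(resolve_scene({"context": "search_query"}, config))
```

Keeping the enabled check inside the registry means a disabled capability is never instantiated, which matches the idea that selection and switches live on the server side rather than in callers.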

2026-03-17 15:50:53 +0800