ai-saas / saas-search

19 Mar, 2026

3 commits

4747e2f4 embedding performance ...

The instability is very likely real overload, but `lsof -i :6005 | wc -l
= 75` alone does not prove it. What does matter is the live shape of the
service: it is a single `uvicorn` worker on port `6005`, and the code
had one shared process handling both text and image requests, with image
work serialized behind a single lock. Under bursty image traffic,
requests could pile up and sit blocked with almost no useful tracing,
which matches the “only blocking observed” symptom.

The service now adds persistent log files, request IDs, per-request
request/response/failure logs, text microbatch dispatch logs, health
stats with active/rejected counts, and explicit overload admission
control. New knobs are `TEXT_MAX_INFLIGHT`, `IMAGE_MAX_INFLIGHT`, and
`EMBEDDING_OVERLOAD_STATUS_CODE`. Startup output now shows those limits
and log paths in
[scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L80).
I also added focused tests in
[tests/test_embedding_service_limits.py](/data/saas-search/tests/test_embedding_service_limits.py#L1).

What this means operationally:
- Text and image are still in one process, so this is not the final
  architecture.
- But image spikes will now be rejected quickly once the image lane is
  full instead of sitting around and consuming the worker pool.
- Logs will now show each request, each rejection, each microbatch
  dispatch, backend time, response time, and request ID.
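
The fast-reject behavior described above can be sketched as a non-blocking in-flight counter per lane. This is a minimal illustration, not the code in the commit: `InflightLimiter` and `handle` are hypothetical names, and the `503` default only stands in for whatever `EMBEDDING_OVERLOAD_STATUS_CODE` is set to.

```python
import threading


class InflightLimiter:
    """Hypothetical per-lane limiter: reject at once when the lane is full."""

    def __init__(self, max_inflight: int) -> None:
        self.max_inflight = max_inflight  # e.g. the value of IMAGE_MAX_INFLIGHT
        self.active = 0
        self.rejected = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Never waits: either a slot is taken now or a rejection is counted.
        with self._lock:
            if self.active >= self.max_inflight:
                self.rejected += 1
                return False
            self.active += 1
            return True

    def release(self) -> None:
        with self._lock:
            self.active -= 1

    def stats(self) -> dict:
        # The kind of active/rejected counters a health endpoint could expose.
        with self._lock:
            return {"active": self.active, "rejected": self.rejected,
                    "limit": self.max_inflight}


def handle(limiter: InflightLimiter, work, overload_status: int = 503):
    """Run work() if a slot is free; otherwise return the overload status.

    overload_status=503 is an assumed default for illustration only.
    """
    if not limiter.try_acquire():
        return overload_status, None
    try:
        return 200, work()
    finally:
        limiter.release()
```

With one limiter per lane, a full image lane returns the overload status straight away while the text lane keeps its own budget.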

Verification:
- Passed: `.venv/bin/python -m pytest -q
  tests/test_embedding_service_limits.py`
- I also ran a wider test command, but 3 failures came from pre-existing
  drift in
[tests/test_embedding_pipeline.py](/data/saas-search/tests/test_embedding_pipeline.py#L95),
where the tests still monkeypatch `embeddings.text_encoder.redis.Redis`
even though
[embeddings/text_encoder.py](/data/saas-search/embeddings/text_encoder.py#L1)
no longer imports `redis` that way.

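The default model for CLIP_AS_SERVICE has been switched to ViT-L-14, and
the configuration is now consolidated behind a single, changeable entry
point. The default now lives in CLIP_AS_SERVICE_MODEL_NAME in
embeddings/config.py (line 29) and is currently CN-CLIP/ViT-L-14.
scripts/start_cnclip_service.sh (line 37) reads this setting
automatically instead of hard-coding the default model in the script,
and it also supports temporary overrides through CNCLIP_MODEL_NAME and
--model-name. scripts/start_embedding_service.sh (line 29) and
embeddings/server.py (line 425) now also print the model information,
which makes it easier to check which configuration is actually in use.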
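
The override mechanism can be sketched as a small resolver. The real logic lives in a shell script, so this Python version is only an illustration. The precedence used here (`--model-name`, then `CNCLIP_MODEL_NAME`, then the config default) is an assumption, since the commit message does not state the order, and `resolve_model_name` is a hypothetical helper.

```python
import argparse
import os

# Stand-in for CLIP_AS_SERVICE_MODEL_NAME in embeddings/config.py.
CLIP_AS_SERVICE_MODEL_NAME = "CN-CLIP/ViT-L-14"


def resolve_model_name(argv, env=None):
    """Pick the model name from the CLI flag, the environment, or the config.

    Assumed precedence: --model-name > CNCLIP_MODEL_NAME > config default.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", default=None)
    # parse_known_args ignores flags this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return (args.model_name
            or env.get("CNCLIP_MODEL_NAME")
            or CLIP_AS_SERVICE_MODEL_NAME)
```

Printing the resolved value at startup, as the commit does for the service scripts, shows which of the three sources was actually used.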

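The documentation was updated as well, mainly
docs/CNCLIP_SERVICE说明文档.md (line 62) and embeddings/README.md (line
58): it now describes a "config is authoritative, overridable" mechanism
rather than a hard-coded model name. The related summary documents and
internal notes were also changed to config-driven wording.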

2026-03-19 12:27:05 +0800

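14e67b71 Batching after sentence splitting is now "split all inputs first, then batch by total segment count using the model's batch_size ...

... for inference", rather than chunking by the original number of
inputs first. In other words, if 100 requests become 150 segments after
splitting, then with batch_size=64 inference runs in three batches of
64 + 64 + 22, and the results are then merged according to the original
segmentation plan and restored to 100 outputs. The change is in
local_seq2seq.py (line 241) and local_ctranslate2.py (line 391).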
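
The flatten, batch, and merge flow can be sketched as follows. This is an illustration under assumptions, not the backend code: `translate_with_segment_batching` and `infer_batch` are hypothetical names, and the space joiner is just a default for the example.

```python
from typing import Callable, List


def translate_with_segment_batching(
    segment_plan: List[List[str]],
    infer_batch: Callable[[List[str]], List[str]],
    batch_size: int,
    joiner: str = " ",
) -> List[str]:
    """segment_plan[i] holds the segments produced from input i."""
    # 1) Flatten every segment from every input into one list.
    flat = [seg for segs in segment_plan for seg in segs]

    # 2) Batch by segment count, not by original input count.
    outputs: List[str] = []
    for start in range(0, len(flat), batch_size):
        outputs.extend(infer_batch(flat[start:start + batch_size]))

    # 3) Walk the plan again to restore one result per original input.
    merged: List[str] = []
    pos = 0
    for segs in segment_plan:
        merged.append(joiner.join(outputs[pos:pos + len(segs)]))
        pos += len(segs)
    return merged
```

For a plan with 100 inputs and 150 segments in total, `batch_size=64` makes `infer_batch` see batches of 64, 64, and 22 segments, and the function still returns 100 strings.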

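The logs now also carry the two layers of key information you asked for:
- Segmentation summary log: `Translation segmentation summary` prints
  the input count, non-empty input count, number of inputs that were
  split, total segment count, current batch_size, and statistics on how
  many segments each input produced; see local_seq2seq.py (line 216)
  and local_ctranslate2.py (line 366).
- Per-batch inference log: `Translation inference batch` prints the
  batch index, total batch count, number of segments in the batch,
  length statistics, and a preview of the first item. CTranslate2
  additionally prints `Translation model batch detail`, which adds token
  lengths and max_decoding_length; see local_ctranslate2.py (line 294).

I also added tests covering "batch after splitting" and "logs contain
the segmentation summary and per-batch inference logs", in
test_translation_local_backends.py (line 358).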

2026-03-19 10:54:30 +0800

294c3d0a Implement the first version of the basic "model-budget-aware smart sentence splitting" capability. ...

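Changes:
- Added splitting and budget utilities: translation/text_splitter.py
- Wired into the HF local backend:
  translation/backends/local_seq2seq.py (line 157)
- Wired into the CT2 local backend:
  translation/backends/local_ctranslate2.py (line 301)
- Added tests: tests/test_translation_local_backends.py

I first went through the limits that actually apply in the code; the key
settings are in config/config.yaml (line 133):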

- nllb-200-distilled-600m: max_input_length=256, max_new_tokens=64, with
  ct2_decoding_length_mode=source + extra=8. The conservative input
  budget computed from this configuration is now 56 tokens.
- opus-mt-zh-en: max_input_length=256, max_new_tokens=256. The
  conservative input budget is now 248 tokens.
- opus-mt-en-zh: same as above, also 248 tokens.
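
One formula that reproduces both budgets is "the smaller of the two limits, minus the extra decoding headroom". This is an inference from the numbers above, not a confirmed reading of translation/text_splitter.py; `conservative_input_budget` is a hypothetical name, and applying `extra=8` to the opus-mt models is an assumption drawn from the 248 figure.

```python
def conservative_input_budget(
    max_input_length: int,
    max_new_tokens: int,
    extra: int = 8,
) -> int:
    """Assumed rule: stay `extra` tokens under the tighter of the two limits.

    With decoding length tied to source length plus `extra`, an input longer
    than max_new_tokens - extra could need more output tokens than allowed.
    """
    return min(max_input_length, max_new_tokens) - extra


# nllb-200-distilled-600m: min(256, 64) - 8
nllb_budget = conservative_input_budget(256, 64)
# opus-mt-zh-en / opus-mt-en-zh: min(256, 256) - 8
opus_budget = conservative_input_budget(256, 256)
```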
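
The splitting strategy in this version is:
- Split on strong boundaries first: 。！？!?；;… plus newlines and the
  English period
- If that is not enough, split on weak boundaries: ，,、：:()（）[]【】/|
- Only if that is still not enough, split on whitespace
- As a last resort, hard-cut to the token budget
- Over-long inputs are translated segment by segment and then rejoined:
  by default without spaces when the target language is Chinese, and
  with spaces for English and similar languages, while trying not to
  split too finely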
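
The tiered fallback can be sketched as a recursive splitter. To stay self-contained it counts characters as a stand-in for tokens, whereas the real code budgets by model tokens. All names here (`split_to_budget`, `pack`, `join_translated`) are hypothetical, and treating every `.` as a boundary is a simplification.

```python
from typing import List, Sequence, Set

STRONG: Set[str] = set("。！？!?；;…\n.")
WEAK: Set[str] = set("，,、：:()（）[]【】/|")
SPACE: Set[str] = set(" \t")


def split_after(text: str, chars: Set[str]) -> List[str]:
    """Cut text after each boundary character, keeping the character."""
    pieces, buf = [], ""
    for ch in text:
        buf += ch
        if ch in chars:
            pieces.append(buf)
            buf = ""
    if buf:
        pieces.append(buf)
    return pieces


def split_to_budget(text: str, budget: int,
                    tiers: Sequence[Set[str]] = (STRONG, WEAK, SPACE)) -> List[str]:
    """Try strong, then weak, then whitespace boundaries, then hard-cut."""
    if len(text) <= budget:
        return [text] if text else []
    if not tiers:
        # Last resort: fixed-size hard cut (characters stand in for tokens).
        return [text[i:i + budget] for i in range(0, len(text), budget)]
    out: List[str] = []
    for piece in split_after(text, tiers[0]):
        out.extend(split_to_budget(piece, budget, tiers[1:]))
    return out


def pack(pieces: List[str], budget: int) -> List[str]:
    """Greedily merge neighbours that still fit, to avoid tiny segments."""
    packed, buf = [], ""
    for p in pieces:
        if buf and len(buf) + len(p) > budget:
            packed.append(buf)
            buf = p
        else:
            buf += p
    if buf:
        packed.append(buf)
    return packed


def join_translated(parts: List[str], target_lang: str) -> str:
    """No-space join for Chinese targets, space join otherwise (assumed codes)."""
    return ("" if target_lang.startswith("zh") else " ").join(parts)
```

Running `pack` after `split_to_budget` is one way to honor the "do not split too finely" goal: boundaries decide where cuts may happen, and packing decides how many of them are actually used.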

Verification:
- Ran: python3 -m compileall translation
- Passed: tests/test_translation_local_backends.py

2026-03-19 09:51:06 +0800

17 Mar, 2026

3 commits

1d6727ac trans

tangwang
2026-03-17 22:06:54 +0800
3eff49b7 trans nllb-200-distilled-600M performance improvement

tangwang
2026-03-17 21:29:18 +0800
0fd2f875 translate

tangwang
2026-03-17 19:21:34 +0800