25 Mar, 2026
9 commits
-
Two configurations, four combinations:
- backend: qwen3_vllm | qwen3_vllm_score
- instruction_format: compact | standard

Running `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` produces the performance report. Average latency (ms, client-side wall clock for POST /rerank, --seed 99):

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---|---|---|---|---|---|---|---|
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |

Summary: on the local T4, with the current vLLM and the YAML above (max_model_len=160, infer_batch_size=100, etc.), compact is faster than standard for both backends; the fastest overall is qwen3_vllm + compact (n=1000 ≈ 2.16 s), and the slowest is qwen3_vllm_score + standard (≈ 2.93 s). The ordering may change on other GPUs or vLLM versions.
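For reference, a minimal sketch of how the client-side wall-clock number in the table can be measured, assuming a local HTTP /rerank endpoint that accepts a JSON body with `query` and `documents`; the URL, port, and payload fields are assumptions, not the actual interface of benchmark_reranker_random_titles.py:

```python
import statistics
import time

import requests


def rerank_latency_ms(url: str, query: str, titles: list[str], repeat: int = 5) -> float:
    """POST the same /rerank request `repeat` times and return the mean wall-clock latency in ms."""
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        resp = requests.post(url, json={"query": query, "documents": titles})
        resp.raise_for_status()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


# Example: one cell of the table (n=1000 random titles, 5 repeats).
# rerank_latency_ms("http://localhost:8000/rerank", "连衣裙", titles_1000)
```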
-
@config/dictionaries/style_intent_color.csv @config/dictionaries/style_intent_size.csv @query/style_intent.py @search/sku_intent_selector.py
1. The two CSV dictionaries have three columns:
   - English keywords
   - Chinese keywords
   - standard attribute name words
   All three columns allow comma-separated values. The added third column is used against product attributes and holds the standard English names.
2. For intent detection, Chinese keywords are matched against the Chinese translation of the query; if no Chinese translation exists, match against the original query. English keywords work the same way.
3. For SKU selection, match against each SKU's attribute names. The matching rules are drastically simplified and performance-optimized:
   1) The text match only checks whether the normalized attribute value contains one of the dictionary's third-column "standard attribute name words"; if it does, the match succeeds. Stop at the first successful match. If none succeed, do not fall back to vector matching afterwards. Vector matching, bidirectional matching, and other complex logic are deprecated for now (see the sketch below).
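A minimal sketch of the simplified containment rule, assuming each dictionary row has already been parsed into a list of third-column standard attribute name words; the normalization and field names are illustrative, not the actual code in search/sku_intent_selector.py:

```python
def normalize(text: str) -> str:
    """Illustrative normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def select_sku(skus: list[dict], standard_terms: list[str]) -> dict | None:
    """Return the first SKU whose normalized attribute value contains any of the
    third-column standard attribute name words; None means no match, and no
    vector-matching fallback is attempted."""
    terms = [normalize(t) for t in standard_terms]
    for sku in skus:
        value = normalize(str(sku.get("attribute_value", "")))
        if any(term in value for term in terms):
            return sku
    return None


# select_sku([{"attribute_value": "Color: Navy Blue"}], ["navy", "dark blue"])
```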
24 Mar, 2026
3 commits
-
1. Added a filter/demotion dictionary: when an independent query token matches a configured trigger word, products containing certain tokens are filtered out (for example, fitted/修身 filters out products containing 宽松, loose, relaxed, baggy, slouchy, etc.). 2. The reranker's query now uses the translated text.
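A minimal sketch of the trigger-word filter, assuming the dictionary is loaded as a mapping from trigger token to the product tokens to exclude; the dictionary contents and function names here are illustrative:

```python
# Illustrative dictionary: query trigger token -> product tokens to filter out.
FILTER_TERMS = {
    "fitted": {"loose", "relaxed", "baggy", "slouchy", "宽松"},
    "修身": {"loose", "relaxed", "baggy", "slouchy", "宽松"},
}


def excluded_tokens(query_tokens: list[str]) -> set[str]:
    """Collect the tokens to filter when an independent query token hits a trigger word."""
    excluded: set[str] = set()
    for token in query_tokens:
        excluded |= FILTER_TERMS.get(token, set())
    return excluded


def keep_product(product_tokens: list[str], query_tokens: list[str]) -> bool:
    """Drop (or demote) products that contain any excluded token."""
    return not (set(product_tokens) & excluded_tokens(query_tokens))
```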
23 Mar, 2026
5 commits
-
combined_fields + best_fields + phrase_boost
-
Each clause becomes a named bool query with the following structure:
- must: combined_fields
- should: weighted best_fields and phrase clauses

The main change is in search/es_query_builder.py, but this adjustment reuses the existing language-routing design and does not introduce one-off branches. The extra should-clause weights are now configuration-driven via config/schema.py, config/loader.py, search/searcher.py, and config/config.yaml, keeping the structure centrally managed.
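A minimal sketch of the resulting clause shape as a plain Python dict, roughly what search/es_query_builder.py emits; the field list, boosts, and `_name` value are placeholders, not the actual weights from config/config.yaml:

```python
def build_clause(query_text: str, fields: list[str],
                 best_fields_boost: float = 1.5, phrase_boost: float = 2.0) -> dict:
    """Named bool clause: must = combined_fields recall, should = weighted
    best_fields and phrase scoring boosts."""
    return {
        "bool": {
            "_name": "field_group_clause",
            "must": [
                {"combined_fields": {"query": query_text, "fields": fields}},
            ],
            "should": [
                {"multi_match": {"query": query_text, "fields": fields,
                                 "type": "best_fields", "boost": best_fields_boost}},
                {"multi_match": {"query": query_text, "fields": fields,
                                 "type": "phrase", "boost": phrase_boost}},
            ],
        }
    }


# build_clause("red dress", ["title_en^3", "description_en"])
```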
22 Mar, 2026
4 commits
21 Mar, 2026
2 commits
20 Mar, 2026
6 commits
19 Mar, 2026
7 commits
-
- Text and image embedding are now split into separate services/processes, while still keeping a single replica as requested. The split lives in [embeddings/server.py](/data/saas-search/embeddings/server.py#L112), [config/services_config.py](/data/saas-search/config/services_config.py#L68), [providers/embedding.py](/data/saas-search/providers/embedding.py#L27), and the start scripts [scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L36), [scripts/start_embedding_text_service.sh](/data/saas-search/scripts/start_embedding_text_service.sh), [scripts/start_embedding_image_service.sh](/data/saas-search/scripts/start_embedding_image_service.sh).
- Independent admission control is in place now: text and image have separate inflight limits, and image can be kept much stricter than text. The request handling, reject path, `/health`, and `/ready` are in [embeddings/server.py](/data/saas-search/embeddings/server.py#L613), [embeddings/server.py](/data/saas-search/embeddings/server.py#L786), and [embeddings/server.py](/data/saas-search/embeddings/server.py#L1028).
- I checked the Redis embedding cache. It did exist, but there was a real flaw: cache keys did not distinguish `normalize=true` from `normalize=false`. I fixed that in [embeddings/cache_keys.py](/data/saas-search/embeddings/cache_keys.py#L6), and both text and image now use the same normalize-aware keying (sketched at the end of this entry). I also added service-side BF16 cache hits that short-circuit before the model lane, so repeated requests no longer get throttled behind image inference.

**What This Means**
- Image pressure no longer blocks text, because they are on different ports/processes.
- Repeated text/image requests now return from Redis without consuming model capacity.
- Over-capacity requests are rejected quickly instead of sitting blocked.
- I did not add a load balancer or multi-replica HA, per your GPU constraint. I also did not build Grafana/Prometheus dashboards in this pass, but `/health` now exposes the metrics needed to wire them.

**Validation**
- Tests passed: `.venv/bin/python -m pytest -q tests/test_embedding_pipeline.py tests/test_embedding_service_limits.py` -> `10 passed`
- Stress test tool updates are in [scripts/perf_api_benchmark.py](/data/saas-search/scripts/perf_api_benchmark.py#L155)
- Fresh benchmark on split text service `6105`: 535 requests / 3s, 100% success, `174.56 rps`, avg `88.48 ms`
- Fresh benchmark on split image service `6108`: 1213 requests / 3s, 100% success, `403.32 rps`, avg `9.64 ms`
- Live health after the run showed cache hits and non-zero cache-hit latency accounting:
  - text `avg_latency_ms=4.251`
  - image `avg_latency_ms=1.462`
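A minimal sketch of the normalize-aware keying described above, assuming keys combine a content hash with the normalize flag; the prefix and hashing scheme are illustrative, not the actual layout of embeddings/cache_keys.py:

```python
import hashlib


def embedding_cache_key(kind: str, model: str, content: bytes, normalize: bool) -> str:
    """Build a Redis key that distinguishes normalize=true from normalize=false
    for otherwise identical text/image inputs."""
    digest = hashlib.sha256(content).hexdigest()
    return f"emb:{kind}:{model}:norm={int(normalize)}:{digest}"


# Same input, different normalize flag -> different cache entries:
# embedding_cache_key("text", "cn-clip-vit-l-14", b"red dress", normalize=True)
# embedding_cache_key("text", "cn-clip-vit-l-14", b"red dress", normalize=False)
```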
-
The instability is very likely real overload, but `lsof -i :6005 | wc -l = 75` alone does not prove it. What does matter is the live shape of the service: it is a single `uvicorn` worker on port `6005`, and the code had one shared process handling both text and image requests, with image work serialized behind a single lock. Under bursty image traffic, requests could pile up and sit blocked with almost no useful tracing, which matches the "only blocking observed" symptom.

The service now adds persistent log files, request IDs, per-request request/response/failure logs, text microbatch dispatch logs, health stats with active/rejected counts, and explicit overload admission control (a minimal sketch of this pattern follows at the end of this entry). New knobs are `TEXT_MAX_INFLIGHT`, `IMAGE_MAX_INFLIGHT`, and `EMBEDDING_OVERLOAD_STATUS_CODE`. Startup output now shows those limits and log paths in [scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L80). I also added focused tests in [tests/test_embedding_service_limits.py](/data/saas-search/tests/test_embedding_service_limits.py#L1).

What this means operationally:
- Text and image are still in one process, so this is not the final architecture.
- But image spikes will now be rejected quickly once the image lane is full instead of sitting around and consuming the worker pool.
- Logs will now show each request, each rejection, each microbatch dispatch, backend time, response time, and request ID.

Verification:
- Passed: `.venv/bin/python -m pytest -q tests/test_embedding_service_limits.py`
- I also ran a wider test command, but 3 failures came from pre-existing drift in [tests/test_embedding_pipeline.py](/data/saas-search/tests/test_embedding_pipeline.py#L95), where the tests still monkeypatch `embeddings.text_encoder.redis.Redis` even though [embeddings/text_encoder.py](/data/saas-search/embeddings/text_encoder.py#L1) no longer imports `redis` that way.

The default model for CLIP_AS_SERVICE has been switched to ViT-L-14, and that setting is now consolidated into a single configurable entry point. The default lives in CLIP_AS_SERVICE_MODEL_NAME in embeddings/config.py (line 29), currently CN-CLIP/ViT-L-14; scripts/start_cnclip_service.sh (line 37) reads this config automatically instead of hard-coding the default model in the script, and still supports temporary overrides via CNCLIP_MODEL_NAME and --model-name. scripts/start_embedding_service.sh (line 29) and embeddings/server.py (line 425) also print the model info now, making it easier to check which configuration is actually connected. The docs were updated as well, mainly docs/CNCLIP_SERVICE说明文档.md (line 62) and embeddings/README.md (line 58): they now describe a "configuration is authoritative, can be overridden" mechanism instead of hard-coding a model name; related summary docs and internal notes were also switched to config-driven wording.
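A minimal sketch of the per-lane admission control described above, assuming a counter-style limiter per lane and a configurable overload status code; the names mirror the new knobs, but the code and default limits are illustrative, not the server's actual implementation:

```python
import os
from contextlib import asynccontextmanager

# Default limits here are illustrative; the real defaults live in the service config.
TEXT_MAX_INFLIGHT = int(os.getenv("TEXT_MAX_INFLIGHT", "32"))
IMAGE_MAX_INFLIGHT = int(os.getenv("IMAGE_MAX_INFLIGHT", "4"))
OVERLOAD_STATUS = int(os.getenv("EMBEDDING_OVERLOAD_STATUS_CODE", "503"))


class OverloadError(Exception):
    """Raised when a lane is already at its inflight limit."""
    def __init__(self, status_code: int):
        super().__init__(f"lane full, reply with HTTP {status_code}")
        self.status_code = status_code


class Lane:
    """Per-lane admission control: admit up to `limit` concurrent requests, reject the rest immediately."""
    def __init__(self, limit: int):
        self.limit = limit
        self.active = 0
        self.rejected = 0

    @asynccontextmanager
    async def admit(self):
        if self.active >= self.limit:
            self.rejected += 1
            raise OverloadError(OVERLOAD_STATUS)
        self.active += 1
        try:
            yield
        finally:
            self.active -= 1


text_lane = Lane(TEXT_MAX_INFLIGHT)
image_lane = Lane(IMAGE_MAX_INFLIGHT)

# In a request handler (e.g. FastAPI): `async with image_lane.admit(): ...`
# and map OverloadError to the configured status code in an exception handler.
```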
-
The optimal T4 configuration was adopted here: ct2_inter_threads=2, ct2_max_queued_batches=16, ct2_batch_type=examples. With these settings, NLLB gets significantly better online-style performance while roughly preserving large-batch throughput. I did not apply the same configuration to the two Marian models, because the focused report showed complex trade-offs: opus-mt-zh-en is more balanced with the conservative defaults, while opus-mt-en-zh gained throughput but showed larger tail-latency fluctuations at c=8.

I also recorded the deployment/configuration lessons in /data/saas-search/translation/README.md and marked the optimization results in /data/saas-search/docs/TODO.txt. The key practical takeaways now documented are: use CT2 + float16, keep a single worker, set NLLB's inter_threads to 2 and max_queued_batches to 16, avoid inter_threads=4 on the T4 (it hurts high-batch throughput), and keep the Marian models' defaults conservative unless online/offline configurations are separated.
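For reference, a minimal sketch of how those knobs map onto the CTranslate2 API, assuming the pre-converted float16 NLLB directory listed in the 18 Mar entry below; the max_batch_size, target language code, and example input are assumptions:

```python
import ctranslate2

# NLLB translator with the T4 settings above: inter_threads=2,
# max_queued_batches=16, float16 compute on GPU, single worker process.
translator = ctranslate2.Translator(
    "models/translation/facebook/nllb-200-distilled-600M/ctranslate2-float16",
    device="cuda",
    compute_type="float16",
    inter_threads=2,
    max_queued_batches=16,
)

# batch_type="examples" sizes batches by sentence count rather than token count.
tokenized_sentences = [["eng_Latn", "▁red", "▁dress", "</s>"]]  # illustrative pre-tokenized input
results = translator.translate_batch(
    tokenized_sentences,
    target_prefix=[["zho_Hans"]],
    batch_type="examples",
    max_batch_size=16,
)
```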
18 Mar, 2026
4 commits
-
Implemented CTranslate2 for the three local translation models and switched the existing local_nllb / local_marian factories over to it. The new runtime lives in local_ctranslate2.py, including HF->CT2 auto-conversion, float16 compute type mapping, Marian direction handling, and NLLB target-prefix decoding. The service wiring is in service.py (line 113), and the three model configs now point at explicit ctranslate2-float16 dirs in config.yaml (line 133).

I also updated the setup path so this is usable end-to-end: ctranslate2>=4.7.0 was added to requirements_translator_service.txt and requirements.txt, the download script now supports pre-conversion in download_translation_models.py (line 27), and the docs/config examples were refreshed in translation/README.md. I installed ctranslate2 into .venv-translator, pre-converted all three models, and the CT2 artifacts are now already on disk:
- models/translation/facebook/nllb-200-distilled-600M/ctranslate2-float16
- models/translation/Helsinki-NLP/opus-mt-zh-en/ctranslate2-float16
- models/translation/Helsinki-NLP/opus-mt-en-zh/ctranslate2-float16

Verification was solid. `python3 -m compileall` passed, direct TranslationService smoke tests ran successfully in .venv-translator, and the focused NLLB benchmark on the local GPU showed a clear win:
- batch_size=16: HF 0.347s/batch, 46.1 items/s vs CT2 0.130s/batch, 123.0 items/s
- batch_size=1: HF 0.396s/request vs CT2 0.126s/request

One caveat: translation quality on some very short phrases, especially opus-mt-en-zh, still looks a bit rough in smoke tests, so I'd run your real quality set before fully cutting over. If you want, I can take the next step and update the benchmark script/report so you have a fresh full CT2 performance report for all three models.
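A minimal sketch of the NLLB target-prefix decoding path on top of one of the converted models, using the HuggingFace tokenizer for piece-level tokenization; the language codes and paths shown are examples, not the service's actual config:

```python
import ctranslate2
import transformers

MODEL_DIR = "models/translation/facebook/nllb-200-distilled-600M/ctranslate2-float16"
translator = ctranslate2.Translator(MODEL_DIR, device="cuda", compute_type="float16")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)


def translate_en_to_zh(text: str) -> str:
    # Encode to SentencePiece tokens; src_lang makes the tokenizer prepend the source language code.
    source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
    # NLLB needs the target language code as a decoding prefix.
    result = translator.translate_batch([source], target_prefix=[["zho_Hans"]])
    target_tokens = result[0].hypotheses[0]
    # Drop the prefix token before detokenizing.
    return tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens[1:]))


# translate_en_to_zh("lightweight waterproof hiking jacket")
```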