20 Mar, 2026
3 commits
-
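## Background

With multilingual indexes, user queries often mix Chinese and English. The parsing stage needs to flag script types explicitly, and the BM25 clauses need to cover the fields of the corresponding languages as well.

## Approach

### 1. Query analysis (query_parser.ParsedQuery)

- Add `contains_chinese`: the query text contains CJK characters (reuses _contains_cjk).
- Add `contains_english`: the tokenization result contains a "pure English, len>=3" token (fullmatch on letters with optional hyphens).
- Both flags are written to to_dict and to the intermediate results in the request context, for debugging and API exposure.

### 2. ES text recall (es_query_builder._build_advanced_text_query)

- For each search_lang clause: if the query contains English and the clause language is not en (and the tenant's index_languages include en), merge in the en fields; if it contains Chinese and the clause language is not zh (and index_languages include zh), merge in the zh fields.
- The boost of merged-in fields is multiplied by `mixed_script_merged_field_boost_scale` (default 0.8, adjustable via an ESQueryBuilder constructor argument).
- The fallback_original_query_* branches apply the same logic.

### 3. Implementation cleanup

- Introduce `MatchFieldSpec = (field_path, boost)`: `_build_match_field_specs` is the single source of weights; `_merge_supplemental_lang_field_specs` / `_expand_match_field_specs_for_mixed_script` merge and scale on tuples; finally `_format_match_field_specs` formats them as ES `path^boost`, which avoids building strings first and parsing them back.

## Tests

- tests/test_query_parser_mixed_language.py: script flags and token rules.
- tests/test_es_query_builder.py: field merging, 0.8 scaling, index_languages restriction.

Made-with: Cursor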
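The script flags and the tuple-based field merging described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in query_parser or es_query_builder: the regexes, the helper names, and the field paths (`title.zh`, `title.en`) are assumptions made for the example; only the `(field_path, boost)` shape, the len>=3 rule, and the 0.8 default come from the commit message.

```python
import re
from typing import Dict, List, Sequence, Tuple

MatchFieldSpec = Tuple[str, float]  # (field_path, boost), as named in the commit

# Assumption: only the CJK Unified Ideographs block is checked here.
_CJK_RE = re.compile(r"[\u4e00-\u9fff]")
# Assumption: letters with optional internal hyphens, e.g. "t-shirt".
_EN_TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")


def contains_chinese(text: str) -> bool:
    return bool(_CJK_RE.search(text))


def contains_english(tokens: Sequence[str]) -> bool:
    # A token counts only if it is purely English and at least 3 characters long.
    return any(len(t) >= 3 and _EN_TOKEN_RE.fullmatch(t) for t in tokens)


def expand_for_mixed_script(
    clause_lang: str,
    specs: Sequence[MatchFieldSpec],
    specs_by_lang: Dict[str, List[MatchFieldSpec]],
    index_languages: Sequence[str],
    has_zh: bool,
    has_en: bool,
    scale: float = 0.8,
) -> List[MatchFieldSpec]:
    """Merge the other script's fields into a clause, scaling their boost."""
    merged = list(specs)
    seen = {path for path, _ in merged}
    for lang, present in (("en", has_en), ("zh", has_zh)):
        if not present or lang == clause_lang or lang not in index_languages:
            continue
        for path, boost in specs_by_lang.get(lang, []):
            if path not in seen:
                merged.append((path, boost * scale))
                seen.add(path)
    return merged


def format_specs(specs: Sequence[MatchFieldSpec]) -> List[str]:
    # Formatting to "path^boost" happens once, at the very end.
    return [f"{path}^{boost:g}" for path, boost in specs]


# Hypothetical field paths, for illustration only.
SPECS_BY_LANG = {"zh": [("title.zh", 3.0)], "en": [("title.en", 3.0)]}
tokens = ["防水", "jacket"]
fields = format_specs(
    expand_for_mixed_script(
        "zh", SPECS_BY_LANG["zh"], SPECS_BY_LANG, ["zh", "en"],
        has_zh=contains_chinese("防水 jacket"), has_en=contains_english(tokens),
    )
)
```

Keeping boosts as numbers until the final formatting step is what lets the scale factor be applied with plain multiplication instead of string parsing.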
19 Mar, 2026
5 commits
-
- Text and image embedding are now split into separate services/processes, while still keeping a single replica as requested. The split lives in [embeddings/server.py](/data/saas-search/embeddings/server.py#L112), [config/services_config.py](/data/saas-search/config/services_config.py#L68), [providers/embedding.py](/data/saas-search/providers/embedding.py#L27), and the start scripts [scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L36), [scripts/start_embedding_text_service.sh](/data/saas-search/scripts/start_embedding_text_service.sh), [scripts/start_embedding_image_service.sh](/data/saas-search/scripts/start_embedding_image_service.sh).
- Independent admission control is now in place: text and image have separate inflight limits, and image can be kept much stricter than text. The request handling, reject path, `/health`, and `/ready` are in [embeddings/server.py](/data/saas-search/embeddings/server.py#L613), [embeddings/server.py](/data/saas-search/embeddings/server.py#L786), and [embeddings/server.py](/data/saas-search/embeddings/server.py#L1028).
- I checked the Redis embedding cache. It did exist, but there was a real flaw: cache keys did not distinguish `normalize=true` from `normalize=false`. I fixed that in [embeddings/cache_keys.py](/data/saas-search/embeddings/cache_keys.py#L6), and both text and image now use the same normalize-aware keying. I also added service-side BF16 cache hits that short-circuit before the model lane, so repeated requests no longer get throttled behind image inference.

**What This Means**

- Image pressure no longer blocks text, because they are on different ports/processes.
- Repeated text/image requests now return from Redis without consuming model capacity.
- Over-capacity requests are rejected quickly instead of sitting blocked.
- I did not add a load balancer or multi-replica HA, per your GPU constraint. I also did not build Grafana/Prometheus dashboards in this pass, but `/health` now exposes the metrics needed to wire them.

**Validation**

- Tests passed: `.venv/bin/python -m pytest -q tests/test_embedding_pipeline.py tests/test_embedding_service_limits.py` -> `10 passed`
- Stress test tool updates are in [scripts/perf_api_benchmark.py](/data/saas-search/scripts/perf_api_benchmark.py#L155)
- Fresh benchmark on split text service `6105`: 535 requests / 3s, 100% success, `174.56 rps`, avg `88.48 ms`
- Fresh benchmark on split image service `6108`: 1213 requests / 3s, 100% success, `403.32 rps`, avg `9.64 ms`
- Live health after the run showed cache hits and non-zero cache-hit latency accounting:
  - text `avg_latency_ms=4.251`
  - image `avg_latency_ms=1.462`
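The "separate inflight limits, reject quickly" behaviour described above can be sketched as a small per-lane counter. This is a minimal sketch, not the code in embeddings/server.py: the class name `InflightLimiter`, the `handle` helper, and the 503 default are assumptions made for illustration.

```python
import threading
from typing import Any, Callable, Dict, Optional, Tuple


class InflightLimiter:
    """Reject-fast admission control for one lane (e.g. text or image)."""

    def __init__(self, max_inflight: int) -> None:
        self.max_inflight = max_inflight
        self.active = 0
        self.rejected = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Never waits: either a slot is free right now or the request is rejected.
        with self._lock:
            if self.active >= self.max_inflight:
                self.rejected += 1
                return False
            self.active += 1
            return True

    def release(self) -> None:
        with self._lock:
            self.active = max(0, self.active - 1)

    def stats(self) -> Dict[str, int]:
        # The kind of counters a /health endpoint could expose.
        with self._lock:
            return {
                "active": self.active,
                "rejected": self.rejected,
                "max_inflight": self.max_inflight,
            }


def handle(
    limiter: InflightLimiter,
    work: Callable[[], Any],
    overload_status: int = 503,  # assumption: status code used for overload
) -> Tuple[int, Optional[Any]]:
    if not limiter.try_acquire():
        return overload_status, None
    try:
        return 200, work()
    finally:
        limiter.release()


# Separate lanes: the image lane can be much stricter than the text lane.
text_lane = InflightLimiter(max_inflight=2)
image_lane = InflightLimiter(max_inflight=1)
```

Because `try_acquire` returns immediately, a full image lane costs a rejected request almost nothing, and the text lane is unaffected by it.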
-
The instability is very likely real overload, but `lsof -i :6005 | wc -l = 75` alone does not prove it. What does matter is the live shape of the service: it is a single `uvicorn` worker on port `6005`, and the code had one shared process handling both text and image requests, with image work serialized behind a single lock. Under bursty image traffic, requests could pile up and sit blocked with almost no useful tracing, which matches the "only blocking observed" symptom.

This change adds persistent log files, request IDs, per-request request/response/failure logs, text microbatch dispatch logs, health stats with active/rejected counts, and explicit overload admission control. New knobs are `TEXT_MAX_INFLIGHT`, `IMAGE_MAX_INFLIGHT`, and `EMBEDDING_OVERLOAD_STATUS_CODE`. Startup output now shows those limits and log paths in [scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L80). I also added focused tests in [tests/test_embedding_service_limits.py](/data/saas-search/tests/test_embedding_service_limits.py#L1).

What this means operationally:

- Text and image are still in one process, so this is not the final architecture.
- But image spikes will now be rejected quickly once the image lane is full, instead of sitting around and consuming the worker pool.
- Logs will now show each request, each rejection, each microbatch dispatch, backend time, response time, and request ID.

Verification:

- Passed: `.venv/bin/python -m pytest -q tests/test_embedding_service_limits.py`
- I also ran a wider test command, but 3 failures came from pre-existing drift in [tests/test_embedding_pipeline.py](/data/saas-search/tests/test_embedding_pipeline.py#L95), where the tests still monkeypatch `embeddings.text_encoder.redis.Redis` even though [embeddings/text_encoder.py](/data/saas-search/embeddings/text_encoder.py#L1) no longer imports `redis` that way.
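The default CLIP_AS_SERVICE model has been switched to ViT-L-14, and this configuration is now consolidated into a single, changeable entry point. The default lives in CLIP_AS_SERVICE_MODEL_NAME in embeddings/config.py (line 29) and is currently CN-CLIP/ViT-L-14; scripts/start_cnclip_service.sh (line 37) reads this setting automatically instead of hard-coding the default model in the script, and also supports temporary overrides via CNCLIP_MODEL_NAME and --model-name. scripts/start_embedding_service.sh (line 29) and embeddings/server.py (line 425) now also print model information, which makes it easier to check which configuration is actually in use.

The docs were updated as well, mainly docs/CNCLIP_SERVICE说明文档.md (line 62) and embeddings/README.md (line 58): they now describe a "configuration is authoritative, and overridable" mechanism rather than a hard-coded model name. Related summary docs and internal notes were switched to the same configuration-driven wording.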
-
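inference", rather than chunking by the original number of inputs first. In other words, if 100 requests become 150 segments after sentence splitting, then with batch_size=64 inference runs in three batches of 64 + 64 + 22, and the results are merged according to the original splitting plan and restored to 100 outputs. This change is in local_seq2seq.py (line 241) and local_ctranslate2.py (line 391).

Logging now also covers the two layers of key information you asked for:

- Segmentation summary log: Translation segmentation summary prints the number of inputs, the number of non-empty inputs, the number of inputs that were split, the total number of segments, the current batch_size, and statistics on how many segments each input was split into; see local_seq2seq.py (line 216) and local_ctranslate2.py (line 366).
- Per-batch inference log: Translation inference batch prints the batch index, the total number of batches, the number of segments in the batch, length statistics, and a preview of the first item. CTranslate2 additionally prints Translation model batch detail, which adds token lengths and max_decoding_length; see local_ctranslate2.py (line 294).

I also added tests covering "batching after sentence splitting" and "logs contain the segmentation summary and per-batch inference logs", in test_translation_local_backends.py (line 358).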
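The "split first, batch over segments, then merge back" flow above can be sketched as follows. This is an illustrative sketch, not the code in local_seq2seq.py or local_ctranslate2.py: the function name and the `split` / `translate_batch` callables are hypothetical stand-ins for the real splitter and model call.

```python
from typing import Callable, List, Sequence


def translate_with_segment_batching(
    texts: Sequence[str],
    split: Callable[[str], List[str]],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 64,
    joiner: str = " ",
) -> List[str]:
    # 1. Build the splitting plan: how many segments each input produced.
    plan: List[int] = []
    segments: List[str] = []
    for text in texts:
        parts = split(text) if text.strip() else []
        plan.append(len(parts))
        segments.extend(parts)

    # 2. Batch over segments, not over the original inputs.
    outputs: List[str] = []
    for start in range(0, len(segments), batch_size):
        outputs.extend(translate_batch(segments[start:start + batch_size]))

    # 3. Merge the outputs back according to the plan, one result per input.
    results: List[str] = []
    pos = 0
    for count in plan:
        results.append(joiner.join(outputs[pos:pos + count]))
        pos += count
    return results


# Toy run: 100 inputs, 50 of which split into two segments -> 150 segments.
batch_sizes: List[int] = []


def fake_translate(batch: List[str]) -> List[str]:
    batch_sizes.append(len(batch))
    return [s.upper() for s in batch]


inputs = ["a|b"] * 50 + ["c"] * 50
results = translate_with_segment_batching(
    inputs, split=lambda t: t.split("|"), translate_batch=fake_translate
)
```

In this toy run `batch_sizes` ends up as `[64, 64, 22]` and `results` still has 100 entries, matching the arithmetic in the commit message.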
-
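Changes:

- Added a sentence-splitting and budget utility: translation/text_splitter.py
- Wired into the HF local backend: translation/backends/local_seq2seq.py (line 157)
- Wired into the CT2 local backend: translation/backends/local_ctranslate2.py (line 301)
- Added tests: tests/test_translation_local_backends.py

I first went through the actual limits in the code; the key configuration is in config/config.yaml (line 133):

- nllb-200-distilled-600m: max_input_length=256, max_new_tokens=64, with ct2_decoding_length_mode=source + extra=8. The conservative input budget computed from this configuration is now 56 tokens.
- opus-mt-zh-en: max_input_length=256, max_new_tokens=256. The conservative input budget is now 248 tokens.
- opus-mt-en-zh: same as above, also 248 tokens.

The splitting strategy in this version:

- First split on strong boundaries: 。!?!?;;…, newlines, and the English period
- If that is not enough, split on weak boundaries: ,,、::()()[]【】/|
- Only if that is still not enough, split on whitespace
- As a last resort, hard-cut under the token budget
- Over-long input is "translated sentence by sentence and then joined back": joined without spaces by default when the target language is Chinese, joined with spaces by default for English and similar languages, while trying not to split too finely

Verification:

- python3 -m compileall translation tests/test_translation_local_backends.py passed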
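The tiered splitting strategy above (strong boundaries, then weak boundaries, then whitespace, then a hard cut) can be sketched as below. This is a simplified illustration and not translation/text_splitter.py: the budget is counted in characters as a stand-in for tokens, every period is treated as a sentence boundary, and the function names are hypothetical.

```python
from typing import List

STRONG = set("。!?!?;;…\n.")       # note: "." also cuts inside e.g. "3.14"
WEAK = set(",,、::()()[]【】/|")
WHITESPACE = set(" \t")
LEVELS = [STRONG, WEAK, WHITESPACE]


def _split_after(text: str, boundaries: set) -> List[str]:
    """Cut after each boundary character, keeping it attached to the left piece."""
    pieces, current = [], []
    for ch in text:
        current.append(ch)
        if ch in boundaries:
            pieces.append("".join(current))
            current = []
    if current:
        pieces.append("".join(current))
    return pieces


def _split_to_budget(text: str, budget: int, level: int = 0) -> List[str]:
    if len(text) <= budget:
        return [text]
    if level >= len(LEVELS):
        # Last resort: hard cut at the budget.
        return [text[i:i + budget] for i in range(0, len(text), budget)]
    out: List[str] = []
    for piece in _split_after(text, LEVELS[level]):
        out.extend(_split_to_budget(piece, budget, level + 1))
    return out


def _pack(pieces: List[str], budget: int) -> List[str]:
    """Greedily re-merge neighbours so the result is not fragmented more than needed."""
    packed: List[str] = []
    for piece in pieces:
        if packed and len(packed[-1]) + len(piece) <= budget:
            packed[-1] += piece
        else:
            packed.append(piece)
    return packed


def split_text(text: str, budget: int) -> List[str]:
    return _pack(_split_to_budget(text, budget), budget)


segments = split_text("你好。今天天气不错,我们去公园散步吧!", budget=10)
```

A real implementation would measure each piece with the model's tokenizer rather than `len`, and would join the translated segments with or without spaces depending on the target language, as the commit describes.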
18 Mar, 2026
6 commits
-
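The core change is in rerank_client.py (line 99): fuse_scores_and_resort now computes the score with a smoothed multiplicative formula of rerank * knn * text, prefers reading base_query and knn_query from hit["matched_queries"], and also writes _text_score / _knn_score back into the debug fields. To give the KNN query a name as well, I added name: "knn_query" to the top-level knn; see es_query_builder.py (line 273). At search time, include_named_queries_score is turned on within the rerank window, and track_scores is added when an explicit sort is used; see searcher.py (line 400) and es_client.py (line 224).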
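One way to picture a smoothed multiplicative fusion of this kind is sketched below. The commit does not spell out the actual formula, so the additive `smooth` constant, the `_fused_score` field, and the function names are assumptions for illustration; the sketch also assumes `matched_queries` arrives as a name-to-score mapping, which is the shape Elasticsearch returns when include_named_queries_score is enabled.

```python
from typing import Any, Dict, List, Sequence


def named_score(hit: Dict[str, Any], name: str) -> float:
    matched = hit.get("matched_queries")
    # Without include_named_queries_score this is a plain list of names,
    # which carries no scores, so fall back to 0.0.
    if isinstance(matched, dict):
        return float(matched.get(name, 0.0))
    return 0.0


def fuse_and_resort(
    hits: List[Dict[str, Any]],
    rerank_scores: Sequence[float],
    smooth: float = 1.0,  # hypothetical smoothing constant
) -> List[Dict[str, Any]]:
    for hit, rerank in zip(hits, rerank_scores):
        text = named_score(hit, "base_query")
        knn = named_score(hit, "knn_query")
        hit["_text_score"] = text
        hit["_knn_score"] = knn
        # Additive smoothing keeps a missing component (score 0) from
        # zeroing out the whole product.
        hit["_fused_score"] = (rerank + smooth) * (knn + smooth) * (text + smooth)
    return sorted(hits, key=lambda h: h["_fused_score"], reverse=True)


hits = [
    {"_id": "b", "matched_queries": {"base_query": 1.0}},
    {"_id": "a", "matched_queries": {"base_query": 2.0, "knn_query": 0.5}},
]
ranked = fuse_and_resort(hits, rerank_scores=[0.9, 0.1])
```

Here hit "a" ends up first even though its rerank score is lower, because it matched both the text query and the KNN query.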
-
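2. Optimized caching: the cache granularity is per product, and on each call only the items in the batch that are not cached are recomputed; the key is a hash of each product's input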
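Per-item caching of this kind can be sketched as below. This is a minimal sketch under stated assumptions: a plain dict stands in for the real cache backend, and `item_key`, `compute_with_cache`, and the `"item"` prefix are hypothetical names, not the project's API.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, MutableMapping, Sequence


def item_key(item: Dict[str, Any], prefix: str = "item") -> str:
    # Stable hash of the item's input: sorted keys make field order irrelevant.
    payload = json.dumps(item, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return f"{prefix}:{hashlib.sha256(payload).hexdigest()}"


def compute_with_cache(
    items: Sequence[Dict[str, Any]],
    compute_batch: Callable[[List[Dict[str, Any]]], List[Any]],
    cache: MutableMapping[str, Any],
) -> List[Any]:
    keys = [item_key(item) for item in items]
    results = [cache.get(key) for key in keys]
    # Only the cache misses are sent to the expensive batch computation.
    miss_idx = [i for i, value in enumerate(results) if value is None]
    if miss_idx:
        fresh = compute_batch([items[i] for i in miss_idx])
        for i, value in zip(miss_idx, fresh):
            cache[keys[i]] = value
            results[i] = value
    return results


calls: List[int] = []


def fake_compute(batch: List[Dict[str, Any]]) -> List[str]:
    calls.append(len(batch))
    return [f"result-for-{item['id']}" for item in batch]


cache: Dict[str, Any] = {}
first = compute_with_cache([{"id": 1}, {"id": 2}], fake_compute, cache)
second = compute_with_cache([{"id": 1}, {"id": 3}], fake_compute, cache)
```

In the second call only `{"id": 3}` is recomputed, because `{"id": 1}` is already cached under the hash of its input.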
17 Mar, 2026
4 commits
-
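2. Abstract a reusable embedding Redis cache class (shared by text and image)

Details:

1. Embedding cache now stores BF16 in Redis (restored to FP32 on read)

Key behavior (implemented following the flow you gave)

- Before write: FP32 embedding → (when normalize_embeddings=True) L2 normalize → convert to BF16 → bytes (2 bytes per dimension, big-endian) → redis.setex
- After read: redis.get bytes → BF16 → restore to FP32 (np.float32 vector)

Changes

- Added embeddings/bf16.py
  - Provides float32_to_bf16 / bf16_to_float32
  - encode_embedding_for_redis(): FP32 → BF16 → bytes
  - decode_embedding_from_redis(): bytes → BF16 → FP32
  - l2_normalize_fp32(): normalizes on demand
- Modified embeddings/text_encoder.py
  - Redis value changed from pickle.dumps(np.ndarray) to BF16 bytes
  - Cache key now includes a normalize flag: {prefix}:{n0|n1}:{query} (so that different normalize settings do not share cache entries)
- Modified tests/test_embedding_pipeline.py
  - The cache-hit case now writes BF16 bytes and uses the new key: embedding:n1:cached-text
- Modified docs/缓存与Redis使用说明.md
  - The Key/Value format of the embedding cache is updated to BF16 bytes + n0/n1
- Modified scripts/redis/redis_cache_health_check.py
  - The embedding pattern is no longer hard-coded as embedding:*; it is read from REDIS_CONFIG["embedding_cache_prefix"]
  - The value preview changed from pickle decoding to BF16 decoding, then shows dim/bytes/dtype

Self-check

- After activating the environment, I ran a BF16 encode/decode round-trip sanity check: byte length and dimension are restored correctly, and the norm of a normalized vector read back is close to 1 (with BF16 quantization error).

2. Abstract a reusable embedding Redis cache class (shared by text and image)

- Added embeddings/redis_embedding_cache.py: RedisEmbeddingCache
  - Unified Redis initialization (reads REDIS_CONFIG)
  - Unified BF16 bytes encode/decode (reuses embeddings/bf16.py)
  - Unified expiry policy: setex(expire_time) on write; expire(expire_time) after a cache hit to refresh the TTL (sliding expiry)
  - Unified handling of errors and bad data: on a decode failure, or a vector that is not 1D, is empty, or contains NaN/Inf, the key is deleted and treated as a miss
- Already wired in
  - Text: embeddings/text_encoder.py uses self.cache = RedisEmbeddingCache(key_prefix=..., namespace=""); the key is still {prefix}:{query}
  - Image: embeddings/image_encoder.py uses self.cache = RedisEmbeddingCache(key_prefix=..., namespace="image"); the key is still {prefix}:image:{url_or_path}

-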
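The FP32 → BF16 → big-endian bytes round trip described in the BF16 cache commit above can be sketched with NumPy as follows. This is an illustrative sketch, not the contents of embeddings/bf16.py: the function names are hypothetical, round-to-nearest-even is one possible rounding choice (the commit does not say which one is used), and NaN inputs are not handled specially.

```python
import numpy as np


def encode_embedding(vec: np.ndarray, normalize: bool = False) -> bytes:
    """FP32 -> (optional L2 normalize) -> BF16 -> big-endian bytes, 2 bytes per dim."""
    v = np.ascontiguousarray(vec, dtype=np.float32)
    if normalize:
        norm = float(np.linalg.norm(v))
        if norm > 0.0:
            v = (v / norm).astype(np.float32)
    bits = v.view(np.uint32)
    # BF16 is the top 16 bits of FP32; add a rounding term before truncating
    # (round to nearest, ties to even).
    rounding = ((bits >> np.uint32(16)) & np.uint32(1)) + np.uint32(0x7FFF)
    bf16 = ((bits + rounding) >> np.uint32(16)).astype(np.uint16)
    return bf16.astype(">u2").tobytes()


def decode_embedding(data: bytes) -> np.ndarray:
    """Big-endian BF16 bytes -> FP32 vector (the low 16 mantissa bits are zero)."""
    bf16 = np.frombuffer(data, dtype=">u2").astype(np.uint32)
    return (bf16 << np.uint32(16)).view(np.float32)


# Values exactly representable in BF16 survive the round trip unchanged.
exact = np.array([1.0, -2.5, 0.15625], dtype=np.float32)
raw = encode_embedding(exact)
restored = decode_embedding(raw)

# A normalized vector comes back with a norm close to, but not exactly, 1.
rng = np.random.default_rng(0)
unit = decode_embedding(encode_embedding(rng.standard_normal(128), normalize=True))
```

Storing 2 bytes per dimension halves the Redis payload relative to raw FP32, at the cost of the small quantization error noted in the self-check.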
- Rename indexer/product_annotator.py to indexer/product_enrich.py and remove CSV-based CLI entrypoint, keeping only in-memory analyze_products API
- Introduce dedicated product_enrich logging with separate verbose log file for full LLM requests/responses
- Change indexer and /indexer/enrich-content API wiring to use indexer.product_enrich instead of indexer.product_annotator, updating tests and docs accordingly
- Switch translate_prompts to share SUPPORTED_INDEX_LANGUAGES from tenant_config_loader and reuse that mapping for language code → display name
- Remove hard SUPPORTED_LANGS constraint from LLM content-enrichment flow, driving languages directly from tenant/indexer configuration
- Redesign LLM prompt generation to support multi-round, multi-language tables: first round in English, subsequent rounds translate the entire table (headers + cells) into target languages using English instructions
13 Mar, 2026
6 commits
-
2. Translation rate limiting and the corresponding handling (qwen-mt rate limit)
12 Mar, 2026
5 commits
11 Mar, 2026
3 commits
10 Mar, 2026
4 commits
-
and the microservices (embedding/translate/rerank).

**New files**

- Main benchmark script: [perf_api_benchmark.py](/data/saas-search/scripts/perf_api_benchmark.py:1)
- Custom case template: [perf_cases.json.example](/data/saas-search/scripts/perf_cases.json.example:1)

**Docs update**

- Added an "API-level benchmark script" section to the API integration guide: [搜索API对接指南.md](/data/saas-search/docs/搜索API对接指南.md:2089)

**Supported scenarios**

- `backend_search` -> `POST /search/`
- `backend_suggest` -> `GET /search/suggestions`
- `embed_text` -> `POST /embed/text`
- `translate` -> `POST /translate`
- `rerank` -> `POST /rerank`
- `all` -> runs all of the above scenarios in sequence

**Commands you can run directly**

1. `./.venv/bin/python scripts/perf_api_benchmark.py --scenario backend_suggest --tenant-id 162 --duration 30 --concurrency 50`
2. `./.venv/bin/python scripts/perf_api_benchmark.py --scenario backend_search --tenant-id 162 --duration 30 --concurrency 20`
3. `./.venv/bin/python scripts/perf_api_benchmark.py --scenario all --tenant-id 162 --duration 60 --concurrency 30 --output perf_reports/all.json`
4. `./.venv/bin/python scripts/perf_api_benchmark.py --scenario all --tenant-id 162 --cases-file scripts/perf_cases.json.example --duration 60 --concurrency 40 --output perf_reports/custom_all.json`

**Optional arguments**

- `--backend-base` `--embedding-base` `--translator-base` `--reranker-base`: point to your actual service addresses
- `--max-requests`: cap the total number of requests
- `--max-errors`: stop early once errors reach the threshold
- `--pause`: pause between scenarios in `all` mode

**Verified locally**

- `backend_suggest` small-scale concurrent benchmark succeeded (200, 100% success rate)
- `backend_search` small-scale concurrent benchmark succeeded (200, 100% success rate)
- `translate` small-scale concurrent benchmark succeeded (200, 100% success rate)
09 Mar, 2026
4 commits
-
`CNCLIP_DEVICE=cuda TEI_USE_GPU=1 ./scripts/service_ctl.sh start`: search backend + indexer + test frontend + 4 microservices all start and run successfully