01 Apr, 2026
6 commits
-
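scripts/evaluation/eval_framework/constants.py: 500 → 200. In Rebuild, the rerank_score: 1.0 assigned to results with rank <= recall_n still applies under this K.
2. LLM batch bounds
   Minimum batches: DEFAULT_REBUILD_MIN_LLM_BATCHES 20 → 10
   Maximum batches: still 40 (unchanged)
3. Early-stop condition (_annotate_rebuild_batches)
   Once min_batches have run, each batch is checked. A batch with no Exact results (exact_n == 0) counts as a bad batch if either of the following holds:
   - irrelevant_ratio >= 0.94
   - (irrelevant + Low Relevant) / n >= 0.96 (weak relevance uses RELEVANCE_LOW)
   Two consecutive bad batches trigger an early stop (previously: three consecutive batches with irrelevant > 0.92).
   The batch log now includes low_ratio and irrelevant_plus_low_ratio; the rebuild metadata now includes rebuild_irrel_low_combined_stop_ratio.
4. CLI
   --search-recall-top-k: help text now states a default of 200
   --rebuild-min-batches: help text now states a default of 10
   --rebuild-irrelevant-stop-ratio / --rebuild-irrelevant-stop-streak: help text matches the new logic
   New: --rebuild-irrel-low-combined-stop-ratio (default 0.96)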
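The early-stop rule in item 3 can be sketched roughly as follows. This is a minimal illustration, not the code in _annotate_rebuild_batches: the BatchStats type and both function names are hypothetical, and it assumes the bad-batch streak is only counted for batches that come after the first min_batches.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BatchStats:
    """Label counts for one LLM batch (hypothetical structure)."""
    n: int             # items annotated in the batch
    exact_n: int       # items labelled Exact
    low_n: int         # items labelled Low Relevant
    irrelevant_n: int  # items labelled Irrelevant


def is_bad_batch(b: BatchStats,
                 irrelevant_stop_ratio: float = 0.94,
                 irrel_low_combined_stop_ratio: float = 0.96) -> bool:
    # A batch containing any Exact hit is never bad.
    # Assumption: an empty batch is treated as not bad.
    if b.n == 0 or b.exact_n > 0:
        return False
    irrelevant_ratio = b.irrelevant_n / b.n
    irrelevant_plus_low_ratio = (b.irrelevant_n + b.low_n) / b.n
    return (irrelevant_ratio >= irrelevant_stop_ratio
            or irrelevant_plus_low_ratio >= irrel_low_combined_stop_ratio)


def batches_to_run(batches: Sequence[BatchStats],
                   min_batches: int = 10,
                   max_batches: int = 40,
                   stop_streak: int = 2) -> int:
    """Return how many batches run before early stop or the max is hit."""
    streak = 0
    ran = 0
    for b in batches[:max_batches]:
        ran += 1
        if ran <= min_batches:
            continue  # the first min_batches always run and are not checked
        streak = streak + 1 if is_bad_batch(b) else 0
        if streak >= stop_streak:
            break
    return ran
```

With these defaults, a run whose batches are all bad stops after 12 batches: the 10 mandatory ones plus two consecutive bad ones.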
31 Mar, 2026
11 commits
30 Mar, 2026
1 commit
27 Mar, 2026
4 commits
-
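coarse_rank.output_window -> SKU selection and title suffix are then applied -> fine ranking calls a lightweight reranker and trims to fine_rank.output_window -> final reranking calls the existing reranker, and fine_score is added to the final fusion. The reranker client/provider now also picks a different service_url per service_profile, so fine and final can share the same service code and only need to run as separate instances.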
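The per-profile URL selection might look roughly like the sketch below. It only illustrates the idea of one client resolving a different service_url per service_profile; the RerankerProfile type, the resolve_service_url helper, and the URLs and ports are all hypothetical and not taken from the repository.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RerankerProfile:
    """Connection settings for one reranker instance (hypothetical)."""
    service_url: str


# Hypothetical mapping: same service code, two separately started instances.
PROFILES: Mapping[str, RerankerProfile] = {
    "fine": RerankerProfile(service_url="http://127.0.0.1:9001/rerank"),
    "final": RerankerProfile(service_url="http://127.0.0.1:9002/rerank"),
}


def resolve_service_url(service_profile: str,
                        profiles: Mapping[str, RerankerProfile] = PROFILES) -> str:
    """Return the service_url configured for the given profile name."""
    try:
        return profiles[service_profile].service_url
    except KeyError:
        known = ", ".join(sorted(profiles))
        raise ValueError(
            f"unknown service_profile {service_profile!r}; known: {known}"
        ) from None
```

Failing loudly on an unknown profile keeps a misconfigured fine stage from silently calling the final reranker.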
26 Mar, 2026
3 commits
25 Mar, 2026
7 commits
-
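(Previously, because of an error, the attention method was switched back to TRITON_ATTN, which performed worse than the earlier vLLM version. That error turned out to be fixable; it is now fixed and FLASHINFER is kept.)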
-
error), and allow vLLM to choose the attention backend itself via config or an environment variable. -- temporary version
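One way to express "pin a backend, or let vLLM choose" is sketched below. It assumes a vLLM version that reads the VLLM_ATTENTION_BACKEND environment variable; the build_vllm_env helper and the "auto" sentinel are hypothetical and are not the names used in this commit.

```python
from typing import Mapping, Optional


def build_vllm_env(attention_backend: Optional[str],
                   base_env: Mapping[str, str]) -> dict:
    """Return a copy of base_env with the attention backend applied.

    "auto" (or None) removes any pinned value so that vLLM selects the
    backend itself; any other value is passed through in upper case.
    """
    env = dict(base_env)
    choice = (attention_backend or "auto").strip()
    if choice.lower() == "auto":
        # Assumption: with the variable unset, vLLM picks a backend on its own.
        env.pop("VLLM_ATTENTION_BACKEND", None)
    else:
        env["VLLM_ATTENTION_BACKEND"] = choice.upper()
    return env
```

The returned mapping would be applied to the process environment before the engine is created, for example through `os.environ.update(...)` for a pinned value.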
-
these two settings, four combinations:
backend: qwen3_vllm | qwen3_vllm_score
instruction_format: compact | standard

Running python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5 produces the performance report.

Average latency (ms, client-side wall-clock time for POST /rerank, --seed 99)

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---|---|---|---|---|---|---|---|
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |

Summary: on this machine (T4), with the current vLLM and the YAML above (max_model_len=160, infer_batch_size=100, etc.), compact is faster than standard for both backends. The fastest combination overall is qwen3_vllm + compact (n=1000 ≈ 2.16 s) and the slowest is qwen3_vllm_score + standard (≈ 2.93 s). The ordering may differ on other GPUs or vLLM versions.
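To compare the four combinations beyond the n=1000 endpoints, the reported averages can be turned into per-document latency and a standard/compact ratio. The snippet below only re-derives numbers from the table above; the helper names are illustrative.

```python
SIZES = [100, 200, 400, 600, 800, 1000]

# Average latency in ms, copied from the benchmark report above.
LATENCY_MS = {
    ("qwen3_vllm", "compact"): [213.5, 418.0, 861.4, 1263.4, 1744.3, 2162.2],
    ("qwen3_vllm", "standard"): [254.9, 475.4, 909.7, 1353.2, 1912.5, 2406.7],
    ("qwen3_vllm_score", "compact"): [239.2, 480.2, 966.2, 1433.5, 1937.2, 2428.4],
    ("qwen3_vllm_score", "standard"): [299.6, 591.8, 1178.9, 1773.7, 2341.6, 2931.7],
}


def ms_per_doc(backend: str, instruction_format: str) -> list:
    """Average latency divided by the number of documents, per batch size."""
    row = LATENCY_MS[(backend, instruction_format)]
    return [latency / n for latency, n in zip(row, SIZES)]


def standard_over_compact(backend: str) -> list:
    """Ratio of standard to compact latency for each batch size."""
    standard = LATENCY_MS[(backend, "standard")]
    compact = LATENCY_MS[(backend, "compact")]
    return [s / c for s, c in zip(standard, compact)]


if __name__ == "__main__":
    for backend in ("qwen3_vllm", "qwen3_vllm_score"):
        ratios = ", ".join(f"{r:.3f}" for r in standard_over_compact(backend))
        print(f"{backend}: standard/compact = {ratios}")
```

At n=1000 the standard format costs about 11% more than compact on qwen3_vllm and about 21% more on qwen3_vllm_score.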
22 Mar, 2026
1 commit
21 Mar, 2026
2 commits
20 Mar, 2026
2 commits
19 Mar, 2026
3 commits
-
- Text and image embedding are now split into separate services/processes, while still keeping a single replica as requested. The split lives in [embeddings/server.py](/data/saas-search/embeddings/server.py#L112), [config/services_config.py](/data/saas-search/config/services_config.py#L68), [providers/embedding.py](/data/saas-search/providers/embedding.py#L27), and the start scripts [scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L36), [scripts/start_embedding_text_service.sh](/data/saas-search/scripts/start_embedding_text_service.sh), [scripts/start_embedding_image_service.sh](/data/saas-search/scripts/start_embedding_image_service.sh).
- Independent admission control is in place now: text and image have separate inflight limits, and image can be kept much stricter than text. The request handling, reject path, `/health`, and `/ready` are in [embeddings/server.py](/data/saas-search/embeddings/server.py#L613), [embeddings/server.py](/data/saas-search/embeddings/server.py#L786), and [embeddings/server.py](/data/saas-search/embeddings/server.py#L1028).
- I checked the Redis embedding cache. It did exist, but there was a real flaw: cache keys did not distinguish `normalize=true` from `normalize=false`. I fixed that in [embeddings/cache_keys.py](/data/saas-search/embeddings/cache_keys.py#L6), and both text and image now use the same normalize-aware keying. I also added service-side BF16 cache hits that short-circuit before the model lane, so repeated requests no longer get throttled behind image inference.

**What This Means**
- Image pressure no longer blocks text, because they are on different ports/processes.
- Repeated text/image requests now return from Redis without consuming model capacity.
- Over-capacity requests are rejected quickly instead of sitting blocked.
- I did not add a load balancer or multi-replica HA, per your GPU constraint. I also did not build Grafana/Prometheus dashboards in this pass, but `/health` now exposes the metrics needed to wire them.

**Validation**
- Tests passed: `.venv/bin/python -m pytest -q tests/test_embedding_pipeline.py tests/test_embedding_service_limits.py` -> `10 passed`
- Stress test tool updates are in [scripts/perf_api_benchmark.py](/data/saas-search/scripts/perf_api_benchmark.py#L155)
- Fresh benchmark on split text service `6105`: 535 requests / 3s, 100% success, `174.56 rps`, avg `88.48 ms`
- Fresh benchmark on split image service `6108`: 1213 requests / 3s, 100% success, `403.32 rps`, avg `9.64 ms`
- Live health after the run showed cache hits and non-zero cache-hit latency accounting:
  - text `avg_latency_ms=4.251`
  - image `avg_latency_ms=1.462`
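The two mechanisms described above, normalize-aware cache keys and per-lane inflight limits with fast rejection, can be sketched as follows. This is a minimal illustration under assumptions: the key layout, the `embedding_cache_key` function, and the `InflightLimiter` class are hypothetical and are not the code in `embeddings/cache_keys.py` or `embeddings/server.py`.

```python
import hashlib
import threading


def embedding_cache_key(kind: str, model: str, payload: str, normalize: bool) -> str:
    """Build a cache key that differs for normalize=True and normalize=False.

    The key layout is an assumption for illustration only.
    """
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"emb:{kind}:{model}:norm={int(normalize)}:{digest}"


class InflightLimiter:
    """Non-blocking admission control: reject instead of queueing."""

    def __init__(self, max_inflight: int) -> None:
        self._sem = threading.BoundedSemaphore(max_inflight)

    def try_acquire(self) -> bool:
        # Returns False immediately when the lane is full, so the caller
        # can answer with an over-capacity error rather than block.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()


# Separate limiters per lane; the image lane is kept stricter than text.
# The limits below are placeholder values, not the configured ones.
text_limiter = InflightLimiter(max_inflight=32)
image_limiter = InflightLimiter(max_inflight=2)
```

A request handler would check the cache first, then call `try_acquire()` on its lane's limiter, reject right away if that returns `False`, and otherwise run the model and call `release()` in a `finally` block.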