19 Mar, 2026

1 commit

  • - Text and image embedding are now split into separate
      services/processes, while still keeping a single replica as requested.
    The split lives in
    [embeddings/server.py](/data/saas-search/embeddings/server.py#L112),
    [config/services_config.py](/data/saas-search/config/services_config.py#L68),
    [providers/embedding.py](/data/saas-search/providers/embedding.py#L27),
    and the start scripts
    [scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L36),
    [scripts/start_embedding_text_service.sh](/data/saas-search/scripts/start_embedding_text_service.sh),
    [scripts/start_embedding_image_service.sh](/data/saas-search/scripts/start_embedding_image_service.sh).
    - Independent admission control is now in place: text and image have
      separate inflight limits, and the image limit can be kept much
    stricter than the text limit. The request handling, reject path,
    `/health`, and `/ready` endpoints are in
    [embeddings/server.py](/data/saas-search/embeddings/server.py#L613),
    [embeddings/server.py](/data/saas-search/embeddings/server.py#L786), and
    [embeddings/server.py](/data/saas-search/embeddings/server.py#L1028).
    - I checked the Redis embedding cache. It did exist, but there was a
      real flaw: cache keys did not distinguish `normalize=true` from
    `normalize=false`. I fixed that in
    [embeddings/cache_keys.py](/data/saas-search/embeddings/cache_keys.py#L6),
    and both text and image now use the same normalize-aware keying. I also
    added service-side BF16 cache hits that short-circuit before the model
    lane, so repeated requests no longer get throttled behind image
    inference.
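
The normalize-aware keying can be pictured with a minimal sketch (the function name and key layout are illustrative, not the actual `cache_keys.py` contents):

```python
import hashlib

def embedding_cache_key(modality: str, model: str,
                        payload: bytes, normalize: bool) -> str:
    """Build a Redis cache key for an embedding request.

    Including the `normalize` flag is the essential fix: without it,
    normalize=true and normalize=false responses collided on the same
    key. The key layout here is a sketch, not the real format.
    """
    digest = hashlib.sha256(payload).hexdigest()
    return f"emb:{modality}:{model}:norm={int(normalize)}:{digest}"
```

Because both modalities route through the same helper, a normalized request can never be served a raw vector that was cached earlier under the same input.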
    
    **What This Means**
    - Image pressure no longer blocks text, because they are on different
      ports/processes.
    - Repeated text/image requests now return from Redis without consuming
      model capacity.
    - Over-capacity requests are rejected quickly instead of sitting
      blocked.
    - I did not add a load balancer or multi-replica HA, per your GPU
      constraint. I also did not build Grafana/Prometheus dashboards in this
    pass, but `/health` now exposes the metrics needed to wire them.
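
The per-modality admission control described above can be sketched roughly as follows (the class, limit values, and names are illustrative, not the actual server code):

```python
import threading

class AdmissionGate:
    """Per-modality inflight limiter.

    Each modality gets its own gate, so image pressure cannot consume
    text capacity. Over-capacity requests are rejected immediately
    instead of queueing behind the model lane.
    """
    def __init__(self, max_inflight: int):
        self.max_inflight = max_inflight
        self._inflight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self._inflight >= self.max_inflight:
                return False          # fast reject path (e.g. HTTP 429)
            self._inflight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._inflight -= 1

# Separate gates: text generous, image much stricter (values illustrative).
TEXT_GATE = AdmissionGate(max_inflight=64)
IMAGE_GATE = AdmissionGate(max_inflight=8)
```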
    
    **Validation**
    - Tests passed: `.venv/bin/python -m pytest -q
      tests/test_embedding_pipeline.py
    tests/test_embedding_service_limits.py` -> `10 passed`
    - Stress test tool updates are in
      [scripts/perf_api_benchmark.py](/data/saas-search/scripts/perf_api_benchmark.py#L155)
    - Fresh benchmark of the split text service (port `6105`): 535
      requests in 3 s, 100% success, `174.56 rps`, avg `88.48 ms`
    - Fresh benchmark of the split image service (port `6108`): 1213
      requests in 3 s, 100% success, `403.32 rps`, avg `9.64 ms`
    - Live health after the run showed cache hits and non-zero cache-hit
      latency accounting:
      - text `avg_latency_ms=4.251`
      - image `avg_latency_ms=1.462`
    tangwang
     

12 Mar, 2026

3 commits


11 Mar, 2026

1 commit

  • ./scripts/start_tei_service.sh
    START_TEI=0 ./scripts/service_ctl.sh restart embedding
    
    curl -sS -X POST "http://127.0.0.1:6005/embed/text" \
      -H "Content-Type: application/json" \
      -d '["芭比娃娃 儿童玩具", "纯棉T恤 短袖"]'
    tangwang
     

10 Mar, 2026

1 commit

  • and the microservices (embedding/translate/rerank).
    
    **New Files**
    - Benchmark main script:
      [perf_api_benchmark.py](/data/saas-search/scripts/perf_api_benchmark.py:1)
    - Custom case template:
      [perf_cases.json.example](/data/saas-search/scripts/perf_cases.json.example:1)
    
    **Documentation Updates**
    - Added an "endpoint-level benchmark script" section to the API
      integration guide:
      [搜索API对接指南.md](/data/saas-search/docs/搜索API对接指南.md:2089)
    
    **Supported Scenarios**
    - `backend_search` -> `POST /search/`
    - `backend_suggest` -> `GET /search/suggestions`
    - `embed_text` -> `POST /embed/text`
    - `translate` -> `POST /translate`
    - `rerank` -> `POST /rerank`
    - `all` -> runs all of the above scenarios in sequence
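
The scenario-to-endpoint mapping above can be sketched as a small dispatch table (the dict shape is illustrative, not the script's actual internals):

```python
# Scenario names and endpoints are taken from the list above;
# the structure itself is a sketch of how dispatch could work.
SCENARIOS = {
    "backend_search":  ("POST", "/search/"),
    "backend_suggest": ("GET",  "/search/suggestions"),
    "embed_text":      ("POST", "/embed/text"),
    "translate":       ("POST", "/translate"),
    "rerank":          ("POST", "/rerank"),
}

def resolve(scenario: str) -> list[tuple[str, str]]:
    """Expand `all` to every scenario in order; otherwise a single entry."""
    if scenario == "all":
        return list(SCENARIOS.values())
    return [SCENARIOS[scenario]]
```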
    
    **Commands You Can Run Directly**
    1. `./.venv/bin/python scripts/perf_api_benchmark.py --scenario
       backend_suggest --tenant-id 162 --duration 30 --concurrency 50`
    2. `./.venv/bin/python scripts/perf_api_benchmark.py --scenario
       backend_search --tenant-id 162 --duration 30 --concurrency 20`
    3. `./.venv/bin/python scripts/perf_api_benchmark.py --scenario all
       --tenant-id 162 --duration 60 --concurrency 30 --output
    perf_reports/all.json`
    4. `./.venv/bin/python scripts/perf_api_benchmark.py --scenario all
       --tenant-id 162 --cases-file scripts/perf_cases.json.example
    --duration 60 --concurrency 40 --output perf_reports/custom_all.json`
    
    **Optional Flags**
    - `--backend-base` `--embedding-base` `--translator-base`
      `--reranker-base`: point these at your actual service addresses
    - `--max-requests`: cap the total number of requests
    - `--max-errors`: stop early once the error count reaches the threshold
    - `--pause`: pause between scenarios in `all` mode
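
The flag surface could be wired with `argparse` roughly like this (flag names come from the list above; defaults are illustrative, not the script's real values):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the benchmark script's CLI; defaults are assumptions."""
    p = argparse.ArgumentParser(prog="perf_api_benchmark.py")
    p.add_argument("--scenario", default="all")
    p.add_argument("--backend-base", default="http://127.0.0.1:8000")
    p.add_argument("--embedding-base", default="http://127.0.0.1:6005")
    p.add_argument("--translator-base")
    p.add_argument("--reranker-base")
    p.add_argument("--max-requests", type=int)   # cap total request count
    p.add_argument("--max-errors", type=int)     # early-stop threshold
    p.add_argument("--pause", type=float, default=0.0)  # gap between scenarios
    return p
```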
    
    **Verified Locally**
    - `backend_suggest` small-scale concurrency benchmark passed (200, 100% success rate)
    - `backend_search` small-scale concurrency benchmark passed (200, 100% success rate)
    - `translate` small-scale concurrency benchmark passed (200, 100% success rate)
    tangwang