25 Mar, 2026

6 commits


22 Mar, 2026

1 commit


21 Mar, 2026

2 commits


20 Mar, 2026

2 commits


19 Mar, 2026

5 commits

  • tangwang
     
  • tangwang
     
  • - Text and image embeddings are now split into separate
      services/processes, while still keeping a single replica as requested.
    The split lives in
    [embeddings/server.py](/data/saas-search/embeddings/server.py#L112),
    [config/services_config.py](/data/saas-search/config/services_config.py#L68),
    [providers/embedding.py](/data/saas-search/providers/embedding.py#L27),
    and the start scripts
    [scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L36),
    [scripts/start_embedding_text_service.sh](/data/saas-search/scripts/start_embedding_text_service.sh),
    [scripts/start_embedding_image_service.sh](/data/saas-search/scripts/start_embedding_image_service.sh).
    - Independent admission control is in place now: text and image have
      separate inflight limits, and image can be kept much stricter than
    text. The request handling, reject path, `/health`, and `/ready` are in
    [embeddings/server.py](/data/saas-search/embeddings/server.py#L613),
    [embeddings/server.py](/data/saas-search/embeddings/server.py#L786), and
    [embeddings/server.py](/data/saas-search/embeddings/server.py#L1028).
    - I checked the Redis embedding cache. It did exist, but there was a
      real flaw: cache keys did not distinguish `normalize=true` from
    `normalize=false`. I fixed that in
    [embeddings/cache_keys.py](/data/saas-search/embeddings/cache_keys.py#L6),
    and both text and image now use the same normalize-aware keying. I also
    added service-side BF16 cache hits that short-circuit before the model
    lane, so repeated requests no longer get throttled behind image
    inference.
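    The normalize-aware keying can be sketched as follows (the function
    name is illustrative; the real helper lives in
    embeddings/cache_keys.py, and the `{prefix}:{n0|n1}:{query}` format
    matches the key scheme described in the cache commit below):

```python
def embedding_cache_key(prefix: str, text: str, normalize: bool) -> str:
    # Bake the normalize flag into the key itself; without the n0/n1
    # marker, normalize=true and normalize=false requests would collide
    # on the same cached vector.
    flag = "n1" if normalize else "n0"
    return f"{prefix}:{flag}:{text}"
```

    Both the text and image encoders can share this helper, which is what
    makes the keying consistent across services.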
    
    **What This Means**
    - Image pressure no longer blocks text, because they are on different
      ports/processes.
    - Repeated text/image requests now return from Redis without consuming
      model capacity.
    - Over-capacity requests are rejected quickly instead of sitting
      blocked.
    - I did not add a load balancer or multi-replica HA, per your GPU
      constraint. I also did not build Grafana/Prometheus dashboards in this
    pass, but `/health` now exposes the metrics needed to wire them.
    
    **Validation**
    - Tests passed: `.venv/bin/python -m pytest -q
      tests/test_embedding_pipeline.py
    tests/test_embedding_service_limits.py` -> `10 passed`
    - Stress test tool updates are in
      [scripts/perf_api_benchmark.py](/data/saas-search/scripts/perf_api_benchmark.py#L155)
    - Fresh benchmark on split text service `6105`: 535 requests / 3s, 100%
      success, `174.56 rps`, avg `88.48 ms`
    - Fresh benchmark on split image service `6108`: 1213 requests / 3s,
      100% success, `403.32 rps`, avg `9.64 ms`
    - Live health after the run showed cache hits and non-zero cache-hit
      latency accounting:
      - text `avg_latency_ms=4.251`
      - image `avg_latency_ms=1.462`
    tangwang
     
  • The instability is very likely real overload, but `lsof -i :6005 | wc -l
    = 75` alone does not prove it. What does matter is the live shape of the
    service: it is a single `uvicorn` worker on port `6005`, and the code
    had one shared process handling both text and image requests, with image
    work serialized behind a single lock. Under bursty image traffic,
    requests could pile up and sit blocked with almost no useful tracing,
    which matches the “only blocking observed” symptom.
    
    The service now adds persistent log files, request IDs, per-request
    request/response/failure logs, text microbatch dispatch logs, health
    stats with active/rejected counts, and explicit overload admission
    control. New knobs are `TEXT_MAX_INFLIGHT`, `IMAGE_MAX_INFLIGHT`, and
    `EMBEDDING_OVERLOAD_STATUS_CODE`. Startup output now shows those limits
    and log paths in
    [scripts/start_embedding_service.sh](/data/saas-search/scripts/start_embedding_service.sh#L80).
    I also added focused tests in
    [tests/test_embedding_service_limits.py](/data/saas-search/tests/test_embedding_service_limits.py#L1).
    
    What this means operationally:
    - Text and image are still in one process, so this is not the final
      architecture.
    - But image spikes will now be rejected quickly once the image lane is
      full instead of sitting around and consuming the worker pool.
    - Logs will now show each request, each rejection, each microbatch
      dispatch, backend time, response time, and request ID.
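    A minimal sketch of the per-lane admission control, assuming the knobs
    named above (the class and default values here are illustrative, not
    the actual server code):

```python
import os

# Knobs from this change; the fallback values here are illustrative.
TEXT_MAX_INFLIGHT = int(os.environ.get("TEXT_MAX_INFLIGHT", "32"))
IMAGE_MAX_INFLIGHT = int(os.environ.get("IMAGE_MAX_INFLIGHT", "4"))
OVERLOAD_STATUS = int(os.environ.get("EMBEDDING_OVERLOAD_STATUS_CODE", "503"))

class InflightLimiter:
    """Reject new work as soon as `limit` requests are already in flight,
    instead of letting requests queue up behind the single worker."""

    def __init__(self, limit: int):
        self.limit = limit
        self.active = 0
        self.rejected = 0

    def try_acquire(self) -> bool:
        if self.active >= self.limit:
            self.rejected += 1   # caller answers with OVERLOAD_STATUS
            return False
        self.active += 1
        return True

    def release(self) -> None:
        self.active -= 1

# One limiter per lane keeps image admission independent from text.
text_lane = InflightLimiter(TEXT_MAX_INFLIGHT)
image_lane = InflightLimiter(IMAGE_MAX_INFLIGHT)
```

    The active/rejected counters are the kind of numbers the health stats
    can expose for dashboards later.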
    
    Verification:
    - Passed: `.venv/bin/python -m pytest -q
      tests/test_embedding_service_limits.py`
    - I also ran a wider test command, but 3 failures came from pre-existing
      drift in
    [tests/test_embedding_pipeline.py](/data/saas-search/tests/test_embedding_pipeline.py#L95),
    where the tests still monkeypatch `embeddings.text_encoder.redis.Redis`
    even though
    [embeddings/text_encoder.py](/data/saas-search/embeddings/text_encoder.py#L1)
    no longer imports `redis` that way.
    
    Switched the default CLIP_AS_SERVICE model to ViT-L-14 and
    consolidated this configuration into a single, changeable entry
    point. The default now lives in CLIP_AS_SERVICE_MODEL_NAME in
    embeddings/config.py (line 29), currently CN-CLIP/ViT-L-14;
    scripts/start_cnclip_service.sh (line 37) reads this config
    automatically instead of hard-coding the default model in the script,
    and still supports temporary overrides via CNCLIP_MODEL_NAME and
    --model-name. scripts/start_embedding_service.sh (line 29) and
    embeddings/server.py (line 425) also print the model information now,
    making it easy to check which configuration is actually connected.
    
    The docs were updated as well, mainly docs/CNCLIP_SERVICE说明文档.md
    (line 62) and embeddings/README.md (line 58): they now describe a
    "config-first, overridable" mechanism rather than a hard-coded model
    name; the related summary docs and internal notes were switched to
    the same config-driven wording.
    tangwang
     
  • Adopted the optimal T4 configuration: ct2_inter_threads=2,
    ct2_max_queued_batches=16, ct2_batch_type=examples. This gives NLLB
    significantly better online-style performance while keeping
    large-batch throughput roughly unchanged. I did not apply the same
    configuration to the two Marian models, because the focused report
    showed complicated trade-offs: opus-mt-zh-en is better balanced under
    the conservative defaults, while opus-mt-en-zh gained throughput but
    showed noticeable tail-latency swings at c=8.
    I also recorded the deployment/configuration lessons in
    /data/saas-search/translation/README.md and marked the optimization
    results in /data/saas-search/docs/TODO.txt. The key practical
    takeaways now on record: use CT2 + float16, keep a single worker, set
    NLLB's inter_threads to 2 and max_queued_batches to 16, avoid
    inter_threads=4 on T4 (it hurts high-batch throughput), and keep the
    Marian models on conservative defaults unless online/offline
    configurations are split.
    tangwang
     

18 Mar, 2026

4 commits

  • Implemented CTranslate2 for the three local translation models and
    switched the existing local_nllb / local_marian factories over to it.
    The new runtime lives in local_ctranslate2.py, including HF->CT2
    auto-conversion, float16 compute type mapping, Marian direction
    handling, and NLLB target-prefix decoding. The service wiring is in
    service.py (line 113), and the three model configs now point at explicit
    ctranslate2-float16 dirs in config.yaml (line 133).
    
    I also updated the setup path so this is usable end-to-end:
    ctranslate2>=4.7.0 was added to requirements_translator_service.txt and
    requirements.txt, the download script now supports pre-conversion in
    download_translation_models.py (line 27), and the docs/config examples
    were refreshed in translation/README.md. I installed ctranslate2 into
    .venv-translator, pre-converted all three models, and the CT2 artifacts
    are now already on disk:
    
    models/translation/facebook/nllb-200-distilled-600M/ctranslate2-float16
    models/translation/Helsinki-NLP/opus-mt-zh-en/ctranslate2-float16
    models/translation/Helsinki-NLP/opus-mt-en-zh/ctranslate2-float16
    Verification was solid. python3 -m compileall passed, direct
    TranslationService smoke tests ran successfully in .venv-translator, and
    the focused NLLB benchmark on the local GPU showed a clear win:
    
    batch_size=16: HF 0.347s/batch, 46.1 items/s vs CT2 0.130s/batch, 123.0
    items/s
    batch_size=1: HF 0.396s/request vs CT2 0.126s/request
    One caveat: translation quality on some very short phrases, especially
    opus-mt-en-zh, still looks a bit rough in smoke tests, so I’d run your
    real quality set before fully cutting over. If you want, I can take the
    next step and update the benchmark script/report so you have a fresh
    full CT2 performance report for all three models.
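    The convert-once, reuse-afterwards flow described above can be
    sketched like this. It uses the real `ct2-transformers-converter` CLI
    that ships with the ctranslate2 package; the helper name and default
    quantization are illustrative, and the resulting directory name
    matches the `ctranslate2-float16` artifacts listed above:

```python
import os
import subprocess

def ensure_ct2_dir(hf_dir: str, quantization: str = "float16") -> str:
    """Convert a local HF translation model to CTranslate2 once and
    reuse the converted artifact on subsequent startups."""
    ct2_dir = os.path.join(hf_dir, f"ctranslate2-{quantization}")
    if not os.path.isdir(ct2_dir):
        # Real CLI from the ctranslate2 package; runs only on first use.
        subprocess.run(
            ["ct2-transformers-converter",
             "--model", hf_dir,
             "--output_dir", ct2_dir,
             "--quantization", quantization],
            check=True,
        )
    return ct2_dir
```

    The service can then open the returned directory with
    `ctranslate2.Translator(ct2_dir, compute_type="float16")`.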
    tangwang
     
  • …and the "batch × concurrency matrix" is now presented completely
    separately.
    
    The changes are in these places:
    
    scripts/benchmark_translation_local_models.py: added --suite
    extended, which supports the combination matrix of
    batch_size=1,4,8,16,32,64, concurrency=1,2,4,8,16,64, constrained to
    batch_size * concurrency <= 128; single-scenario mode now loads only
    the target model, so load_seconds is cleaner, and --disable-cache is
    supported.
    translation/README.md: split the performance chapter into three parts
    (batch_sweep, concurrency_sweep, batch x concurrency matrix) and
    added this rerun's parameters, reproduction commands, and summary
    tables.
    perf_reports/20260318/translation_local_models/README.md: added a
    summary of this follow-up round. Full results are in
    translation_local_models_extended_221846.md and
    translation_local_models_extended_221846.json.
    
    The core conclusions of this rerun are clear:
    
    For online single requests, look at concurrency_sweep, i.e. the table
    with batch_size=1 fixed.
    For offline batch throughput, look at batch_sweep: all 4 directions
    reach their highest raw throughput at batch_size=64, but batch_size=16
    still looks like the more balanced default.
    The current local seq2seq backend holds a per-model lock, so raising
    client concurrency barely increases throughput; it mainly turns
    queueing time into a higher p95. Concurrency is therefore a "latency
    budget" question, not a way to scale throughput.
    In this round's online single-request tests, opus-mt-zh-en was the
    fastest; nllb-200-distilled-600m en->zh was the slowest and degraded
    most visibly under concurrency.
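    The constrained scenario matrix from the extended suite can be
    generated like this (function name illustrative; the value lists and
    the `batch_size * concurrency <= 128` cap come from the benchmark
    description above):

```python
from itertools import product

BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCY = [1, 2, 4, 8, 16, 64]

def scenario_matrix(limit: int = 128):
    """All (batch_size, concurrency) pairs whose product stays within
    the cap, so extreme combinations are skipped."""
    return [(b, c) for b, c in product(BATCH_SIZES, CONCURRENCY)
            if b * c <= limit]
```

    The cap keeps the sweep bounded: e.g. (64, 2) is included but (64, 4)
    is not.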
    tangwang
     
  • tangwang
     
  • tangwang
     

17 Mar, 2026

5 commits

  • tangwang
     
  • tangwang
     
  • tangwang
     
  • 2. Extracted a reusable embedding Redis cache class (shared by text
    and image)
    
    Details:
    1. The embedding cache now stores BF16 in Redis (restored to FP32 on
    read)
    Key behavior (implemented per the flow you specified):
    Before writing: FP32 embedding → (when normalize_embeddings=True) L2
    normalize → convert to BF16 → bytes (2 bytes/dim, big-endian) →
    redis.setex
    After reading: redis.get bytes → BF16 → restore FP32 (np.float32
    vector)
    Changes:
    Added embeddings/bf16.py
    Provides float32_to_bf16 / bf16_to_float32
    encode_embedding_for_redis(): FP32 → BF16 → bytes
    decode_embedding_from_redis(): bytes → BF16 → FP32
    l2_normalize_fp32(): normalizes on demand
    Modified embeddings/text_encoder.py
    The Redis value changed from pickle.dumps(np.ndarray) to BF16 bytes
    The cache key now carries a normalize marker: {prefix}:{n0|n1}:{query}
    (so requests with different normalize settings no longer share a
    cache entry)
    Modified tests/test_embedding_pipeline.py
    The cache-hit case now writes BF16 bytes and uses the new key:
    embedding:n1:cached-text
    Modified docs/缓存与Redis使用说明.md
    The embedding cache Key/Value format is documented as BF16 bytes +
    n0/n1
    Modified scripts/redis/redis_cache_health_check.py
    The embedding pattern is no longer hard-coded as embedding:*; it now
    reads REDIS_CONFIG["embedding_cache_prefix"]
    The value preview decodes BF16 (instead of pickle) and shows
    dim/bytes/dtype
    Self-check:
    With the environment activated, I ran a BF16 encode/decode round-trip
    sanity check: byte lengths and restored dimensions are correct, and a
    normalized vector reads back with a norm close to 1 (modulo BF16
    quantization error).
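    The BF16 round-trip can be sketched as follows. This is a simplified
    version (truncation instead of rounding; function names differ from
    the real helpers in embeddings/bf16.py): a bfloat16 value is just the
    top 16 bits of the float32 bit pattern, stored big-endian at 2
    bytes/dim:

```python
import numpy as np

def float32_to_bf16_bytes(vec: np.ndarray) -> bytes:
    """FP32 -> BF16 bytes (2 bytes/dim, big-endian): keep the top 16
    bits of each float32 word (simple truncation, no rounding)."""
    bits = vec.astype(np.float32).view(np.uint32)
    top = (bits >> 16).astype(np.uint16)
    return top.astype(">u2").tobytes()

def bf16_bytes_to_float32(raw: bytes) -> np.ndarray:
    """BF16 bytes -> FP32: shift the stored 16 bits back into the high
    half of a float32 word."""
    top = np.frombuffer(raw, dtype=">u2").astype(np.uint32)
    return (top << 16).view(np.float32)
```

    Values whose mantissa fits in 7 bits (e.g. 1.0, -0.5, 0.25) round-trip
    exactly; everything else picks up the BF16 quantization error noted
    above.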
    
    2. Extracted a reusable embedding Redis cache class (shared by text
    and image)
    Added
    embeddings/redis_embedding_cache.py: RedisEmbeddingCache
    Unified Redis initialization (reads REDIS_CONFIG)
    Unified BF16 bytes encoding/decoding (reuses embeddings/bf16.py)
    Unified expiry policy: writes use setex(expire_time); a hit calls
    expire(expire_time) afterwards, so the TTL refreshes on a sliding
    basis
    Unified error/bad-data handling: if decoding fails or the vector is
    not 1-D, is empty, or contains NaN/Inf, the key is deleted and
    treated as a miss
    Already wired in for reuse
    Text: embeddings/text_encoder.py
    uses self.cache = RedisEmbeddingCache(key_prefix=..., namespace="")
    key is still {prefix}:{query}
    Image: embeddings/image_encoder.py
    uses self.cache = RedisEmbeddingCache(key_prefix=..., namespace="image")
    key is still {prefix}:image:{url_or_path}
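    The sliding-TTL and bad-data behavior can be sketched as below. This
    is a simplified stand-in, not the real class: it takes any
    redis-py-style client, and stores plain FP32 bytes for brevity where
    the real class stores BF16 via embeddings/bf16.py:

```python
import numpy as np

class RedisEmbeddingCache:
    """Sketch: setex on write, expire() on every hit (sliding TTL),
    and corrupt values deleted and reported as a miss."""

    def __init__(self, client, expire_time: int = 3600):
        self.client = client          # any redis-py compatible client
        self.expire_time = expire_time

    def set(self, key: str, vec: np.ndarray) -> None:
        payload = np.asarray(vec, dtype=np.float32).tobytes()
        self.client.setex(key, self.expire_time, payload)

    def get(self, key: str):
        raw = self.client.get(key)
        if raw is None:
            return None
        vec = np.frombuffer(raw, dtype=np.float32)
        # Bad data (empty vector, NaN/Inf) is deleted and treated as a miss.
        if vec.size == 0 or not np.isfinite(vec).all():
            self.client.delete(key)
            return None
        self.client.expire(key, self.expire_time)  # sliding expiry
        return vec
```

    Passing the client in makes the class trivially testable with a fake
    Redis, which matches how the focused tests avoid a live server.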
    tangwang
     
  • tangwang
     

13 Mar, 2026

5 commits


12 Mar, 2026

6 commits


11 Mar, 2026

4 commits

  • tangwang
     
  • Removed the START_* control-variable logic; by default only the core
    services backend/indexer/frontend are started.
    Optional services now require an explicit command:
    ./scripts/service_ctl.sh start embedding translator reranker tei
    cnclip.
    Unified the translator port lookup on TRANSLATION_PORT (removed the
    TRANSLATOR_PORT compatibility path).
    Kept strict validation of unknown service names.
    Key file: service_ctl.sh
    Fixes for duplicated/ambiguous names
    Frontend port naming unified: FRONTEND_PORT is primary, PORT is only
    a fallback.
    start_frontend.sh now explicitly exports PORT="${FRONTEND_PORT}",
    fixing the case where FRONTEND_PORT was configured but the service
    still ran on 6003.
    Files: start_frontend.sh, frontend_server.py, env_config.py
    Log/PID naming cleanup continues
    The unified rule logs/<service>.log, logs/<service>.pid keeps rolling
    out.
    cnclip keeps logs/cnclip.log + logs/cnclip.pid.
    Files: service_ctl.sh, start_cnclip_service.sh, stop_cnclip_service.sh
    backend/indexer startup style extended to the remaining services
    frontend/translator are now also aligned on set -euo pipefail and
    exec the main process directly.
    Files: start_frontend.sh, start_translator.sh, start_backend.sh,
    start_indexer.sh
    Legacy entry-point cleanup
    Removed: start_servers.py, stop_reranker.sh, stop_translator.sh.
    Reranker stop logic merged into service_ctl (including
    VLLM::EngineCore cleanup).
    The benchmark script now uses the unified entry point:
    service_ctl.sh stop reranker.
    File: benchmark_reranker_1000docs.sh
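    A sketch of the FRONTEND_PORT-first, PORT-fallback port resolution
    (helper name illustrative; the real logic lives across env_config.py
    and start_frontend.sh, and 6003 as the default is inferred from the
    symptom this change fixes):

```python
import os

def resolve_frontend_port(env=None) -> int:
    """FRONTEND_PORT wins; PORT is only a fallback; assumed default 6003.
    Empty strings are treated as unset."""
    env = os.environ if env is None else env
    return int(env.get("FRONTEND_PORT") or env.get("PORT") or 6003)
```

    With this precedence, setting FRONTEND_PORT always takes effect even
    when a stale PORT is still exported.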
    tangwang
     
  • - The frontend JS no longer hard-codes the backend address: the
      default API_BASE_URL is an empty string and all search and suggest
    requests now use same-origin paths (/search/*); it is overridden only
    when window.API_BASE_URL is explicitly injected, so stale .env values
    such as http://43.166.252.75:6002 can no longer pollute browser
    requests.
    - Implemented a lightweight reverse proxy in
      scripts/frontend_server.py: it intercepts GET/POST/OPTIONS requests
    to /search/, /admin/, and /indexer/, forwards them server-side to
    local port 6002 (BACKEND_PROXY_URL, default http://127.0.0.1:6002),
    and returns the responses to the frontend unchanged.
    - The chain "browser → web server:6003 (auth) → GPU:6003 (this
      project's frontend) → GPU localhost:6002 (backend)" completely
    bypasses the separate Basic Auth on the web server's port 6002,
    fixing the problem where, from the public network, the frontend
    loaded but search requests were blocked by web:6002.
    - frontend_server no longer injects window.API_BASE_URL by default;
      it only injects the script into the HTML when
    FRONTEND_INJECT_API_BASE_URL=1 is set and API_BASE_URL is non-empty,
    so the default behavior is always same-origin calls with 6003
    proxying the backend.
    - Updated the static JS version numbers in frontend/index.html
      (tenant_facets_config.js and app.js) to force browsers to fetch the
    latest scripts, so old frontend builds stop using the hard-coded
    backend address.
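    The routing core of that proxy can be sketched as below (function
    name illustrative; the prefixes and the default backend URL are the
    ones stated in this commit):

```python
BACKEND_PROXY_URL = "http://127.0.0.1:6002"          # default backend
PROXIED_PREFIXES = ("/search/", "/admin/", "/indexer/")

def proxy_target(path: str):
    """Return the backend URL for paths the frontend server should
    forward, or None when the path is served as a same-origin static
    asset by the frontend itself."""
    if path.startswith(PROXIED_PREFIXES):
        return BACKEND_PROXY_URL + path
    return None
```

    The actual server wraps this decision around GET/POST/OPTIONS
    handlers and streams the backend response back unchanged.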
    
    Made-with: Cursor
    tangwang
     
  • tangwang