Commit 5a01af3c35615d4fd2cd6bd037237769a22abde4

Authored by tangwang
1 parent 6d71d8e0

Multimodal hash key adjustments: 1. add model_name to the key; 2. convert long text/url payloads to a hash

docs/常用查询 - sql.sql
... ... @@ -552,3 +552,29 @@ WHERE tenant_id = 162
552 552 ORDER BY id
553 553 LIMIT 50 OFFSET 0; -- change OFFSET to view a different page
554 554  
  555 +
  556 +-- ======================================
  557 +-- 12. Query incremental and full sync data for a shop
  558 +-- ======================================
  559 +1. View the shop configuration
  560 +```sql
  561 +cd /data/saas-search && MYSQL_PWD='qY8tgodLoA&KT#yQ' mysql -h 10.200.16.14 -P 3316 -u root saas -e "SELECT * FROM shoplazza_shop_config\G"
  562 +```
  563 +
  564 +2. View the incremental and full row counts
  565 +```sql
  566 +cd /data/saas-search && MYSQL_PWD='qY8tgodLoA&KT#yQ' mysql -h 10.200.16.14 -P 3316 -u root saas -e "
  567 +SELECT 'shoplazza_sync_log' AS table_name, COUNT(*) AS row_count FROM shoplazza_sync_log where tenant_id = 163
  568 +UNION ALL
  569 +SELECT 'shoplazza_product_index_increment', COUNT(*) FROM shoplazza_product_index_increment where tenant_id = 163;
  570 +"
  571 +```
  572 +-- ======================================
  573 +-- 13. Rebuild the index
  574 +-- ======================================
  575 +Delete all rows with tenant_id = 163 from the following two tables:
  576 +shoplazza_sync_log
  577 +shoplazza_product_index_increment
  578 +
  579 +Then trigger a reinstall:
  580 +https://47167113.myshoplaza.com/admin/oauth/redirect_from_partner_center?client_id=kqN5QTBARwPAEO_ThHi8mikyFC_4DLkwOOrzQsUL3L0
... ...
docs/缓存与Redis使用说明.md
... ... @@ -52,24 +52,27 @@
52 52 ### 2.1 Key Design
53 53  
54 54 - Shared helper: `embeddings/cache_keys.py`
55   -- Primary text key: `build_text_cache_key(text, normalize=...)`
56   -- Primary image key: `build_image_cache_key(url, normalize=...)`
  55 +- Primary text key (TEI/BGE): `build_text_cache_key(text, normalize=...)`
  56 +- Multimodal image (CN-CLIP `/embed/image`): `build_image_cache_key(url, normalize=..., model_name=...)`, where `model_name` comes from `services.embedding.image_backends.*.model_name` (the same value as `embeddings.config.CONFIG.MULTIMODAL_MODEL_NAME`)
  57 +- Multimodal text (CN-CLIP `/embed/clip_text`): `build_clip_text_cache_key(text, normalize=..., model_name=...)`
57 58 - Templates:
58 59  
59 60 ```text
60   -Text: {EMBEDDING_CACHE_PREFIX}:embed:norm{0|1}:{text}
61   -Image: {EMBEDDING_CACHE_PREFIX}:image:embed:norm{0|1}:{url_or_path}
  61 +TEI text: {EMBEDDING_CACHE_PREFIX}:embed:norm{0|1}:{text}
  62 +CN-CLIP image: {EMBEDDING_CACHE_PREFIX}:image:embed:{model_name}:txt:norm{0|1}:{url_or_path}
  63 +CN-CLIP text tower: {EMBEDDING_CACHE_PREFIX}:clip_text:embed:{model_name}:img:norm{0|1}:{text}
62 64 ```
63 65  
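64 66 - Field descriptions:
65 67 - `EMBEDDING_CACHE_PREFIX`: read from `REDIS_CONFIG["embedding_cache_prefix"]`; defaults to `"embedding"` and can be overridden with the `REDIS_EMBEDDING_CACHE_PREFIX` environment variable;
  68 + - `model_name`: for example `CN-CLIP/ViT-H-14`; switching models automatically moves to a new key space, so vectors with the old dimensionality are not mixed in;
66 69 - `norm1` / `norm0`: `normalize=true` / `normalize=false`, respectively;
67   - - `text` / `url_or_path`: currently the normalized raw input is used directly, without hashing
  70 + - `text` / `url_or_path`: after `strip`, if the length in Unicode code points is **≤** ``CACHE_KEY_RAW_BODY_MAX_CHARS`` (default 256, see ``embeddings/cache_keys.py``), the value is written verbatim as the key tail; **longer** values become ``h:sha256:<64 hex>`` (SHA-256 over the UTF-8 bytes). TEI text and multimodal keys share ``_stable_body_for_cache_key``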
68 71  
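69 72 Additional notes:
70 73  
71   -- This change unifies the raw key format as `embed:norm{0|1}:...`, which is clearer than starting with `norm:` and closer to the historical naming convention.
72   -- The current implementation **no longer supports the legacy key protocol**; only this one set of primary key rules is kept, to reduce runtime complexity and ambiguity.
  74 +- The TEI text raw key is still `embed:norm{0|1}:...` (the tail payload follows the rule above).
  75 +- The `txt` / `img` segments in multimodal key names are a project-internal convention (matching `embeddings/cache_keys.py`) that distinguishes the image lane (`txt`) from the clip text lane (`img`).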
73 76  
74 77 ### 2.2 Values and Types
75 78  
... ...
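To make the templates in section 2.1 concrete, the following sketch assembles full Redis keys from a prefix, an optional namespace, and a raw key. The `full_redis_key` helper is hypothetical; the `:` joining and the `image` / `clip_text` namespace names are assumptions inferred from the templates above, not taken from the actual `RedisEmbeddingCache` code.

```python
def full_redis_key(prefix: str, namespace: str, raw_key: str) -> str:
    """Hypothetical helper: join prefix, optional namespace, and raw key with ':'."""
    parts = [prefix]
    if namespace:  # the TEI text lane is assumed to have no namespace
        parts.append(namespace)
    parts.append(raw_key)
    return ":".join(parts)


# Raw keys as produced by the builders in embeddings/cache_keys.py (short payloads).
tei_key = full_redis_key("embedding", "", "embed:norm1:red dress")
image_key = full_redis_key(
    "embedding", "image", "embed:CN-CLIP/ViT-H-14:txt:norm1:https://example.com/a.jpg"
)
clip_text_key = full_redis_key(
    "embedding", "clip_text", "embed:CN-CLIP/ViT-H-14:img:norm0:red dress"
)

print(tei_key)        # embedding:embed:norm1:red dress
print(image_key)      # embedding:image:embed:CN-CLIP/ViT-H-14:txt:norm1:https://example.com/a.jpg
print(clip_text_key)  # embedding:clip_text:embed:CN-CLIP/ViT-H-14:img:norm0:red dress
```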
embeddings/README.md
... ... @@ -51,11 +51,12 @@
51 51 - There are now **two cache layers**:
52 52 - client side: `text_encoder.py` / `image_encoder.py`
53 53 - service side: `server.py`
54   -- Current primary key formats:
55   - - Text (TEI): `embedding:embed:norm{0|1}:{text}`
56   - - Image: `embedding:image:embed:norm{0|1}:{url_or_path}`
57   - - CN-CLIP text: `embedding:clip_text:clip_mm:norm{0|1}:{text}`
58   -- The current implementation no longer supports legacy key rules; only this one format is kept, which reduces code paths and cache ambiguity.
  54 +- Current primary key formats (`model_name` is `CONFIG.MULTIMODAL_MODEL_NAME`, the same value as in `services.embedding.image_backends`):
  55 + - Text (TEI): `embedding:embed:norm{0|1}:{text_or_sha256_digest}`
  56 + - CN-CLIP image: `embedding:image:embed:{model_name}:txt:norm{0|1}:{url_or_sha256_digest}`
  57 + - CN-CLIP text tower: `embedding:clip_text:embed:{model_name}:img:norm{0|1}:{text_or_sha256_digest}`
  58 +- Tail payload: inputs of length ≤ `CACHE_KEY_RAW_BODY_MAX_CHARS` (default 256, see `embeddings/cache_keys.py`) are used verbatim; longer inputs use `h:sha256:<hex>` (TEI and multimodal keys share the same helper function).
  59 +- Switching the multimodal model naturally moves to a new key space; old keys must be cleaned up manually or left to expire.
59 60  
60 61 ### Load Isolation and Rejection Policy
61 62  
... ...
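Since old keys are not removed automatically after a model switch, a cleanup job needs some way to identify them. The sketch below builds glob patterns for keys written under a previous model name and filters a list of keys with them. The helper names are hypothetical, `CN-CLIP/ViT-L-14` is only an illustrative old model name, and the key layout is assumed to follow the formats listed above.

```python
from fnmatch import fnmatchcase
from typing import List


def stale_multimodal_patterns(prefix: str, old_model: str) -> List[str]:
    """Hypothetical helper: glob patterns for both multimodal lanes of an old model."""
    return [
        f"{prefix}:image:embed:{old_model}:txt:*",
        f"{prefix}:clip_text:embed:{old_model}:img:*",
    ]


def select_stale_keys(keys: List[str], prefix: str, old_model: str) -> List[str]:
    """Return the keys that match any stale pattern, preserving input order."""
    patterns = stale_multimodal_patterns(prefix, old_model)
    return [k for k in keys if any(fnmatchcase(k, p) for p in patterns)]


keys = [
    "embedding:image:embed:CN-CLIP/ViT-L-14:txt:norm1:https://example.com/a.jpg",
    "embedding:image:embed:CN-CLIP/ViT-H-14:txt:norm1:https://example.com/a.jpg",
    "embedding:clip_text:embed:CN-CLIP/ViT-L-14:img:norm0:red dress",
    "embedding:embed:norm1:red dress",  # TEI text key, never model-scoped
]
stale = select_stale_keys(keys, "embedding", "CN-CLIP/ViT-L-14")
print(stale)  # the two ViT-L-14 keys
```

Against a live Redis, the same patterns could be passed as the `match` argument of a SCAN-based iteration rather than filtering in memory. A model name containing glob metacharacters such as `*`, `?`, or `[` would need escaping first.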
embeddings/cache_keys.py
1 1 """Shared cache key helpers for embedding inputs.
2 2  
3   -Current canonical raw-key format:
4   -- text (TEI/BGE): ``embed:norm1:<text>`` / ``embed:norm0:<text>``
5   -- image (CLIP): ``embed:norm1:<url>`` / ``embed:norm0:<url>``
6   -- clip_text (CN-CLIP text, same space as images): ``clip_mm:norm1:<text>`` / ``clip_mm:norm0:<text>``
  3 +Multimodal (CN-CLIP) raw keys include ``model_name`` so switching ViT-L / ViT-H does not reuse stale vectors.
  4 +
  5 +- Image: ``embed:{model_name}:txt:norm{0|1}:<url_or_digest>``
  6 +- Multimodal text (same space as /embed/image): ``embed:{model_name}:img:norm{0|1}:<text_or_digest>``
  7 +
  8 +TEI/BGE text (title_embedding, etc.): ``embed:norm{0|1}:<text_or_digest>``
  9 +
  10 +For an over-long URL/text (counted in Unicode code points, exceeding ``CACHE_KEY_RAW_BODY_MAX_CHARS``), the tail payload becomes
  11 +``h:sha256:<64 hex>`` to keep Redis keys from growing too long.
7 12  
8 13 `RedisEmbeddingCache` adds the configured key prefix and optional namespace on top.
9 14 """
10 15  
11 16 from __future__ import annotations
12 17  
  18 +import hashlib
  19 +
  20 +# Max length (Unicode codepoints) of the raw URL/text segment before switching to SHA256 digest form.
  21 +CACHE_KEY_RAW_BODY_MAX_CHARS = 256
  22 +
  23 +
  24 +def _stable_body_for_cache_key(body: str, *, max_chars: int | None = None) -> str:
  25 +    """
  26 +    Return ``body`` unchanged when ``len(body) <= max_chars``; otherwise a fixed-length digest key.
  27 +
  28 +    Hash is SHA-256 over UTF-8 bytes of ``body``; prefix ``h:sha256:`` avoids collision with literals.
  29 +    """
  30 +    if max_chars is None:
  31 +        max_chars = CACHE_KEY_RAW_BODY_MAX_CHARS
  32 +    if len(body) <= max_chars:
  33 +        return body
  34 +    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
  35 +    return f"h:sha256:{digest}"
  36 +
13 37  
14 38 def build_text_cache_key(text: str, *, normalize: bool) -> str:
15 39     normalized_text = str(text or "").strip()
16   -    return f"embed:norm{1 if normalize else 0}:{normalized_text}"
  40 +    payload = _stable_body_for_cache_key(normalized_text)
  41 +    return f"embed:norm{1 if normalize else 0}:{payload}"
17 42  
18 43  
19   -def build_image_cache_key(url: str, *, normalize: bool) -> str:
  44 +def build_image_cache_key(url: str, *, normalize: bool, model_name: str) -> str:
  45 +    """Logical cache key for CN-CLIP image vectors (the segment is named txt by project convention)."""
20 46     normalized_url = str(url or "").strip()
21   -    return f"embed:norm{1 if normalize else 0}:{normalized_url}"
  47 +    payload = _stable_body_for_cache_key(normalized_url)
  48 +    m = str(model_name or "").strip() or "unknown"
  49 +    return f"embed:{m}:txt:norm{1 if normalize else 0}:{payload}"
22 50  
23 51  
24   -def build_clip_text_cache_key(text: str, *, normalize: bool) -> str:
25   -    """CN-CLIP / multimodal text (same vector space as /embed/image)."""
  52 +def build_clip_text_cache_key(text: str, *, normalize: bool, model_name: str) -> str:
  53 +    """Logical cache key for the CN-CLIP text tower (same space as images; the segment is named img by project convention)."""
26 54     normalized_text = str(text or "").strip()
27   -    return f"clip_mm:norm{1 if normalize else 0}:{normalized_text}"
  55 +    payload = _stable_body_for_cache_key(normalized_text)
  56 +    m = str(model_name or "").strip() or "unknown"
  57 +    return f"embed:{m}:img:norm{1 if normalize else 0}:{payload}"
... ...
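The length threshold in `_stable_body_for_cache_key` is measured in code points, not bytes, which matters for CJK input. The following standalone sketch copies the helper's logic under the name `stable_body` (a name used only for this illustration) and shows the behaviour on both sides of the boundary.

```python
import hashlib

# Mirrors the default shown in this diff; the real constant lives in embeddings/cache_keys.py.
CACHE_KEY_RAW_BODY_MAX_CHARS = 256


def stable_body(body: str, max_chars: int = CACHE_KEY_RAW_BODY_MAX_CHARS) -> str:
    """Standalone copy of the helper's logic, for illustration only."""
    if len(body) <= max_chars:  # len() counts Unicode code points, not UTF-8 bytes
        return body
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"h:sha256:{digest}"


at_limit = "词" * 256    # 256 code points (768 UTF-8 bytes): kept verbatim
over_limit = "词" * 257  # one code point over the limit: replaced by a digest

print(stable_body(at_limit) == at_limit)  # True
print(len(stable_body(over_limit)))       # 73, i.e. len("h:sha256:") + 64 hex characters
```

One consequence of keeping short inputs verbatim: an input that is itself the literal string `h:sha256:` followed by 64 hex characters is stored unchanged, so the prefix makes a clash with ordinary text unlikely rather than impossible.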
embeddings/config.py
... ... @@ -37,6 +37,11 @@ class EmbeddingConfig(object):
37 37 self.CLIP_AS_SERVICE_MODEL_NAME = str(image_backend.get("model_name") or "CN-CLIP/ViT-H-14")
38 38  
39 39 self.IMAGE_MODEL_NAME = str(image_backend.get("model_name") or "ViT-H-14")
  40 + # Redis multimodal cache keys (image + clip_text) include this string; change model → new key space.
  41 + self.MULTIMODAL_MODEL_NAME = str(
  42 + image_backend.get("model_name")
  43 + or ("CN-CLIP/ViT-H-14" if self.USE_CLIP_AS_SERVICE else "ViT-H-14")
  44 + )
40 45 self.IMAGE_DEVICE = image_backend.get("device") # type: Optional[str]
41 46 self.IMAGE_BATCH_SIZE = int(image_backend.get("batch_size", 8))
42 47 self.IMAGE_NORMALIZE_EMBEDDINGS = bool(image_backend.get("normalize_embeddings", True))
... ...
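The fallback order for `MULTIMODAL_MODEL_NAME` in the hunk above can be sketched as a standalone function. `resolve_multimodal_model_name` is a hypothetical name; the logic simply restates the added lines, with `image_backend` assumed to be a plain dict.

```python
from typing import Any, Dict


def resolve_multimodal_model_name(image_backend: Dict[str, Any], use_clip_as_service: bool) -> str:
    """Prefer the configured model_name; otherwise fall back by backend type."""
    return str(
        image_backend.get("model_name")
        or ("CN-CLIP/ViT-H-14" if use_clip_as_service else "ViT-H-14")
    )


print(resolve_multimodal_model_name({}, True))                                  # CN-CLIP/ViT-H-14
print(resolve_multimodal_model_name({}, False))                                 # ViT-H-14
print(resolve_multimodal_model_name({"model_name": "CN-CLIP/ViT-L-14"}, True))  # CN-CLIP/ViT-L-14
```

Because the fallback uses `or`, an empty-string `model_name` is treated the same as a missing one.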
embeddings/image_encoder.py
... ... @@ -12,6 +12,7 @@ logger = logging.getLogger(__name__)
12 12 from config.loader import get_app_config
13 13 from config.services_config import get_embedding_image_backend_config, get_embedding_image_base_url
14 14 from embeddings.cache_keys import build_clip_text_cache_key, build_image_cache_key
  15 +from embeddings.config import CONFIG
15 16 from embeddings.redis_embedding_cache import RedisEmbeddingCache
16 17 from request_log_context import build_downstream_request_headers, build_request_log_extra
17 18  
... ... @@ -31,6 +32,7 @@ class CLIPImageEncoder:
31 32 self.clip_text_endpoint = f"{self.service_url}/embed/clip_text"
32 33 # Reuse embedding cache prefix, but separate namespace for images to avoid collisions.
33 34 self.cache_prefix = str(redis_config.embedding_cache_prefix).strip() or "embedding"
  35 + self._mm_model_name = CONFIG.MULTIMODAL_MODEL_NAME
34 36 logger.info("Creating CLIPImageEncoder instance with service URL: %s", self.service_url)
35 37 self.cache = RedisEmbeddingCache(
36 38 key_prefix=self.cache_prefix,
... ... @@ -171,7 +173,9 @@ class CLIPImageEncoder:
171 173 """
172 174 CN-CLIP text tower (same vector space as ``/embed/image``); corresponds to the server-side ``POST /embed/clip_text``.
173 175 """
174   - cache_key = build_clip_text_cache_key(text, normalize=normalize_embeddings)
  176 + cache_key = build_clip_text_cache_key(
  177 + text, normalize=normalize_embeddings, model_name=self._mm_model_name
  178 + )
175 179 cached = self._clip_text_cache.get(cache_key)
176 180 if cached is not None:
177 181 return cached
... ... @@ -216,7 +220,9 @@ class CLIPImageEncoder:
216 220 Returns:
217 221 Embedding vector
218 222 """
219   - cache_key = build_image_cache_key(url, normalize=normalize_embeddings)
  223 + cache_key = build_image_cache_key(
  224 + url, normalize=normalize_embeddings, model_name=self._mm_model_name
  225 + )
220 226 cached = self.cache.get(cache_key)
221 227 if cached is not None:
222 228 return cached
... ... @@ -267,7 +273,9 @@ class CLIPImageEncoder:
267 273  
268 274 normalized_urls = [str(u).strip() for u in images] # type: ignore[list-item]
269 275 for pos, url in enumerate(normalized_urls):
270   - cache_key = build_image_cache_key(url, normalize=normalize_embeddings)
  276 + cache_key = build_image_cache_key(
  277 + url, normalize=normalize_embeddings, model_name=self._mm_model_name
  278 + )
271 279 cached = self.cache.get(cache_key)
272 280 if cached is not None:
273 281 results.append(cached)
... ... @@ -297,7 +305,12 @@ class CLIPImageEncoder:
297 305 vec = np.array(embedding, dtype=np.float32)
298 306 if vec.ndim != 1 or vec.size == 0 or not np.isfinite(vec).all():
299 307 raise RuntimeError(f"Invalid image embedding returned for URL: {url}")
300   - self.cache.set(build_image_cache_key(url, normalize=normalize_embeddings), vec)
  308 + self.cache.set(
  309 + build_image_cache_key(
  310 + url, normalize=normalize_embeddings, model_name=self._mm_model_name
  311 + ),
  312 + vec,
  313 + )
301 314 pos = pending_positions[i + j]
302 315 results[pos] = vec
303 316  
... ...
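The batch path in the hunks above follows a read-through pattern: look up every URL, remember the positions that missed, embed only those, then write the results back under the same key. A minimal sketch of that pattern, assuming a plain dict in place of `RedisEmbeddingCache` and hypothetical `key_for` / `embed_many` callables:

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def encode_batch_with_cache(
    urls: List[str],
    cache: Dict[str, Vector],
    key_for: Callable[[str], str],
    embed_many: Callable[[List[str]], List[Vector]],
) -> List[Optional[Vector]]:
    """Return one vector per URL, calling embed_many only for cache misses."""
    results: List[Optional[Vector]] = []
    pending_positions: List[int] = []
    for pos, url in enumerate(urls):
        cached = cache.get(key_for(url))
        results.append(cached)
        if cached is None:
            pending_positions.append(pos)
    if pending_positions:
        vectors = embed_many([urls[p] for p in pending_positions])
        for pos, vec in zip(pending_positions, vectors):
            cache[key_for(urls[pos])] = vec  # must be the same key used for the lookup
            results[pos] = vec
    return results


# Usage with a fake embedder that records what it was asked to embed.
calls: List[List[str]] = []


def fake_embed(batch: List[str]) -> List[Vector]:
    calls.append(list(batch))
    return [[float(len(u))] for u in batch]


cache: Dict[str, Vector] = {"k:a": [1.0]}
out = encode_batch_with_cache(["a", "bb", "ccc"], cache, lambda u: "k:" + u, fake_embed)
print(out)    # [[1.0], [2.0], [3.0]]
print(calls)  # [['bb', 'ccc']]
```

The lookup and the write-back must build the key with identical arguments, which is why this commit passes `model_name` at every call site rather than only at the read.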
embeddings/server.py
... ... @@ -23,7 +23,11 @@ from fastapi.concurrency import run_in_threadpool
23 23  
24 24 from config.env_config import REDIS_CONFIG
25 25 from config.services_config import get_embedding_backend_config
26   -from embeddings.cache_keys import build_clip_text_cache_key, build_image_cache_key, build_text_cache_key
  26 +from embeddings.cache_keys import (
  27 + build_clip_text_cache_key as _mm_clip_text_cache_key,
  28 + build_image_cache_key as _mm_image_cache_key,
  29 + build_text_cache_key,
  30 +)
27 31 from embeddings.config import CONFIG
28 32 from embeddings.protocols import ImageEncoderProtocol
29 33 from embeddings.redis_embedding_cache import RedisEmbeddingCache
... ... @@ -763,10 +767,14 @@ def _try_full_image_lane_cache_hit(
763 767 out: List[Optional[List[float]]] = []
764 768 for item in items:
765 769 if lane == "image":
766   - ck = build_image_cache_key(item, normalize=effective_normalize)
  770 + ck = _mm_image_cache_key(
  771 + item, normalize=effective_normalize, model_name=CONFIG.MULTIMODAL_MODEL_NAME
  772 + )
767 773 cached = _image_cache.get(ck)
768 774 else:
769   - ck = build_clip_text_cache_key(item, normalize=effective_normalize)
  775 + ck = _mm_clip_text_cache_key(
  776 + item, normalize=effective_normalize, model_name=CONFIG.MULTIMODAL_MODEL_NAME
  777 + )
770 778 cached = _clip_text_cache.get(ck)
771 779 if cached is None:
772 780 return None
... ... @@ -801,10 +809,14 @@ def _embed_image_lane_impl(
801 809 cache_hits = 0
802 810 for idx, item in enumerate(items):
803 811 if lane == "image":
804   - ck = build_image_cache_key(item, normalize=effective_normalize)
  812 + ck = _mm_image_cache_key(
  813 + item, normalize=effective_normalize, model_name=CONFIG.MULTIMODAL_MODEL_NAME
  814 + )
805 815 cached = _image_cache.get(ck)
806 816 else:
807   - ck = build_clip_text_cache_key(item, normalize=effective_normalize)
  817 + ck = _mm_clip_text_cache_key(
  818 + item, normalize=effective_normalize, model_name=CONFIG.MULTIMODAL_MODEL_NAME
  819 + )
808 820 cached = _clip_text_cache.get(ck)
809 821 if cached is not None:
810 822 vec = _as_list(cached, normalize=False)
... ... @@ -1497,3 +1509,17 @@ async def embed_clip_text(
1497 1509 priority=priority,
1498 1510 preview_chars=_LOG_TEXT_PREVIEW_CHARS,
1499 1511 )
  1512 +
  1513 +
  1514 +def build_image_cache_key(url: str, *, normalize: bool, model_name: Optional[str] = None) -> str:
  1515 +    """Tests/tools: same key as ``/embed/image`` lane; defaults to ``CONFIG.MULTIMODAL_MODEL_NAME``."""
  1516 +    return _mm_image_cache_key(
  1517 +        url, normalize=normalize, model_name=model_name or CONFIG.MULTIMODAL_MODEL_NAME
  1518 +    )
  1519 +
  1520 +
  1521 +def build_clip_text_cache_key(text: str, *, normalize: bool, model_name: Optional[str] = None) -> str:
  1522 +    """Tests/tools: same key as ``/embed/clip_text`` lane; defaults to ``CONFIG.MULTIMODAL_MODEL_NAME``."""
  1523 +    return _mm_clip_text_cache_key(
  1524 +        text, normalize=normalize, model_name=model_name or CONFIG.MULTIMODAL_MODEL_NAME
  1525 +    )
... ...
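The wrappers at the end of `server.py` layer a default on top of the core builders, and the two fallbacks interact in a way that is easy to miss. The sketch below reproduces both layers for the image lane only, with the hashing step omitted for brevity; `DEFAULT_MODEL_NAME`, `core_image_key`, and `image_key_with_default` are illustrative stand-ins, not names from the codebase.

```python
from typing import Optional

DEFAULT_MODEL_NAME = "CN-CLIP/ViT-H-14"  # stand-in for CONFIG.MULTIMODAL_MODEL_NAME


def core_image_key(url: str, *, normalize: bool, model_name: str) -> str:
    """Same shape as build_image_cache_key in cache_keys.py, minus the digest step."""
    m = str(model_name or "").strip() or "unknown"
    return f"embed:{m}:txt:norm{1 if normalize else 0}:{str(url or '').strip()}"


def image_key_with_default(url: str, *, normalize: bool, model_name: Optional[str] = None) -> str:
    """Same shape as the server.py wrapper: None or '' falls back to the default."""
    return core_image_key(url, normalize=normalize, model_name=model_name or DEFAULT_MODEL_NAME)


url = "https://example.com/a.jpg"
print(image_key_with_default(url, normalize=True))
# embed:CN-CLIP/ViT-H-14:txt:norm1:https://example.com/a.jpg
print(image_key_with_default(url, normalize=True, model_name="   "))
# embed:unknown:txt:norm1:https://example.com/a.jpg
```

A whitespace-only `model_name` is truthy, so it bypasses the wrapper's default and is then stripped to an empty string by the core builder, ending up in the `unknown` key space.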
scripts/create_tenant_index.sh
... ... @@ -61,6 +61,7 @@ echo "Deleting index: $ES_INDEX"
61 61 echo
62 62 curl -X DELETE "${ES_HOST}/${ES_INDEX}" $AUTH_PARAM -s -o /dev/null -w "HTTP status code: %{http_code}\n"
63 63  
  64 +
64 65 echo
65 66 echo "Creating index: $ES_INDEX"
66 67 echo
... ...
tests/test_cache_keys.py 0 → 100644
... ... @@ -0,0 +1,47 @@
  1 +"""Unit tests for embeddings/cache_keys.py (hashing long bodies)."""
  2 +
  3 +import hashlib
  4 +
  5 +from embeddings import cache_keys as ck
  6 +
  7 +
  8 +def test_stable_body_short_unchanged():
  9 +    s = "a" * ck.CACHE_KEY_RAW_BODY_MAX_CHARS
  10 +    assert ck._stable_body_for_cache_key(s) == s
  11 +
  12 +
  13 +def test_stable_body_long_hashes():
  14 +    s = "a" * (ck.CACHE_KEY_RAW_BODY_MAX_CHARS + 1)
  15 +    out = ck._stable_body_for_cache_key(s)
  16 +    assert out == "h:sha256:" + hashlib.sha256(s.encode("utf-8")).hexdigest()
  17 +    assert out.startswith("h:sha256:")
  18 +    assert len(out) == len("h:sha256:") + 64
  19 +
  20 +
  21 +def test_stable_body_utf8_counts_unicode_codepoints():
  22 +    # 2 code points (6 UTF-8 bytes): still short
  23 +    s = "你好"
  24 +    assert ck._stable_body_for_cache_key(s) == s
  25 +
  26 +
  27 +def test_build_text_cache_key_uses_digest_when_long():
  28 +    # Default max 256: 257 'x' -> digest
  29 +    long_text = "x" * (ck.CACHE_KEY_RAW_BODY_MAX_CHARS + 1)
  30 +    key = ck.build_text_cache_key(long_text, normalize=True)
  31 +    assert key.startswith("embed:norm1:h:sha256:")
  32 +    digest = hashlib.sha256(long_text.encode("utf-8")).hexdigest()
  33 +    assert key == f"embed:norm1:h:sha256:{digest}"
  34 +
  35 +
  36 +def test_build_image_cache_key_uses_digest_when_long():
  37 +    url = "https://x.example/" + "y" * ck.CACHE_KEY_RAW_BODY_MAX_CHARS
  38 +    key = ck.build_image_cache_key(url, normalize=True, model_name="CN-CLIP/ViT-H-14")
  39 +    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
  40 +    assert key == f"embed:CN-CLIP/ViT-H-14:txt:norm1:h:sha256:{digest}"
  41 +
  42 +
  43 +def test_build_clip_text_cache_key_uses_digest_when_long():
  44 +    t = "词" * (ck.CACHE_KEY_RAW_BODY_MAX_CHARS + 1)
  45 +    key = ck.build_clip_text_cache_key(t, normalize=False, model_name="m")
  46 +    digest = hashlib.sha256(t.encode("utf-8")).hexdigest()
  47 +    assert key == f"embed:m:img:norm0:h:sha256:{digest}"
... ...
tests/test_embedding_pipeline.py
... ... @@ -16,6 +16,7 @@ from embeddings.image_encoder import CLIPImageEncoder
16 16 from embeddings.text_embedding_tei import TEITextModel
17 17 from embeddings.bf16 import encode_embedding_for_redis
18 18 from embeddings.cache_keys import build_image_cache_key, build_text_cache_key
  19 +from embeddings.config import CONFIG
19 20 from query import QueryParser
20 21 from context.request_context import create_request_context, set_current_request_context, clear_current_request_context
21 22  
... ... @@ -207,7 +208,9 @@ def test_image_embedding_encoder_cache_hit(monkeypatch):
207 208 fake_cache = _FakeEmbeddingCache()
208 209 cached = np.array([0.5, 0.6], dtype=np.float32)
209 210 url = "https://example.com/a.jpg"
210   - fake_cache.store[build_image_cache_key(url, normalize=True)] = cached
  211 + fake_cache.store[
  212 + build_image_cache_key(url, normalize=True, model_name=CONFIG.MULTIMODAL_MODEL_NAME)
  213 + ] = cached
211 214 monkeypatch.setattr("embeddings.image_encoder.RedisEmbeddingCache", lambda **kwargs: fake_cache)
212 215  
213 216 calls = {"count": 0}
... ...