Commit 5a01af3c35615d4fd2cd6bd037237769a22abde4
1 parent
6d71d8e0
Adjust multimodal hash keys: 1. add model_name; 2. hash text/url
Showing 10 changed files with 187 additions and 32 deletions
docs/常用查询 - sql.sql
| ... | ... | @@ -552,3 +552,29 @@ WHERE tenant_id = 162 |
| 552 | 552 | ORDER BY id |
| 553 | 553 | LIMIT 50 OFFSET 0; -- change OFFSET to view other pages |
| 554 | 554 | |
| 555 | + | |
| 556 | +-- ====================================== | |
| 557 | +-- 12. Query a shop's incremental and full sync data | |
| 558 | +-- ====================================== | |
| 559 | +1. View the shop configuration | |
| 560 | +```sql | |
| 561 | +cd /data/saas-search && MYSQL_PWD='qY8tgodLoA&KT#yQ' mysql -h 10.200.16.14 -P 3316 -u root saas -e "SELECT * FROM shoplazza_shop_config\G" | |
| 562 | +``` | |
| 563 | + | |
| 564 | +2. Count incremental and full sync rows | |
| 565 | +```sql | |
| 566 | +cd /data/saas-search && MYSQL_PWD='qY8tgodLoA&KT#yQ' mysql -h 10.200.16.14 -P 3316 -u root saas -e " | |
| 567 | +SELECT 'shoplazza_sync_log' AS table_name, COUNT(*) AS row_count FROM shoplazza_sync_log where tenant_id = 163 | |
| 568 | +UNION ALL | |
| 569 | +SELECT 'shoplazza_product_index_increment', COUNT(*) FROM shoplazza_product_index_increment where tenant_id = 163; | |
| 570 | +" | |
| 571 | +``` | |
| 572 | +-- ====================================== | |
| 573 | +-- 13. Rebuild the index | |
| 574 | +-- ====================================== | |
| 575 | +Delete all rows with tenant_id=163 from the two tables below: | |
| 576 | +shoplazza_sync_log | |
| 577 | +shoplazza_product_index_increment | |
| 578 | + | |
| 579 | +Then trigger a reinstall: | |
| 580 | +https://47167113.myshoplaza.com/admin/oauth/redirect_from_partner_center?client_id=kqN5QTBARwPAEO_ThHi8mikyFC_4DLkwOOrzQsUL3L0 | ... | ... |
docs/缓存与Redis使用说明.md
| ... | ... | @@ -52,24 +52,27 @@ |
| 52 | 52 | ### 2.1 Key design |
| 53 | 53 | |
| 54 | 54 | - Shared helper: `embeddings/cache_keys.py` |
| 55 | -- Primary text key: `build_text_cache_key(text, normalize=...)` | |
| 56 | -- Primary image key: `build_image_cache_key(url, normalize=...)` | |
| 55 | +- Primary text key (TEI/BGE): `build_text_cache_key(text, normalize=...)` | |
| 56 | +- Multimodal image (CN-CLIP `/embed/image`): `build_image_cache_key(url, normalize=..., model_name=...)`, where `model_name` comes from `services.embedding.image_backends.*.model_name` (the same value as `embeddings.config.CONFIG.MULTIMODAL_MODEL_NAME`) | |
| 57 | +- Multimodal text (CN-CLIP `/embed/clip_text`): `build_clip_text_cache_key(text, normalize=..., model_name=...)` | |
| 57 | 58 | - Templates: |
| 58 | 59 | |
| 59 | 60 | ```text |
| 60 | -Text: {EMBEDDING_CACHE_PREFIX}:embed:norm{0|1}:{text} | |
| 61 | -Image: {EMBEDDING_CACHE_PREFIX}:image:embed:norm{0|1}:{url_or_path} | |
| 61 | +TEI text: {EMBEDDING_CACHE_PREFIX}:embed:norm{0|1}:{text} | |
| 62 | +CN-CLIP image: {EMBEDDING_CACHE_PREFIX}:image:embed:{model_name}:txt:norm{0|1}:{url_or_path} | |
| 63 | +CN-CLIP text tower: {EMBEDDING_CACHE_PREFIX}:clip_text:embed:{model_name}:img:norm{0|1}:{text} | |
| 62 | 64 | ``` |
| 63 | 65 | |
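| 64 | 66 | - Field descriptions: |
| 65 | 67 | - `EMBEDDING_CACHE_PREFIX`: taken from `REDIS_CONFIG["embedding_cache_prefix"]`; defaults to `"embedding"` and can be overridden with the `REDIS_EMBEDDING_CACHE_PREFIX` environment variable; |
| 68 | + - `model_name`: for example `CN-CLIP/ViT-H-14`; switching models automatically moves to a new key space, which avoids mixing in vectors of the old dimension; | |
| 66 | 69 | - `norm1` / `norm0`: correspond to `normalize=true` / `normalize=false`, respectively; |
| 67 | - - `text` / `url_or_path`: currently still uses the normalized raw input directly, without hashing. | |
| 68 | + - `text` / `url_or_path`: after `strip`, if the length in Unicode code points is **≤** `CACHE_KEY_RAW_BODY_MAX_CHARS` (default 256, see `embeddings/cache_keys.py`), the value is written verbatim as the key tail; if it is **longer**, the tail becomes `h:sha256:<64 hex>` (SHA-256 over the UTF-8 bytes). TEI text and multimodal keys share `_stable_body_for_cache_key`. | |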
| 68 | 71 | |
| 69 | 72 | Additional notes: |
| 70 | 73 | |
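| 71 | -- This change unifies the raw key format as `embed:norm{0|1}:...`, which is clearer than a leading `norm:` and closer to the historical naming convention. | |
| 72 | -- The current implementation **no longer supports the legacy key protocol**; only this one primary key scheme is kept, to reduce runtime complexity and ambiguity. | |
| 74 | +- The TEI text raw key is still `embed:norm{0|1}:...` (the tail payload follows the rule above). | |
| 75 | +- The `txt` / `img` segments in multimodal key names are a project-internal convention (consistent with `embeddings/cache_keys.py`) that distinguishes the image lane from the clip text lane. | |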
| 73 | 76 | |
| 74 | 77 | ### 2.2 Value and types |
| 75 | 78 | ... | ... |
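The templates documented above can be read as "prefix, optional namespace, raw key" joined by colons. The sketch below reproduces that layout only; `full_redis_key` is a hypothetical helper, and the join rule is an assumption inferred from the templates, since the diff does not show how `RedisEmbeddingCache` actually assembles its keys.

```python
def full_redis_key(prefix: str, namespace: str, raw_key: str) -> str:
    """Join prefix, optional namespace, and raw key with ':' (assumed join rule)."""
    parts = [prefix, namespace, raw_key]
    # Skip empty parts so the TEI lane (no namespace) gets no double colon.
    return ":".join(part for part in parts if part)


# TEI text lane: no namespace segment in the documented template.
tei_key = full_redis_key("embedding", "", "embed:norm1:red dress")

# CN-CLIP image lane: "image" namespace, model name, and the "txt" segment.
image_key = full_redis_key(
    "embedding", "image",
    "embed:CN-CLIP/ViT-H-14:txt:norm1:https://example.com/a.jpg",
)

# CN-CLIP text tower lane: "clip_text" namespace and the "img" segment.
clip_text_key = full_redis_key(
    "embedding", "clip_text",
    "embed:CN-CLIP/ViT-H-14:img:norm0:red dress",
)

print(tei_key)
print(image_key)
print(clip_text_key)
```

Because the model name is part of the key, a lookup made after a model switch cannot hit an entry written under the previous model.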
embeddings/README.md
| ... | ... | @@ -51,11 +51,12 @@ |
| 51 | 51 | - Caching is now **two-layer**: |
| 52 | 52 | - client side: `text_encoder.py` / `image_encoder.py` |
| 53 | 53 | - service side: `server.py` |
| 54 | -- Current primary key formats: | |
| 55 | - - Text (TEI): `embedding:embed:norm{0|1}:{text}` | |
| 56 | - - Image: `embedding:image:embed:norm{0|1}:{url_or_path}` | |
| 57 | - - CN-CLIP text: `embedding:clip_text:clip_mm:norm{0|1}:{text}` | |
| 58 | -- The current implementation no longer supports legacy key rules; only this one format is kept, to reduce code paths and cache ambiguity. | |
| 54 | +- Current primary key formats (`model_name` comes from `CONFIG.MULTIMODAL_MODEL_NAME`, consistent with `services.embedding.image_backends`): | |
| 55 | + - Text (TEI): `embedding:embed:norm{0|1}:{text_or_sha256_digest}` | |
| 56 | + - CN-CLIP image: `embedding:image:embed:{model_name}:txt:norm{0|1}:{url_or_sha256_digest}` | |
| 57 | + - CN-CLIP text tower: `embedding:clip_text:embed:{model_name}:img:norm{0|1}:{text_or_sha256_digest}` | |
| 58 | +- Tail payload: a body of length ≤ `CACHE_KEY_RAW_BODY_MAX_CHARS` (default 256, see `embeddings/cache_keys.py`) is used verbatim; a longer one becomes `h:sha256:<hex>` (TEI and multimodal keys share the same helper function). | |
| 59 | +- Switching the multimodal model naturally moves to a new key space; old keys must be cleaned up manually or left to expire. | |
| 59 | 60 | |
| 60 | 61 | ### Load isolation and rejection policy |
| 61 | 62 | ... | ... |
embeddings/cache_keys.py
| 1 | 1 | """Shared cache key helpers for embedding inputs. |
| 2 | 2 | |
| 3 | -Current canonical raw-key format: | |
| 4 | -- text (TEI/BGE): ``embed:norm1:<text>`` / ``embed:norm0:<text>`` | |
| 5 | -- image (CLIP): ``embed:norm1:<url>`` / ``embed:norm0:<url>`` | |
| 6 | -- clip_text (CN-CLIP text, same vector space as images): ``clip_mm:norm1:<text>`` / ``clip_mm:norm0:<text>`` | |
| 3 | +Multimodal (CN-CLIP) raw keys include ``model_name`` so switching ViT-L / ViT-H does not reuse stale vectors. | |
| 4 | + | |
| 5 | +- Image: ``embed:{model_name}:txt:norm{0|1}:<url_or_digest>`` | |
| 6 | +- Multimodal text (same space as /embed/image): ``embed:{model_name}:img:norm{0|1}:<text_or_digest>`` | |
| 7 | + | |
| 8 | +TEI/BGE text (title_embedding, etc.): ``embed:norm{0|1}:<text_or_digest>`` | |
| 9 | + | |
| 10 | +For an over-long URL/text (counted in Unicode code points, exceeding ``CACHE_KEY_RAW_BODY_MAX_CHARS``), the tail payload becomes | |
| 11 | +``h:sha256:<64 hex>`` so that the Redis key does not grow too long. | |
| 7 | 12 | |
| 8 | 13 | `RedisEmbeddingCache` adds the configured key prefix and optional namespace on top. |
| 9 | 14 | """ |
| 10 | 15 | |
| 11 | 16 | from __future__ import annotations |
| 12 | 17 | |
| 18 | +import hashlib | |
| 19 | + | |
| 20 | +# Max length (Unicode codepoints) of the raw URL/text segment before switching to SHA256 digest form. | |
| 21 | +CACHE_KEY_RAW_BODY_MAX_CHARS = 256 | |
| 22 | + | |
| 23 | + | |
| 24 | +def _stable_body_for_cache_key(body: str, *, max_chars: int | None = None) -> str: | |
| 25 | + """ | |
| 26 | + Return ``body`` unchanged when ``len(body) <= max_chars``; otherwise a fixed-length digest key. | |
| 27 | + | |
| 28 | + Hash is SHA-256 over UTF-8 bytes of ``body``; prefix ``h:sha256:`` avoids collision with literals. | |
| 29 | + """ | |
| 30 | + if max_chars is None: | |
| 31 | + max_chars = CACHE_KEY_RAW_BODY_MAX_CHARS | |
| 32 | + if len(body) <= max_chars: | |
| 33 | + return body | |
| 34 | + digest = hashlib.sha256(body.encode("utf-8")).hexdigest() | |
| 35 | + return f"h:sha256:{digest}" | |
| 36 | + | |
| 13 | 37 | |
| 14 | 38 | def build_text_cache_key(text: str, *, normalize: bool) -> str: |
| 15 | 39 | normalized_text = str(text or "").strip() |
| 16 | - return f"embed:norm{1 if normalize else 0}:{normalized_text}" | |
| 40 | + payload = _stable_body_for_cache_key(normalized_text) | |
| 41 | + return f"embed:norm{1 if normalize else 0}:{payload}" | |
| 17 | 42 | |
| 18 | 43 | |
| 19 | -def build_image_cache_key(url: str, *, normalize: bool) -> str: | |
| 44 | +def build_image_cache_key(url: str, *, normalize: bool, model_name: str) -> str: | |
| 45 | + """Logical cache key for CN-CLIP image vectors (the segment name is ``txt`` by project convention).""" | |
| 20 | 46 | normalized_url = str(url or "").strip() |
| 21 | - return f"embed:norm{1 if normalize else 0}:{normalized_url}" | |
| 47 | + payload = _stable_body_for_cache_key(normalized_url) | |
| 48 | + m = str(model_name or "").strip() or "unknown" | |
| 49 | + return f"embed:{m}:txt:norm{1 if normalize else 0}:{payload}" | |
| 22 | 50 | |
| 23 | 51 | |
| 24 | -def build_clip_text_cache_key(text: str, *, normalize: bool) -> str: | |
| 25 | - """CN-CLIP / multimodal text (same vector space as /embed/image).""" | |
| 52 | +def build_clip_text_cache_key(text: str, *, normalize: bool, model_name: str) -> str: | |
| 53 | + """Logical cache key for the CN-CLIP text tower (same space as images; the segment name is ``img`` by project convention).""" | |
| 26 | 54 | normalized_text = str(text or "").strip() |
| 27 | - return f"clip_mm:norm{1 if normalize else 0}:{normalized_text}" | |
| 55 | + payload = _stable_body_for_cache_key(normalized_text) | |
| 56 | + m = str(model_name or "").strip() or "unknown" | |
| 57 | + return f"embed:{m}:img:norm{1 if normalize else 0}:{payload}" | ... | ... |
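To make the digest rule in this file concrete, here is a minimal standalone sketch. `stable_body` and `image_key` are illustrative stand-ins written for this example, not imports from the repository; they are meant to mirror `_stable_body_for_cache_key` and `build_image_cache_key` as they appear in the diff, and the URLs and the `CN-CLIP/ViT-L-14` model name are made-up sample values.

```python
import hashlib

# Assumed to match CACHE_KEY_RAW_BODY_MAX_CHARS in the diff.
MAX_CHARS = 256


def stable_body(body: str, max_chars: int = MAX_CHARS) -> str:
    """Return body unchanged if short enough, else an 'h:sha256:<hex>' digest."""
    if len(body) <= max_chars:  # len() counts Unicode code points, not bytes
        return body
    return "h:sha256:" + hashlib.sha256(body.encode("utf-8")).hexdigest()


def image_key(url: str, *, normalize: bool, model_name: str) -> str:
    """Illustrative stand-in for build_image_cache_key."""
    model = str(model_name or "").strip() or "unknown"
    payload = stable_body(str(url or "").strip())
    return f"embed:{model}:txt:norm{1 if normalize else 0}:{payload}"


short_url = "https://example.com/a.jpg"
long_url = "https://example.com/" + "y" * 300  # 320 characters, over the limit

short_key = image_key(short_url, normalize=True, model_name="CN-CLIP/ViT-H-14")
long_key = image_key(long_url, normalize=True, model_name="CN-CLIP/ViT-H-14")
other_model_key = image_key(short_url, normalize=True, model_name="CN-CLIP/ViT-L-14")

# 256 CJK characters are 768 UTF-8 bytes but still count as 256 code points.
cjk_body = "词" * MAX_CHARS
cjk_payload = stable_body(cjk_body)

print(short_key)
print(long_key)
print(short_key != other_model_key)
```

The short URL stays readable in the key, the long one collapses to a fixed-length digest, and the same URL under a different model name yields a different key.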
embeddings/config.py
| ... | ... | @@ -37,6 +37,11 @@ class EmbeddingConfig(object): |
| 37 | 37 | self.CLIP_AS_SERVICE_MODEL_NAME = str(image_backend.get("model_name") or "CN-CLIP/ViT-H-14") |
| 38 | 38 | |
| 39 | 39 | self.IMAGE_MODEL_NAME = str(image_backend.get("model_name") or "ViT-H-14") |
| 40 | + # Redis multimodal cache keys (image + clip_text) include this string; change model → new key space. | |
| 41 | + self.MULTIMODAL_MODEL_NAME = str( | |
| 42 | + image_backend.get("model_name") | |
| 43 | + or ("CN-CLIP/ViT-H-14" if self.USE_CLIP_AS_SERVICE else "ViT-H-14") | |
| 44 | + ) | |
| 40 | 45 | self.IMAGE_DEVICE = image_backend.get("device") # type: Optional[str] |
| 41 | 46 | self.IMAGE_BATCH_SIZE = int(image_backend.get("batch_size", 8)) |
| 42 | 47 | self.IMAGE_NORMALIZE_EMBEDDINGS = bool(image_backend.get("normalize_embeddings", True)) | ... | ... |
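The fallback order for `MULTIMODAL_MODEL_NAME` can be sketched in isolation as follows. `resolve_multimodal_model_name` is a hypothetical function written for this illustration; in the diff the logic lives inline in `EmbeddingConfig`, and `CN-CLIP/ViT-L-14` is only a sample configured value.

```python
from typing import Any, Mapping


def resolve_multimodal_model_name(
    image_backend: Mapping[str, Any], use_clip_as_service: bool
) -> str:
    """Prefer the configured model_name; otherwise pick a default by backend type."""
    default = "CN-CLIP/ViT-H-14" if use_clip_as_service else "ViT-H-14"
    # A missing or empty model_name falls through to the default.
    return str(image_backend.get("model_name") or default)


configured = resolve_multimodal_model_name({"model_name": "CN-CLIP/ViT-L-14"}, True)
service_default = resolve_multimodal_model_name({}, True)
local_default = resolve_multimodal_model_name({"model_name": ""}, False)

print(configured, service_default, local_default)
```

Since this string is embedded in every multimodal cache key, changing the configured `model_name` is enough to start a fresh key space.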
embeddings/image_encoder.py
| ... | ... | @@ -12,6 +12,7 @@ logger = logging.getLogger(__name__) |
| 12 | 12 | from config.loader import get_app_config |
| 13 | 13 | from config.services_config import get_embedding_image_backend_config, get_embedding_image_base_url |
| 14 | 14 | from embeddings.cache_keys import build_clip_text_cache_key, build_image_cache_key |
| 15 | +from embeddings.config import CONFIG | |
| 15 | 16 | from embeddings.redis_embedding_cache import RedisEmbeddingCache |
| 16 | 17 | from request_log_context import build_downstream_request_headers, build_request_log_extra |
| 17 | 18 | |
| ... | ... | @@ -31,6 +32,7 @@ class CLIPImageEncoder: |
| 31 | 32 | self.clip_text_endpoint = f"{self.service_url}/embed/clip_text" |
| 32 | 33 | # Reuse embedding cache prefix, but separate namespace for images to avoid collisions. |
| 33 | 34 | self.cache_prefix = str(redis_config.embedding_cache_prefix).strip() or "embedding" |
| 35 | + self._mm_model_name = CONFIG.MULTIMODAL_MODEL_NAME | |
| 34 | 36 | logger.info("Creating CLIPImageEncoder instance with service URL: %s", self.service_url) |
| 35 | 37 | self.cache = RedisEmbeddingCache( |
| 36 | 38 | key_prefix=self.cache_prefix, |
| ... | ... | @@ -171,7 +173,9 @@ class CLIPImageEncoder: |
| 171 | 173 | """ |
| 172 | 174 | CN-CLIP text tower (same vector space as ``/embed/image``); corresponds to the server-side ``POST /embed/clip_text``. |
| 173 | 175 | """ |
| 174 | - cache_key = build_clip_text_cache_key(text, normalize=normalize_embeddings) | |
| 176 | + cache_key = build_clip_text_cache_key( | |
| 177 | + text, normalize=normalize_embeddings, model_name=self._mm_model_name | |
| 178 | + ) | |
| 175 | 179 | cached = self._clip_text_cache.get(cache_key) |
| 176 | 180 | if cached is not None: |
| 177 | 181 | return cached |
| ... | ... | @@ -216,7 +220,9 @@ class CLIPImageEncoder: |
| 216 | 220 | Returns: |
| 217 | 221 | Embedding vector |
| 218 | 222 | """ |
| 219 | - cache_key = build_image_cache_key(url, normalize=normalize_embeddings) | |
| 223 | + cache_key = build_image_cache_key( | |
| 224 | + url, normalize=normalize_embeddings, model_name=self._mm_model_name | |
| 225 | + ) | |
| 220 | 226 | cached = self.cache.get(cache_key) |
| 221 | 227 | if cached is not None: |
| 222 | 228 | return cached |
| ... | ... | @@ -267,7 +273,9 @@ class CLIPImageEncoder: |
| 267 | 273 | |
| 268 | 274 | normalized_urls = [str(u).strip() for u in images] # type: ignore[list-item] |
| 269 | 275 | for pos, url in enumerate(normalized_urls): |
| 270 | - cache_key = build_image_cache_key(url, normalize=normalize_embeddings) | |
| 276 | + cache_key = build_image_cache_key( | |
| 277 | + url, normalize=normalize_embeddings, model_name=self._mm_model_name | |
| 278 | + ) | |
| 271 | 279 | cached = self.cache.get(cache_key) |
| 272 | 280 | if cached is not None: |
| 273 | 281 | results.append(cached) |
| ... | ... | @@ -297,7 +305,12 @@ class CLIPImageEncoder: |
| 297 | 305 | vec = np.array(embedding, dtype=np.float32) |
| 298 | 306 | if vec.ndim != 1 or vec.size == 0 or not np.isfinite(vec).all(): |
| 299 | 307 | raise RuntimeError(f"Invalid image embedding returned for URL: {url}") |
| 300 | - self.cache.set(build_image_cache_key(url, normalize=normalize_embeddings), vec) | |
| 308 | + self.cache.set( | |
| 309 | + build_image_cache_key( | |
| 310 | + url, normalize=normalize_embeddings, model_name=self._mm_model_name | |
| 311 | + ), | |
| 312 | + vec, | |
| 313 | + ) | |
| 301 | 314 | pos = pending_positions[i + j] |
| 302 | 315 | results[pos] = vec |
| 303 | 316 | ... | ... |
embeddings/server.py
| ... | ... | @@ -23,7 +23,11 @@ from fastapi.concurrency import run_in_threadpool |
| 23 | 23 | |
| 24 | 24 | from config.env_config import REDIS_CONFIG |
| 25 | 25 | from config.services_config import get_embedding_backend_config |
| 26 | -from embeddings.cache_keys import build_clip_text_cache_key, build_image_cache_key, build_text_cache_key | |
| 26 | +from embeddings.cache_keys import ( | |
| 27 | + build_clip_text_cache_key as _mm_clip_text_cache_key, | |
| 28 | + build_image_cache_key as _mm_image_cache_key, | |
| 29 | + build_text_cache_key, | |
| 30 | +) | |
| 27 | 31 | from embeddings.config import CONFIG |
| 28 | 32 | from embeddings.protocols import ImageEncoderProtocol |
| 29 | 33 | from embeddings.redis_embedding_cache import RedisEmbeddingCache |
| ... | ... | @@ -763,10 +767,14 @@ def _try_full_image_lane_cache_hit( |
| 763 | 767 | out: List[Optional[List[float]]] = [] |
| 764 | 768 | for item in items: |
| 765 | 769 | if lane == "image": |
| 766 | - ck = build_image_cache_key(item, normalize=effective_normalize) | |
| 770 | + ck = _mm_image_cache_key( | |
| 771 | + item, normalize=effective_normalize, model_name=CONFIG.MULTIMODAL_MODEL_NAME | |
| 772 | + ) | |
| 767 | 773 | cached = _image_cache.get(ck) |
| 768 | 774 | else: |
| 769 | - ck = build_clip_text_cache_key(item, normalize=effective_normalize) | |
| 775 | + ck = _mm_clip_text_cache_key( | |
| 776 | + item, normalize=effective_normalize, model_name=CONFIG.MULTIMODAL_MODEL_NAME | |
| 777 | + ) | |
| 770 | 778 | cached = _clip_text_cache.get(ck) |
| 771 | 779 | if cached is None: |
| 772 | 780 | return None |
| ... | ... | @@ -801,10 +809,14 @@ def _embed_image_lane_impl( |
| 801 | 809 | cache_hits = 0 |
| 802 | 810 | for idx, item in enumerate(items): |
| 803 | 811 | if lane == "image": |
| 804 | - ck = build_image_cache_key(item, normalize=effective_normalize) | |
| 812 | + ck = _mm_image_cache_key( | |
| 813 | + item, normalize=effective_normalize, model_name=CONFIG.MULTIMODAL_MODEL_NAME | |
| 814 | + ) | |
| 805 | 815 | cached = _image_cache.get(ck) |
| 806 | 816 | else: |
| 807 | - ck = build_clip_text_cache_key(item, normalize=effective_normalize) | |
| 817 | + ck = _mm_clip_text_cache_key( | |
| 818 | + item, normalize=effective_normalize, model_name=CONFIG.MULTIMODAL_MODEL_NAME | |
| 819 | + ) | |
| 808 | 820 | cached = _clip_text_cache.get(ck) |
| 809 | 821 | if cached is not None: |
| 810 | 822 | vec = _as_list(cached, normalize=False) |
| ... | ... | @@ -1497,3 +1509,17 @@ async def embed_clip_text( |
| 1497 | 1509 | priority=priority, |
| 1498 | 1510 | preview_chars=_LOG_TEXT_PREVIEW_CHARS, |
| 1499 | 1511 | ) |
| 1512 | + | |
| 1513 | + | |
| 1514 | +def build_image_cache_key(url: str, *, normalize: bool, model_name: Optional[str] = None) -> str: | |
| 1515 | + """Tests/tools: same key as ``/embed/image`` lane; defaults to ``CONFIG.MULTIMODAL_MODEL_NAME``.""" | |
| 1516 | + return _mm_image_cache_key( | |
| 1517 | + url, normalize=normalize, model_name=model_name or CONFIG.MULTIMODAL_MODEL_NAME | |
| 1518 | + ) | |
| 1519 | + | |
| 1520 | + | |
| 1521 | +def build_clip_text_cache_key(text: str, *, normalize: bool, model_name: Optional[str] = None) -> str: | |
| 1522 | + """Tests/tools: same key as ``/embed/clip_text`` lane; defaults to ``CONFIG.MULTIMODAL_MODEL_NAME``.""" | |
| 1523 | + return _mm_clip_text_cache_key( | |
| 1524 | + text, normalize=normalize, model_name=model_name or CONFIG.MULTIMODAL_MODEL_NAME | |
| 1525 | + ) | ... | ... |
scripts/create_tenant_index.sh
| ... | ... | @@ -0,0 +1,47 @@ |
| 1 | +"""Unit tests for embeddings/cache_keys.py (hashing long bodies).""" | |
| 2 | + | |
| 3 | +import hashlib | |
| 4 | + | |
| 5 | +from embeddings import cache_keys as ck | |
| 6 | + | |
| 7 | + | |
| 8 | +def test_stable_body_short_unchanged(): | |
| 9 | + s = "a" * ck.CACHE_KEY_RAW_BODY_MAX_CHARS | |
| 10 | + assert ck._stable_body_for_cache_key(s) == s | |
| 11 | + | |
| 12 | + | |
| 13 | +def test_stable_body_long_hashes(): | |
| 14 | + s = "a" * (ck.CACHE_KEY_RAW_BODY_MAX_CHARS + 1) | |
| 15 | + out = ck._stable_body_for_cache_key(s) | |
| 16 | + assert out == "h:sha256:" + hashlib.sha256(s.encode("utf-8")).hexdigest() | |
| 17 | + assert out.startswith("h:sha256:") | |
| 18 | + assert len(out) == len("h:sha256:") + 64 | |
| 19 | + | |
| 20 | + | |
| 21 | +def test_stable_body_utf8_counts_unicode_codepoints(): | |
| 22 | + # 2 codepoints, not 6 bytes — still short | |
| 23 | + s = "你好" | |
| 24 | + assert ck._stable_body_for_cache_key(s) == s | |
| 25 | + | |
| 26 | + | |
| 27 | +def test_build_text_cache_key_uses_digest_when_long(): | |
| 28 | + # Default max 256: 257 'x' -> digest | |
| 29 | + long_text = "x" * (ck.CACHE_KEY_RAW_BODY_MAX_CHARS + 1) | |
| 30 | + key = ck.build_text_cache_key(long_text, normalize=True) | |
| 31 | + assert key.startswith("embed:norm1:h:sha256:") | |
| 32 | + digest = hashlib.sha256(long_text.encode("utf-8")).hexdigest() | |
| 33 | + assert key == f"embed:norm1:h:sha256:{digest}" | |
| 34 | + | |
| 35 | + | |
| 36 | +def test_build_image_cache_key_uses_digest_when_long(): | |
| 37 | + url = "https://x.example/" + "y" * ck.CACHE_KEY_RAW_BODY_MAX_CHARS | |
| 38 | + key = ck.build_image_cache_key(url, normalize=True, model_name="CN-CLIP/ViT-H-14") | |
| 39 | + digest = hashlib.sha256(url.encode("utf-8")).hexdigest() | |
| 40 | + assert key == f"embed:CN-CLIP/ViT-H-14:txt:norm1:h:sha256:{digest}" | |
| 41 | + | |
| 42 | + | |
| 43 | +def test_build_clip_text_cache_key_uses_digest_when_long(): | |
| 44 | + t = "词" * (ck.CACHE_KEY_RAW_BODY_MAX_CHARS + 1) | |
| 45 | + key = ck.build_clip_text_cache_key(t, normalize=False, model_name="m") | |
| 46 | + digest = hashlib.sha256(t.encode("utf-8")).hexdigest() | |
| 47 | + assert key == f"embed:m:img:norm0:h:sha256:{digest}" | ... | ... |
tests/test_embedding_pipeline.py
| ... | ... | @@ -16,6 +16,7 @@ from embeddings.image_encoder import CLIPImageEncoder |
| 16 | 16 | from embeddings.text_embedding_tei import TEITextModel |
| 17 | 17 | from embeddings.bf16 import encode_embedding_for_redis |
| 18 | 18 | from embeddings.cache_keys import build_image_cache_key, build_text_cache_key |
| 19 | +from embeddings.config import CONFIG | |
| 19 | 20 | from query import QueryParser |
| 20 | 21 | from context.request_context import create_request_context, set_current_request_context, clear_current_request_context |
| 21 | 22 | |
| ... | ... | @@ -207,7 +208,9 @@ def test_image_embedding_encoder_cache_hit(monkeypatch): |
| 207 | 208 | fake_cache = _FakeEmbeddingCache() |
| 208 | 209 | cached = np.array([0.5, 0.6], dtype=np.float32) |
| 209 | 210 | url = "https://example.com/a.jpg" |
| 210 | - fake_cache.store[build_image_cache_key(url, normalize=True)] = cached | |
| 211 | + fake_cache.store[ | |
| 212 | + build_image_cache_key(url, normalize=True, model_name=CONFIG.MULTIMODAL_MODEL_NAME) | |
| 213 | + ] = cached | |
| 211 | 214 | monkeypatch.setattr("embeddings.image_encoder.RedisEmbeddingCache", lambda **kwargs: fake_cache) |
| 212 | 215 | |
| 213 | 216 | calls = {"count": 0} | ... | ... |