Commit b754fd41470f2a1dab70339d336383de0cd8af1c

Authored by tangwang
1 parent 16204531

Image embedding: support a priority parameter

config/config.yaml
@@ -110,7 +110,7 @@ rerank:
 services:
   translation:
     service_url: "http://127.0.0.1:6006"
-    default_model: "llm"
+    default_model: "nllb-200-distilled-600m"
     default_scene: "general"
     timeout_sec: 10.0
 cache:
 
 
 
-First read the code related to image and text embedding:
-@embeddings/README.md @embeddings/server.py @docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md @embeddings/image_encoder.py @embeddings/text_encoder.py
+First read the code related to text embedding:
+@embeddings/README.md @embeddings/server.py @docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md @embeddings/text_encoder.py
 There are currently TEXT_MAX_INFLIGHT / IMAGE_MAX_INFLIGHT admission limits; requests over the limit get an overload status code.
 
-The embedding service (both image and text embedding) should support a priority query parameter. priority > 0: not counted toward the inflight above, and never rejected by admission control;
+The text embedding service should support a priority query parameter. priority > 0: not counted toward the inflight above, and never rejected by admission control (image embedding does not need this, since only offline jobs use image embedding)
 priority == 0 (default, suited to offline jobs such as indexing): still goes through the existing TEXT_MAX_INFLIGHT / IMAGE_MAX_INFLIGHT admission; over the limit an overload status code is returned.
 priority > 0 (or == 1) (suited to online requests): never rejected by admission control, but still occupies inflight; this keeps online requests from being throttled, and when online traffic is heavy the service can reject offline requests.
 
@@ -16,7 +16,6 @@ priority > 0 (or == 1) (suited to online requests): never rejected by
 
 
 
-
 Refactoring of the configuration system.
 
 Referring to @docs/config-system-review-and-redesign.md , most of the modifications have been completed. Could you conduct a review to check what else needs improvement in the configuration documentation system? Are there any outstanding issues?
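The admission rule requested in the prompt file above can be condensed into a tiny sketch. This is illustrative only, not the service's actual API: `admit`, `active`, and `limit` are names we chose here.

```python
# Minimal sketch of the requested admission rule (illustrative names).
# priority == 0: rejected once the active count reaches the limit (offline traffic).
# priority > 0: always admitted, but still counted toward the active total.

def admit(active: int, limit: int, priority: int) -> bool:
    """Return True if the request may enter; the caller then increments `active`."""
    if priority > 0:
        return True
    return active < limit
```

This captures the asymmetry: high-priority requests bypass the check but still occupy capacity, so heavy online traffic naturally starves out `priority=0` requests at the admission gate.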
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
@@ -38,6 +38,10 @@
 - `TEXT_MAX_INFLIGHT`
 - `IMAGE_MAX_INFLIGHT`
 - When capacity is exceeded, the service returns an overload error directly instead of queueing without bound.
+- Both the text and image services accept a `priority` query parameter (the image service does no queue-jumping; only its admission rule matches the text service's):
+  - `priority=0` (default): suited to offline indexing; still subject to `TEXT_MAX_INFLIGHT` / `IMAGE_MAX_INFLIGHT` admission control respectively.
+  - `priority>0` (online requests should use `1`): never rejected by admission control, but still counted toward the corresponding text/image inflight.
+  - The text service processes high-priority text requests first; the image service does no queue-jumping and handles requests in arrival order.
 - `GET /health` returns each service's `limits`, `stats`, `cache_enabled` and other state; `GET /ready` is used as the readiness probe.
 
 #### 7.1.1 `POST /embed/text` — text embedding
@@ -59,11 +63,15 @@
 **Full curl example**:
 
 ```bash
-curl -X POST "http://localhost:6005/embed/text?normalize=true" \
+curl -X POST "http://localhost:6005/embed/text?normalize=true&priority=1" \
   -H "Content-Type: application/json" \
   -d '["芭比娃娃 儿童玩具", "纯棉T恤 短袖"]'
 ```
 
+Notes:
+- Online query / real-time requests: pass `priority=1` explicitly.
+- Offline indexing / batch backfill: keep the default `priority=0`.
+
 #### 7.1.2 `POST /embed/image` — image embedding
 
 Converts image URLs or paths to vectors, for image-to-image search.
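The curl call above maps directly onto a small Python helper. A sketch only: the endpoint, port, and query parameters come from the doc, while the function name and return shape are ours.

```python
from typing import Dict, List, Tuple

# Mirrors the curl example above: POST /embed/text?normalize=true&priority=1
# with a JSON array body. Helper name and structure are illustrative.

def build_embed_text_request(
    texts: List[str],
    priority: int = 1,
    normalize: bool = True,
    base_url: str = "http://localhost:6005",
) -> Tuple[str, Dict[str, str], List[str]]:
    """Return (url, query params, JSON body) for a text-embedding call."""
    params = {
        "normalize": "true" if normalize else "false",
        "priority": str(max(0, int(priority))),
    }
    return f"{base_url}/embed/text", params, texts

# To actually send it (service must be running):
#   requests.post(url, params=params, json=body, timeout=10)
```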
@@ -85,11 +93,13 @@ curl -X POST "http://localhost:6005/embed/text?normalize=true" \
 **Full curl example**:
 
 ```bash
-curl -X POST "http://localhost:6008/embed/image?normalize=true" \
+curl -X POST "http://localhost:6008/embed/image?normalize=true&priority=1" \
   -H "Content-Type: application/json" \
   -d '["https://oss.essa.cn/98532128-cf8e-456c-9e30-6f2a5ea0c19f.jpg"]'
 ```
 
+Real-time scenarios such as online image-to-image search may pass `priority=1`; offline index backfill should keep the default `priority=0`.
+
 #### 7.1.3 `GET /health` — health check
 
 ```bash
@@ -118,6 +128,8 @@ curl "http://localhost:6008/ready"
 - The cache key distinguishes `normalize=true/false`, so different normalization strategies never hit the same cache entry.
 - When the server detects a **full-cache-hit** request, it returns directly without taking a model concurrency slot.
 - When the server detects that `TEXT_MAX_INFLIGHT` / `IMAGE_MAX_INFLIGHT` is exceeded, it rejects directly instead of queueing without bound.
+- For `POST /embed/text`, `priority=0` is rejected outright under the inflight rule above; `priority>0` is never rejected by admission, still counts toward inflight, and is served ahead of `priority=0` requests when the server queues.
+- For `POST /embed/image`, `priority=0` is bounded by `IMAGE_MAX_INFLIGHT`; `priority>0` is never rejected by admission but still counts toward inflight (no queue-jumping).
 
 #### 7.1.6 Unified TEI tuning recommendations (main service)
 
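The three server-side rules above (cache hit, admission, priority bypass) can be sketched as one decision function. Names and return labels are illustrative, not the server's actual code paths.

```python
# Sketch of the documented decision order (illustrative labels):
# 1) full cache hit      -> answer without taking a model slot;
# 2) priority 0 at limit -> overload error;
# 3) otherwise           -> take an inflight slot and process.

def decide(full_cache_hit: bool, active: int, limit: int, priority: int) -> str:
    if full_cache_hit:
        return "serve-from-cache"
    if priority == 0 and active >= limit:
        return "overload"
    return "acquire-slot"
```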
embeddings/README.md
@@ -30,13 +30,13 @@
 - Text service (default `6005`)
   - `POST /embed/text`
     - Request body: `["text1", "text2", ...]`
-    - Optional query parameters: `normalize=true|false`
+    - Optional query parameters: `normalize=true|false`, `priority=0|1`
     - Returns: `[[...], [...], ...]`
   - Health endpoints: `GET /health`, `GET /ready`
 - Image service (default `6008`)
   - `POST /embed/image`
     - Request body: `["url-or-local-path-1", ...]`
-    - Optional query parameters: `normalize=true|false`
+    - Optional query parameters: `normalize=true|false`, `priority=0|1`
     - Returns: `[[...], [...], ...]`
   - Health endpoints: `GET /health`, `GET /ready`
 
@@ -61,6 +61,11 @@
 - The image service can be configured more strictly than the text service.
 - A full-cache-hit request is answered directly on the server without taking a model concurrency slot.
 - Rejecting outright when over capacity is more stable than queueing without bound.
+- The text service supports `priority`:
+  - `priority=0` (default, suited to offline indexing) is still bounded by `TEXT_MAX_INFLIGHT`; over the limit it returns overload directly.
+  - `priority>0` (online queries should use `1`) is never rejected by admission control, but still counts toward inflight.
+  - Internally the text service schedules with two queues and consumes high-priority requests first, so online requests never sit behind offline batch jobs for long.
+- The image service also supports `priority` (same semantics as text, counted against `IMAGE_MAX_INFLIGHT`; no queue-jumping, only the admission behavior matches the text service).
 
 ### Image vectors: clip-as-service (recommended)
 
@@ -86,6 +91,14 @@
 - `CLIP_AS_SERVICE_MODEL_NAME=CN-CLIP/ViT-L-14`
 - `scripts/start_cnclip_service.sh` reads the same `CLIP_AS_SERVICE_MODEL_NAME` by default, and can be temporarily overridden with `CNCLIP_MODEL_NAME` or `--model-name`
 
+### Performance and load testing (reusing repo scripts)
+
+- API-level load testing (same methodology as `perf_reports/2026-03-12/matrix_report/` etc.): `scripts/perf_api_benchmark.py`
+  - Example: `python scripts/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
+  - Text/image embedding requests can carry `priority` (same semantics as online admission): `--embed-text-priority 1`, `--embed-image-priority 1`
+  - Custom request templates: `--cases-file scripts/perf_cases.json.example`
+- Historical matrix results and notes: `perf_reports/2026-03-12/matrix_report/summary.md`.
+
 ### Starting the services
 
 Start with the repo scripts:
embeddings/image_encoder.py
@@ -35,7 +35,12 @@ class CLIPImageEncoder:
             namespace="image",
         )
 
-    def _call_service(self, request_data: List[str], normalize_embeddings: bool = True) -> List[Any]:
+    def _call_service(
+        self,
+        request_data: List[str],
+        normalize_embeddings: bool = True,
+        priority: int = 0,
+    ) -> List[Any]:
         """
         Call the embedding service API.
 
@@ -48,7 +53,10 @@ class CLIPImageEncoder:
         try:
             response = requests.post(
                 self.endpoint,
-                params={"normalize": "true" if normalize_embeddings else "false"},
+                params={
+                    "normalize": "true" if normalize_embeddings else "false",
+                    "priority": max(0, int(priority)),
+                },
                 json=request_data,
                 timeout=60
             )
@@ -66,7 +74,12 @@ class CLIPImageEncoder:
         """
         raise NotImplementedError("encode_image with PIL Image is not supported by embedding service")
 
-    def encode_image_from_url(self, url: str, normalize_embeddings: bool = True) -> np.ndarray:
+    def encode_image_from_url(
+        self,
+        url: str,
+        normalize_embeddings: bool = True,
+        priority: int = 0,
+    ) -> np.ndarray:
         """
         Generate image embedding via network service using URL.
 
@@ -81,7 +94,11 @@ class CLIPImageEncoder:
         if cached is not None:
             return cached
 
-        response_data = self._call_service([url], normalize_embeddings=normalize_embeddings)
+        response_data = self._call_service(
+            [url],
+            normalize_embeddings=normalize_embeddings,
+            priority=priority,
+        )
         if not response_data or len(response_data) != 1 or response_data[0] is None:
             raise RuntimeError(f"No image embedding returned for URL: {url}")
         vec = np.array(response_data[0], dtype=np.float32)
@@ -95,6 +112,7 @@ class CLIPImageEncoder:
         images: List[Union[str, Image.Image]],
         batch_size: int = 8,
         normalize_embeddings: bool = True,
+        priority: int = 0,
     ) -> List[np.ndarray]:
         """
         Encode a batch of images efficiently via network service.
@@ -129,7 +147,11 @@ class CLIPImageEncoder:
 
         for i in range(0, len(pending_urls), batch_size):
             batch_urls = pending_urls[i : i + batch_size]
-            response_data = self._call_service(batch_urls, normalize_embeddings=normalize_embeddings)
+            response_data = self._call_service(
+                batch_urls,
+                normalize_embeddings=normalize_embeddings,
+                priority=priority,
+            )
             if not response_data or len(response_data) != len(batch_urls):
                 raise RuntimeError(
                     f"Image embedding response length mismatch: expected {len(batch_urls)}, "
@@ -153,6 +175,7 @@ class CLIPImageEncoder:
         urls: List[str],
         batch_size: Optional[int] = None,
         normalize_embeddings: bool = True,
+        priority: int = 0,
     ) -> List[np.ndarray]:
         """
         Same interface as ClipImageModel / ClipAsServiceImageEncoder, called by the indexer's document_transformer.
@@ -168,4 +191,5 @@ class CLIPImageEncoder:
             urls,
             batch_size=batch_size or 8,
             normalize_embeddings=normalize_embeddings,
+            priority=priority,
        )
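To see how the new `priority` parameter threads through the encoder, here is a runnable stand-in that mirrors the changed signatures. It is not the real `CLIPImageEncoder`: the service call is stubbed out, and only the parameter flow matches the diff above.

```python
from typing import List

# Stand-in mirroring the signatures changed in this commit (illustrative,
# not the real CLIPImageEncoder): `priority` defaults to 0 and is forwarded
# unchanged from the batch entry point down to each service call.

class EncoderSketch:
    def __init__(self) -> None:
        self.sent_priorities: List[int] = []  # records what each call sent

    def _call_service(self, urls: List[str], normalize_embeddings: bool = True,
                      priority: int = 0) -> List[List[float]]:
        self.sent_priorities.append(max(0, int(priority)))
        return [[0.0] for _ in urls]  # dummy vectors instead of a real HTTP call

    def encode_images_batch(self, urls: List[str], batch_size: int = 8,
                            normalize_embeddings: bool = True,
                            priority: int = 0) -> List[List[float]]:
        out: List[List[float]] = []
        for i in range(0, len(urls), batch_size):
            out.extend(self._call_service(urls[i:i + batch_size],
                                          normalize_embeddings, priority))
        return out
```

The point of the real change is the same: indexer entry points pick a priority once, and every underlying HTTP request carries it.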
embeddings/server.py
@@ -206,23 +206,24 @@ class _InflightLimiter:
     def __init__(self, name: str, limit: int):
         self.name = name
         self.limit = max(1, int(limit))
-        self._sem = threading.BoundedSemaphore(self.limit)
         self._lock = threading.Lock()
         self._active = 0
         self._rejected = 0
         self._completed = 0
         self._failed = 0
         self._max_active = 0
+        self._priority_bypass_total = 0
 
-    def try_acquire(self) -> tuple[bool, int]:
-        if not self._sem.acquire(blocking=False):
-            with self._lock:
+    def try_acquire(self, *, bypass_limit: bool = False) -> tuple[bool, int]:
+        with self._lock:
+            if not bypass_limit and self._active >= self.limit:
                 self._rejected += 1
                 active = self._active
-            return False, active
-        with self._lock:
+                return False, active
             self._active += 1
             self._max_active = max(self._max_active, self._active)
+            if bypass_limit:
+                self._priority_bypass_total += 1
             active = self._active
             return True, active
@@ -234,7 +235,6 @@ class _InflightLimiter:
         else:
             self._failed += 1
         active = self._active
-        self._sem.release()
         return active
 
     def snapshot(self) -> Dict[str, int]:
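The reworked limiter above drops the bounded semaphore for a lock-guarded counter, so `bypass_limit=True` can skip the rejection check while still incrementing the active count. A minimal standalone re-creation of that pattern (class name and fields are ours):

```python
import threading

# Minimal re-creation of the lock-based limiter pattern above (illustrative):
# bypass_limit=True skips the rejection check but still counts the request,
# which is exactly what lets priority traffic crowd out priority-0 traffic.

class InflightLimiterSketch:
    def __init__(self, limit: int) -> None:
        self.limit = max(1, int(limit))
        self._lock = threading.Lock()
        self.active = 0
        self.rejected = 0

    def try_acquire(self, bypass_limit: bool = False) -> bool:
        with self._lock:
            if not bypass_limit and self.active >= self.limit:
                self.rejected += 1
                return False
            self.active += 1
            return True

    def release(self) -> None:
        with self._lock:
            self.active -= 1
```

Because bypassed requests still increment `active`, a burst of high-priority traffic automatically pushes the counter over `limit` and rejects subsequent priority-0 requests.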
@@ -246,9 +246,157 @@
             "completed_total": self._completed,
             "failed_total": self._failed,
             "max_active": self._max_active,
+            "priority_bypass_total": self._priority_bypass_total,
         }
 
 
+def _effective_priority(priority: int) -> int:
+    return 1 if int(priority) > 0 else 0
+
+
+def _priority_label(priority: int) -> str:
+    return "high" if _effective_priority(priority) > 0 else "normal"
+
+
+@dataclass
+class _TextDispatchTask:
+    normalized: List[str]
+    effective_normalize: bool
+    request_id: str
+    priority: int
+    created_at: float
+    done: threading.Event
+    result: Optional[_EmbedResult] = None
+    error: Optional[Exception] = None
+
+
+_text_dispatch_high_queue: "deque[_TextDispatchTask]" = deque()
+_text_dispatch_normal_queue: "deque[_TextDispatchTask]" = deque()
+_text_dispatch_cv = threading.Condition()
+_text_dispatch_workers: List[threading.Thread] = []
+_text_dispatch_worker_stop = False
+_text_dispatch_worker_count = 0
+
+
+def _text_dispatch_queue_depth() -> Dict[str, int]:
+    with _text_dispatch_cv:
+        return {
+            "high": len(_text_dispatch_high_queue),
+            "normal": len(_text_dispatch_normal_queue),
+            "total": len(_text_dispatch_high_queue) + len(_text_dispatch_normal_queue),
+        }
+
+
+def _pop_text_dispatch_task_locked() -> Optional["_TextDispatchTask"]:
+    if _text_dispatch_high_queue:
+        return _text_dispatch_high_queue.popleft()
+    if _text_dispatch_normal_queue:
+        return _text_dispatch_normal_queue.popleft()
+    return None
+
+
+def _start_text_dispatch_workers() -> None:
+    global _text_dispatch_workers, _text_dispatch_worker_stop, _text_dispatch_worker_count
+    if _text_model is None:
+        return
+    target_worker_count = 1 if _text_backend_name == "local_st" else _TEXT_MAX_INFLIGHT
+    alive_workers = [worker for worker in _text_dispatch_workers if worker.is_alive()]
+    if len(alive_workers) == target_worker_count:
+        _text_dispatch_workers = alive_workers
+        _text_dispatch_worker_count = target_worker_count
+        return
+    _text_dispatch_worker_stop = False
+    _text_dispatch_worker_count = target_worker_count
+    _text_dispatch_workers = []
+    for idx in range(target_worker_count):
+        worker = threading.Thread(
+            target=_text_dispatch_worker_loop,
+            args=(idx,),
+            name=f"embed-text-dispatch-{idx}",
+            daemon=True,
+        )
+        worker.start()
+        _text_dispatch_workers.append(worker)
+    logger.info(
+        "Started text dispatch workers | backend=%s workers=%d",
+        _text_backend_name,
+        target_worker_count,
+    )
+
+
+def _stop_text_dispatch_workers() -> None:
+    global _text_dispatch_worker_stop
+    with _text_dispatch_cv:
+        _text_dispatch_worker_stop = True
+        _text_dispatch_cv.notify_all()
+
+
+def _text_dispatch_worker_loop(worker_idx: int) -> None:
+    while True:
+        with _text_dispatch_cv:
+            while (
+                not _text_dispatch_high_queue
+                and not _text_dispatch_normal_queue
+                and not _text_dispatch_worker_stop
+            ):
+                _text_dispatch_cv.wait()
+            if _text_dispatch_worker_stop:
+                return
+            task = _pop_text_dispatch_task_locked()
+        if task is None:
+            continue
+        try:
+            queue_wait_ms = (time.perf_counter() - task.created_at) * 1000.0
+            logger.info(
+                "text dispatch start | worker=%d priority=%s inputs=%d queue_wait_ms=%.2f",
+                worker_idx,
+                _priority_label(task.priority),
+                len(task.normalized),
+                queue_wait_ms,
+                extra=_request_log_extra(task.request_id),
+            )
+            task.result = _embed_text_impl(
+                task.normalized,
+                task.effective_normalize,
+                task.request_id,
+                task.priority,
+            )
+        except Exception as exc:
+            task.error = exc
+        finally:
+            task.done.set()
+
+
+def _submit_text_dispatch_and_wait(
+    normalized: List[str],
+    effective_normalize: bool,
+    request_id: str,
+    priority: int,
+) -> _EmbedResult:
+    if not any(worker.is_alive() for worker in _text_dispatch_workers):
+        _start_text_dispatch_workers()
+    task = _TextDispatchTask(
+        normalized=normalized,
+        effective_normalize=effective_normalize,
+        request_id=request_id,
+        priority=_effective_priority(priority),
+        created_at=time.perf_counter(),
+        done=threading.Event(),
+    )
+    with _text_dispatch_cv:
+        if task.priority > 0:
+            _text_dispatch_high_queue.append(task)
+        else:
+            _text_dispatch_normal_queue.append(task)
+        _text_dispatch_cv.notify()
+    task.done.wait()
+    if task.error is not None:
+        raise task.error
+    if task.result is None:
+        raise RuntimeError("Text dispatch worker returned empty result")
+    return task.result
+
+
 _text_request_limiter = _InflightLimiter(name="text", limit=_TEXT_MAX_INFLIGHT)
 _image_request_limiter = _InflightLimiter(name="image", limit=_IMAGE_MAX_INFLIGHT)
 _text_stats = _EndpointStats(name="text")
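The scheduling idea behind the two dispatch queues added above can be demonstrated in isolation: high-priority tasks always drain first, with FIFO order inside each queue. Queue and function names here are illustrative.

```python
from collections import deque

# Isolated demo of the two-queue scheduling above: the consumer always
# drains the high-priority queue before the normal one; each queue is FIFO.

high: deque = deque()
normal: deque = deque()

def submit(task: str, priority: int) -> None:
    (high if priority > 0 else normal).append(task)

def pop_next():
    if high:
        return high.popleft()
    if normal:
        return normal.popleft()
    return None

submit("offline-1", 0)
submit("online-1", 1)
submit("offline-2", 0)
submit("online-2", 1)
order = [pop_next() for _ in range(4)]  # online tasks drain before offline
```

This is why the commit can promise that online text requests are "served first" without any in-queue reordering: priority selection happens at pop time, not by reshuffling a single queue.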
@@ -261,6 +409,7 @@ _image_cache = RedisEmbeddingCache(key_prefix=_CACHE_PREFIX, namespace="image")
 class _SingleTextTask:
     text: str
     normalize: bool
+    priority: int
     created_at: float
     request_id: str
     done: threading.Event
@@ -268,12 +417,30 @@ class _SingleTextTask:
     error: Optional[Exception] = None
 
 
-_text_single_queue: "deque[_SingleTextTask]" = deque()
+_text_single_high_queue: "deque[_SingleTextTask]" = deque()
+_text_single_normal_queue: "deque[_SingleTextTask]" = deque()
 _text_single_queue_cv = threading.Condition()
 _text_batch_worker: Optional[threading.Thread] = None
 _text_batch_worker_stop = False
 
 
+def _text_microbatch_queue_depth() -> Dict[str, int]:
+    with _text_single_queue_cv:
+        return {
+            "high": len(_text_single_high_queue),
+            "normal": len(_text_single_normal_queue),
+            "total": len(_text_single_high_queue) + len(_text_single_normal_queue),
+        }
+
+
+def _pop_single_text_task_locked() -> Optional["_SingleTextTask"]:
+    if _text_single_high_queue:
+        return _text_single_high_queue.popleft()
+    if _text_single_normal_queue:
+        return _text_single_normal_queue.popleft()
+    return None
+
+
 def _compact_preview(text: str, max_chars: int) -> str:
     compact = " ".join((text or "").split())
     if len(compact) <= max_chars:
@@ -356,30 +523,41 @@ def _text_batch_worker_loop() -> None:
     max_batch = max(1, int(CONFIG.TEXT_BATCH_SIZE))
     while True:
         with _text_single_queue_cv:
-            while not _text_single_queue and not _text_batch_worker_stop:
+            while (
+                not _text_single_high_queue
+                and not _text_single_normal_queue
+                and not _text_batch_worker_stop
+            ):
                 _text_single_queue_cv.wait()
             if _text_batch_worker_stop:
                 return
 
-            batch: List[_SingleTextTask] = [_text_single_queue.popleft()]
+            first_task = _pop_single_text_task_locked()
+            if first_task is None:
+                continue
+            batch: List[_SingleTextTask] = [first_task]
             deadline = time.perf_counter() + _TEXT_MICROBATCH_WINDOW_SEC
 
             while len(batch) < max_batch:
                 remaining = deadline - time.perf_counter()
                 if remaining <= 0:
                     break
-                if not _text_single_queue:
+                if not _text_single_high_queue and not _text_single_normal_queue:
                     _text_single_queue_cv.wait(timeout=remaining)
                     continue
-                while _text_single_queue and len(batch) < max_batch:
-                    batch.append(_text_single_queue.popleft())
+                while len(batch) < max_batch:
+                    next_task = _pop_single_text_task_locked()
+                    if next_task is None:
+                        break
+                    batch.append(next_task)
 
         try:
             queue_wait_ms = [(time.perf_counter() - task.created_at) * 1000.0 for task in batch]
             reqids = [task.request_id for task in batch]
             logger.info(
-                "text microbatch dispatch | size=%d queue_wait_ms_min=%.2f queue_wait_ms_max=%.2f reqids=%s preview=%s",
+                "text microbatch dispatch | size=%d priority=%s queue_wait_ms_min=%.2f queue_wait_ms_max=%.2f reqids=%s preview=%s",
                 len(batch),
+                _priority_label(max(task.priority for task in batch)),
                 min(queue_wait_ms) if queue_wait_ms else 0.0,
                 max(queue_wait_ms) if queue_wait_ms else 0.0,
                 reqids,
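The micro-batch drain above can be sketched without threads. This stand-in keeps only the high-before-normal pop order and the batching window; the real worker waits on the condition variable instead of breaking when the queues run dry, and all names here are ours.

```python
from collections import deque
import time

# Non-threaded sketch of the micro-batch drain above (illustrative names):
# take one task, then keep pulling (high queue first) until the batch is
# full or the window expires.

def drain_batch(high: deque, normal: deque, max_batch: int, window_sec: float):
    def pop_one():
        if high:
            return high.popleft()
        if normal:
            return normal.popleft()
        return None

    first = pop_one()
    if first is None:
        return []
    batch = [first]
    deadline = time.perf_counter() + window_sec
    while len(batch) < max_batch and time.perf_counter() < deadline:
        nxt = pop_one()
        if nxt is None:
            break  # the real worker would cv.wait(timeout=remaining) here
        batch.append(nxt)
    return batch
```

The key property is that within one dispatch window, a waiting high-priority single-text request is always batched before any queued priority-0 request.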
@@ -423,22 +601,32 @@ def _text_batch_worker_loop() -> None:
                 task.done.set()
 
 
-def _encode_single_text_with_microbatch(text: str, normalize: bool, request_id: str) -> List[float]:
+def _encode_single_text_with_microbatch(
+    text: str,
+    normalize: bool,
+    request_id: str,
+    priority: int,
+) -> List[float]:
     task = _SingleTextTask(
         text=text,
         normalize=normalize,
+        priority=_effective_priority(priority),
         created_at=time.perf_counter(),
         request_id=request_id,
         done=threading.Event(),
     )
     with _text_single_queue_cv:
-        _text_single_queue.append(task)
+        if task.priority > 0:
+            _text_single_high_queue.append(task)
+        else:
+            _text_single_normal_queue.append(task)
         _text_single_queue_cv.notify()
 
     if not task.done.wait(timeout=_TEXT_REQUEST_TIMEOUT_SEC):
         with _text_single_queue_cv:
+            queue = _text_single_high_queue if task.priority > 0 else _text_single_normal_queue
             try:
-                _text_single_queue.remove(task)
+                queue.remove(task)
             except ValueError:
                 pass
         raise RuntimeError(
@@ -489,6 +677,7 @@ def load_models():
                 f"Unsupported embedding backend: {backend_name}. "
                 "Supported: tei, local_st"
             )
+        _start_text_dispatch_workers()
         logger.info("Text backend loaded successfully: %s", _text_backend_name)
     except Exception as e:
         logger.error("Failed to load text model: %s", e, exc_info=True)
@@ -532,6 +721,7 @@ def load_models():
 @app.on_event("shutdown")
 def stop_workers() -> None:
     _stop_text_batch_worker()
+    _stop_text_dispatch_workers()
 
 
 def _normalize_vector(vec: np.ndarray) -> np.ndarray:
@@ -602,6 +792,8 @@ _try_full_image_cache_hit(
 def health() -> Dict[str, Any]:
     """Health check endpoint. Returns status and current throttling stats."""
     ready = (not open_text_model or _text_model is not None) and (not open_image_model or _image_model is not None)
+    text_dispatch_depth = _text_dispatch_queue_depth()
+    text_microbatch_depth = _text_microbatch_queue_depth()
     return {
         "status": "ok" if ready else "degraded",
         "service_kind": _SERVICE_KIND,
@@ -620,9 +812,18 @@ def health() -> Dict[str, Any]:
             "text": _text_stats.snapshot(),
             "image": _image_stats.snapshot(),
         },
+        "text_dispatch": {
+            "workers": _text_dispatch_worker_count,
+            "workers_alive": sum(1 for worker in _text_dispatch_workers if worker.is_alive()),
+            "queue_depth": text_dispatch_depth["total"],
+            "queue_depth_high": text_dispatch_depth["high"],
+            "queue_depth_normal": text_dispatch_depth["normal"],
+        },
         "text_microbatch": {
             "window_ms": round(_TEXT_MICROBATCH_WINDOW_SEC * 1000.0, 3),
-            "queue_depth": len(_text_single_queue),
+            "queue_depth": text_microbatch_depth["total"],
+            "queue_depth_high": text_microbatch_depth["high"],
+            "queue_depth_normal": text_microbatch_depth["normal"],
             "worker_alive": bool(_text_batch_worker is not None and _text_batch_worker.is_alive()),
             "request_timeout_sec": _TEXT_REQUEST_TIMEOUT_SEC,
         },
@@ -654,6 +855,7 @@ def _embed_text_impl(
     normalized: List[str],
     effective_normalize: bool,
     request_id: str,
+    priority: int = 0,
 ) -> _EmbedResult:
     if _text_model is None:
         raise RuntimeError("Text model not loaded")
@@ -703,6 +905,7 @@ def _embed_text_impl(
                 missing_texts[0],
                 normalize=effective_normalize,
                 request_id=request_id,
+                priority=priority,
             )
         ]
         mode = "microbatch-single"
@@ -777,6 +980,7 @@ async def embed_text(
     http_request: Request,
     response: Response,
     normalize: Optional[bool] = None,
+    priority: int = 0,
 ) -> List[Optional[List[float]]]:
     if _text_model is None:
         raise HTTPException(status_code=503, detail="Text embedding model not loaded in this service")
@@ -784,6 +988,9 @@ async def embed_text(
     request_id = _resolve_request_id(http_request)
     response.headers["X-Request-ID"] = request_id
 
+    if priority < 0:
+        raise HTTPException(status_code=400, detail="priority must be >= 0")
+    effective_priority = _effective_priority(priority)
     effective_normalize = bool(CONFIG.TEXT_NORMALIZE_EMBEDDINGS) if normalize is None else bool(normalize)
     normalized: List[str] = []
     for i, t in enumerate(texts):
@@ -806,8 +1013,9 @@ async def embed_text(
             cache_misses=0,
         )
         logger.info(
-            "embed_text response | backend=%s mode=cache-only inputs=%d normalize=%s dim=%d cache_hits=%d cache_misses=0 first_vector=%s latency_ms=%.2f",
+            "embed_text response | backend=%s mode=cache-only priority=%s inputs=%d normalize=%s dim=%d cache_hits=%d cache_misses=0 first_vector=%s latency_ms=%.2f",
             _text_backend_name,
+            _priority_label(effective_priority),
             len(normalized),
             effective_normalize,
             len(cache_only.vectors[0]) if cache_only.vectors and cache_only.vectors[0] is not None else 0,
@@ -818,13 +1026,14 @@ async def embed_text(
         )
         return cache_only.vectors

-    accepted, active = _text_request_limiter.try_acquire()
+    accepted, active = _text_request_limiter.try_acquire(bypass_limit=effective_priority > 0)
     if not accepted:
         _text_stats.record_rejected()
         logger.warning(
-            "embed_text rejected | client=%s backend=%s inputs=%d normalize=%s active=%d limit=%d preview=%s",
+            "embed_text rejected | client=%s backend=%s priority=%s inputs=%d normalize=%s active=%d limit=%d preview=%s",
             _request_client(http_request),
             _text_backend_name,
+            _priority_label(effective_priority),
             len(normalized),
             effective_normalize,
             active,
@@ -834,7 +1043,10 @@ async def embed_text(
         )
         raise HTTPException(
             status_code=_OVERLOAD_STATUS_CODE,
-            detail=f"Text embedding service busy: active={active}, limit={_TEXT_MAX_INFLIGHT}",
+            detail=(
+                "Text embedding service busy for priority=0 requests: "
+                f"active={active}, limit={_TEXT_MAX_INFLIGHT}"
+            ),
         )

     request_started = time.perf_counter()
@@ -844,9 +1056,10 @@ async def embed_text(
     cache_misses = 0
     try:
         logger.info(
-            "embed_text request | client=%s backend=%s inputs=%d normalize=%s active=%d limit=%d preview=%s",
+            "embed_text request | client=%s backend=%s priority=%s inputs=%d normalize=%s active=%d limit=%d preview=%s",
             _request_client(http_request),
             _text_backend_name,
+            _priority_label(effective_priority),
             len(normalized),
             effective_normalize,
             active,
@@ -855,13 +1068,20 @@ async def embed_text(
             extra=_request_log_extra(request_id),
         )
         verbose_logger.info(
-            "embed_text detail | payload=%s normalize=%s backend=%s",
+            "embed_text detail | payload=%s normalize=%s backend=%s priority=%s",
             normalized,
             effective_normalize,
             _text_backend_name,
+            _priority_label(effective_priority),
             extra=_request_log_extra(request_id),
         )
-        result = await run_in_threadpool(_embed_text_impl, normalized, effective_normalize, request_id)
+        result = await run_in_threadpool(
+            _submit_text_dispatch_and_wait,
+            normalized,
+            effective_normalize,
+            request_id,
+            effective_priority,
+        )
         success = True
         backend_elapsed_ms = result.backend_elapsed_ms
         cache_hits = result.cache_hits
@@ -875,9 +1095,10 @@ async def embed_text(
             cache_misses=cache_misses,
         )
         logger.info(
-            "embed_text response | backend=%s mode=%s inputs=%d normalize=%s dim=%d cache_hits=%d cache_misses=%d first_vector=%s latency_ms=%.2f",
+            "embed_text response | backend=%s mode=%s priority=%s inputs=%d normalize=%s dim=%d cache_hits=%d cache_misses=%d first_vector=%s latency_ms=%.2f",
             _text_backend_name,
             result.mode,
+            _priority_label(effective_priority),
             len(normalized),
             effective_normalize,
             len(result.vectors[0]) if result.vectors and result.vectors[0] is not None else 0,
@@ -888,8 +1109,9 @@ async def embed_text(
             extra=_request_log_extra(request_id),
         )
         verbose_logger.info(
-            "embed_text result detail | count=%d first_vector=%s latency_ms=%.2f",
+            "embed_text result detail | count=%d priority=%s first_vector=%s latency_ms=%.2f",
             len(result.vectors),
+            _priority_label(effective_priority),
             result.vectors[0][: _VECTOR_PREVIEW_DIMS]
             if result.vectors and result.vectors[0] is not None
             else [],
@@ -909,8 +1131,9 @@ async def embed_text(
             cache_misses=cache_misses,
         )
         logger.error(
-            "embed_text failed | backend=%s inputs=%d normalize=%s latency_ms=%.2f error=%s",
+            "embed_text failed | backend=%s priority=%s inputs=%d normalize=%s latency_ms=%.2f error=%s",
             _text_backend_name,
+            _priority_label(effective_priority),
             len(normalized),
             effective_normalize,
             latency_ms,
@@ -922,8 +1145,9 @@ async def embed_text(
     finally:
         remaining = _text_request_limiter.release(success=success)
         logger.info(
-            "embed_text finalize | success=%s active_after=%d",
+            "embed_text finalize | success=%s priority=%s active_after=%d",
             success,
+            _priority_label(effective_priority),
             remaining,
             extra=_request_log_extra(request_id),
         )
@@ -1019,6 +1243,7 @@ async def embed_image(
     http_request: Request,
     response: Response,
     normalize: Optional[bool] = None,
+    priority: int = 0,
 ) -> List[Optional[List[float]]]:
     if _image_model is None:
         raise HTTPException(status_code=503, detail="Image embedding model not loaded in this service")
@@ -1026,6 +1251,10 @@ async def embed_image(
     request_id = _resolve_request_id(http_request)
     response.headers["X-Request-ID"] = request_id

+    if priority < 0:
+        raise HTTPException(status_code=400, detail="priority must be >= 0")
+    effective_priority = _effective_priority(priority)
+
     effective_normalize = bool(CONFIG.IMAGE_NORMALIZE_EMBEDDINGS) if normalize is None else bool(normalize)
     urls: List[str] = []
     for i, url_or_path in enumerate(images):
@@ -1048,7 +1277,8 @@ async def embed_image(
             cache_misses=0,
         )
         logger.info(
-            "embed_image response | mode=cache-only inputs=%d normalize=%s dim=%d cache_hits=%d cache_misses=0 first_vector=%s latency_ms=%.2f",
+            "embed_image response | mode=cache-only priority=%s inputs=%d normalize=%s dim=%d cache_hits=%d cache_misses=0 first_vector=%s latency_ms=%.2f",
+            _priority_label(effective_priority),
             len(urls),
             effective_normalize,
             len(cache_only.vectors[0]) if cache_only.vectors and cache_only.vectors[0] is not None else 0,
@@ -1059,12 +1289,13 @@ async def embed_image(
         )
         return cache_only.vectors

-    accepted, active = _image_request_limiter.try_acquire()
+    accepted, active = _image_request_limiter.try_acquire(bypass_limit=effective_priority > 0)
     if not accepted:
         _image_stats.record_rejected()
         logger.warning(
-            "embed_image rejected | client=%s inputs=%d normalize=%s active=%d limit=%d preview=%s",
+            "embed_image rejected | client=%s priority=%s inputs=%d normalize=%s active=%d limit=%d preview=%s",
             _request_client(http_request),
+            _priority_label(effective_priority),
             len(urls),
             effective_normalize,
             active,
@@ -1074,7 +1305,10 @@ async def embed_image(
         )
         raise HTTPException(
             status_code=_OVERLOAD_STATUS_CODE,
-            detail=f"Image embedding service busy: active={active}, limit={_IMAGE_MAX_INFLIGHT}",
+            detail=(
+                "Image embedding service busy for priority=0 requests: "
+                f"active={active}, limit={_IMAGE_MAX_INFLIGHT}"
+            ),
         )

     request_started = time.perf_counter()
@@ -1084,8 +1318,9 @@ async def embed_image(
     cache_misses = 0
     try:
         logger.info(
-            "embed_image request | client=%s inputs=%d normalize=%s active=%d limit=%d preview=%s",
+            "embed_image request | client=%s priority=%s inputs=%d normalize=%s active=%d limit=%d preview=%s",
             _request_client(http_request),
+            _priority_label(effective_priority),
             len(urls),
             effective_normalize,
             active,
@@ -1094,9 +1329,10 @@ async def embed_image(
             extra=_request_log_extra(request_id),
         )
         verbose_logger.info(
-            "embed_image detail | payload=%s normalize=%s",
+            "embed_image detail | payload=%s normalize=%s priority=%s",
             urls,
             effective_normalize,
+            _priority_label(effective_priority),
             extra=_request_log_extra(request_id),
         )
         result = await run_in_threadpool(_embed_image_impl, urls, effective_normalize, request_id)
@@ -1113,8 +1349,9 @@ async def embed_image(
             cache_misses=cache_misses,
         )
         logger.info(
-            "embed_image response | mode=%s inputs=%d normalize=%s dim=%d cache_hits=%d cache_misses=%d first_vector=%s latency_ms=%.2f",
+            "embed_image response | mode=%s priority=%s inputs=%d normalize=%s dim=%d cache_hits=%d cache_misses=%d first_vector=%s latency_ms=%.2f",
             result.mode,
+            _priority_label(effective_priority),
             len(urls),
             effective_normalize,
             len(result.vectors[0]) if result.vectors and result.vectors[0] is not None else 0,
@@ -1146,7 +1383,8 @@ async def embed_image(
             cache_misses=cache_misses,
         )
         logger.error(
-            "embed_image failed | inputs=%d normalize=%s latency_ms=%.2f error=%s",
+            "embed_image failed | priority=%s inputs=%d normalize=%s latency_ms=%.2f error=%s",
+            _priority_label(effective_priority),
             len(urls),
             effective_normalize,
             latency_ms,
@@ -1158,8 +1396,9 @@ async def embed_image(
     finally:
         remaining = _image_request_limiter.release(success=success)
         logger.info(
-            "embed_image finalize | success=%s active_after=%d",
+            "embed_image finalize | success=%s priority=%s active_after=%d",
             success,
+            _priority_label(effective_priority),
             remaining,
             extra=_request_log_extra(request_id),
         )
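The `try_acquire(bypass_limit=effective_priority > 0)` calls above imply a limiter where priority>0 requests still occupy an inflight slot but are never rejected, so priority=0 traffic absorbs the back-pressure. A minimal sketch of that semantics (the class name, fields, and exact signature here are assumptions; the repo's actual limiter is not shown in this diff):

```python
import threading


class RequestLimiter:
    """Hypothetical inflight limiter sketch matching the diff's
    try_acquire(bypass_limit=...) / release(success=...) call sites."""

    def __init__(self, max_inflight: int):
        self._max = max_inflight
        self._active = 0
        self._lock = threading.Lock()

    def try_acquire(self, bypass_limit: bool = False):
        # bypass_limit=True (priority > 0): always admitted, but still
        # counted, so priority=0 admission sees the extra load.
        with self._lock:
            if not bypass_limit and self._active >= self._max:
                return False, self._active
            self._active += 1
            return True, self._active

    def release(self, success: bool = True) -> int:
        with self._lock:
            self._active -= 1
            return self._active
```

With `max_inflight=1`, a second priority=0 acquire is rejected while a bypassing acquire still succeeds and pushes `active` to 2, which is exactly why online bursts can starve out offline (priority=0) requests.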
embeddings/text_encoder.py
@@ -35,7 +35,12 @@ class TextEmbeddingEncoder:
             expire_time=self.expire_time,
         )

-    def _call_service(self, request_data: List[str], normalize_embeddings: bool = True) -> List[Any]:
+    def _call_service(
+        self,
+        request_data: List[str],
+        normalize_embeddings: bool = True,
+        priority: int = 0,
+    ) -> List[Any]:
         """
         Call the embedding service API.

@@ -48,7 +53,10 @@ class TextEmbeddingEncoder:
         try:
             response = requests.post(
                 self.endpoint,
-                params={"normalize": "true" if normalize_embeddings else "false"},
+                params={
+                    "normalize": "true" if normalize_embeddings else "false",
+                    "priority": max(0, int(priority)),
+                },
                 json=request_data,
                 timeout=60
             )
@@ -62,6 +70,7 @@ class TextEmbeddingEncoder:
         self,
         sentences: Union[str, List[str]],
         normalize_embeddings: bool = True,
+        priority: int = 0,
         device: str = 'cpu',
         batch_size: int = 32
     ) -> np.ndarray:
@@ -100,7 +109,11 @@ class TextEmbeddingEncoder:

         # If there are uncached texts, call service
         if uncached_texts:
-            response_data = self._call_service(request_data, normalize_embeddings=normalize_embeddings)
+            response_data = self._call_service(
+                request_data,
+                normalize_embeddings=normalize_embeddings,
+                priority=priority,
+            )

             # Process response
             for i, text in enumerate(uncached_texts):
perf_reports/README.md 0 → 100644
@@ -0,0 +1,34 @@
+# Performance Test Report Index
+
+This directory stores the raw JSON and notes for each load test / matrix run. **Prefer reusing** the repository scripts rather than writing new ones:
+
+| Script | Purpose |
+|--------|---------|
+| `scripts/perf_api_benchmark.py` | Load-tests the search backend, embedding, translation, and rerank HTTP endpoints; supports `--embed-text-priority` / `--embed-image-priority` and `scripts/perf_cases.json.example` |
+
+Historical matrix example (concurrency sweep):
+
+- `2026-03-12/matrix_report/summary.md` — in the same directory as `summary.json`
+
+## 2026-03-20 — Embedding service `priority` parameter smoke test
+
+Environment: local `127.0.0.1:6005` (text) and `127.0.0.1:6008` (image). Commands and results are in the JSON files in this directory:
+
+| Report file | Scenario | Notes |
+|-------------|----------|-------|
+| `2026-03-20_embed_text_p0.json` | `embed_text` | `priority=0` (default), 8 s, concurrency 10 |
+| `2026-03-20_embed_text_p1.json` | `embed_text` | `--embed-text-priority 1`, 8 s, concurrency 10 |
+| `2026-03-20_embed_image_p0.json` | `embed_image` | `priority=0`, 8 s, concurrency 5 |
+| `2026-03-20_embed_image_p1.json` | `embed_image` | `--embed-image-priority 1`, 8 s, concurrency 5 |
+
+To reproduce:
+
+```bash
+source activate.sh
+python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json
+python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json
+python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json
+python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json
+```
+
+Note: this was an **8-second smoke** run, so the duration/concurrency are not directly comparable to the `2026-03-12` matrix; it only verifies that the service still returns 200 and payload validation passes with the `priority` parameter set.
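For reference, the `priority` knob in the smoke test above rides on the URL as an ordinary query parameter next to `normalize`. A minimal sketch of the resulting request shape (the host/port are taken from the smoke-test environment above; the exact server URL is an assumption):

```python
from urllib.parse import urlencode

# Hypothetical request shape for the text embedding endpoint.
# priority > 0 marks the call as online traffic that bypasses admission.
base = "http://127.0.0.1:6005/embed/text"
params = {"normalize": "true", "priority": "1"}
url = f"{base}?{urlencode(params)}"
print(url)  # http://127.0.0.1:6005/embed/text?normalize=true&priority=1
```

The request body stays a plain JSON array of texts; only the query string changes between the p0 and p1 runs.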
query/query_parser.py
@@ -442,7 +442,7 @@ class QueryParser:
         # Submit encoding task to thread pool for async execution
         encoding_executor = ThreadPoolExecutor(max_workers=1)
         def _encode_query_vector() -> Optional[np.ndarray]:
-            arr = self.text_encoder.encode([query_text])
+            arr = self.text_encoder.encode([query_text], priority=1)
             if arr is None or len(arr) == 0:
                 return None
             vec = arr[0]
scripts/perf_api_benchmark.py
@@ -15,6 +15,9 @@ Examples:
     python scripts/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162
     python scripts/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162
     python scripts/perf_api_benchmark.py --scenario all --cases-file scripts/perf_cases.json.example --output perf_result.json
+    # Embedding admission / priority (query param `priority`; same semantics as embedding service):
+    python scripts/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20
+    python scripts/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10
 """

 from __future__ import annotations
@@ -72,9 +75,9 @@ def validate_response_payload(
 ) -> Tuple[bool, str]:
     """
     Lightweight payload validation for correctness-aware perf tests.
-    Currently strict for embed_text to catch NaN/null vector regressions.
+    Strict for embed_text / embed_image to catch NaN/null vector regressions.
     """
-    if scenario_name != "embed_text":
+    if scenario_name not in ("embed_text", "embed_image"):
         return True, ""

     expected_len = len(tpl.json_body) if isinstance(tpl.json_body, list) else None
@@ -219,6 +222,43 @@ def load_cases_from_file(path: Path, tenant_id: str) -> Dict[str, List[RequestTe
     return out


+def apply_embed_priority_params(
+    scenarios: Dict[str, Scenario],
+    embed_text_priority: int,
+    embed_image_priority: int,
+) -> None:
+    """
+    Merge default `priority` query param into embed templates when absent.
+    `scripts/perf_cases.json` may set per-request `params.priority` to override.
+    """
+    mapping = {
+        "embed_text": max(0, int(embed_text_priority)),
+        "embed_image": max(0, int(embed_image_priority)),
+    }
+    for name, pri in mapping.items():
+        if name not in scenarios:
+            continue
+        scen = scenarios[name]
+        new_templates: List[RequestTemplate] = []
+        for t in scen.templates:
+            params = dict(t.params or {})
+            params.setdefault("priority", str(pri))
+            new_templates.append(
+                RequestTemplate(
+                    method=t.method,
+                    path=t.path,
+                    params=params,
+                    json_body=t.json_body,
+                    headers=t.headers,
+                )
+            )
+        scenarios[name] = Scenario(
+            name=scen.name,
+            templates=new_templates,
+            timeout_sec=scen.timeout_sec,
+        )
+
+
 def build_scenarios(args: argparse.Namespace) -> Dict[str, Scenario]:
     defaults = make_default_templates(args.tenant_id)
     if args.cases_file:
@@ -252,6 +292,11 @@ def build_scenarios(args: argparse.Namespace) -> Dict[str, Scenario]:
             )
         )
         scenarios[name] = Scenario(name=name, templates=rewritten, timeout_sec=args.timeout)
+    apply_embed_priority_params(
+        scenarios,
+        embed_text_priority=args.embed_text_priority,
+        embed_image_priority=args.embed_image_priority,
+    )
     return scenarios


@@ -483,6 +528,18 @@ def parse_args() -> argparse.Namespace:
         default=0,
         help="Optional top_n for rerank requests in dynamic docs mode (0 means omit top_n).",
     )
+    parser.add_argument(
+        "--embed-text-priority",
+        type=int,
+        default=0,
+        help="Default query param priority= for embed_text (0=offline admission; >0 bypasses rejection). Merged into params unless set in --cases-file.",
+    )
+    parser.add_argument(
+        "--embed-image-priority",
+        type=int,
+        default=0,
+        help="Default query param priority= for embed_image (same semantics as embed-text-priority).",
+    )
     return parser.parse_args()


@@ -609,6 +666,8 @@ async def main_async() -> int:
     print(f"  embedding_image_base={args.embedding_image_base}")
     print(f"  translator_base={args.translator_base}")
     print(f"  reranker_base={args.reranker_base}")
+    print(f"  embed_text_priority={args.embed_text_priority}")
+    print(f"  embed_image_priority={args.embed_image_priority}")
     if args.rerank_dynamic_docs:
         print("  rerank_dynamic_docs=True")
         print(f"  rerank_doc_count={args.rerank_doc_count}")
@@ -667,6 +726,8 @@ async def main_async() -> int:
             "rerank_query": args.rerank_query,
             "rerank_seed": args.rerank_seed,
             "rerank_top_n": args.rerank_top_n,
+            "embed_text_priority": args.embed_text_priority,
+            "embed_image_priority": args.embed_image_priority,
         },
         "results": results,
         "overall": aggregate_results(results),
scripts/perf_cases.json.example
@@ -32,9 +32,18 @@
     {
       "method": "POST",
       "path": "/embed/text",
+      "params": {"priority": "0"},
       "json": ["wireless mouse", "gaming keyboard", "USB-C cable", "barbie doll"]
     }
   ],
+  "embed_image": [
+    {
+      "method": "POST",
+      "path": "/embed/image",
+      "params": {"normalize": "true", "priority": "0"},
+      "json": ["/data/saas-search/docs/image-dress1.png"]
+    }
+  ],
   "translate": [
     {
       "method": "POST",
search/searcher.py
@@ -791,7 +791,7 @@ class Searcher:
             # Generate image embedding
             if self.image_encoder is None:
                 raise RuntimeError("Image encoder is not initialized at startup")
-            image_vector = self.image_encoder.encode_image_from_url(image_url)
+            image_vector = self.image_encoder.encode_image_from_url(image_url, priority=1)

             if image_vector is None:
                 raise ValueError(f"Failed to encode image: {image_url}")
tests/ci/test_service_api_contracts.py
@@ -540,7 +540,15 @@ def test_indexer_index_validation_max_delete_spu_ids(indexer_client: TestClient)


 class _FakeTextModel:
-    def encode_batch(self, texts, batch_size=32, device="cpu", normalize_embeddings=True):
+    """Matches TEI / server path: `_text_model.encode(...)` (not encode_batch)."""
+
+    def encode(
+        self,
+        texts,
+        batch_size=32,
+        device="cpu",
+        normalize_embeddings=True,
+    ):
         return [np.array([0.1, 0.2, 0.3], dtype=np.float32) for _ in texts]


@@ -549,6 +557,18 @@ class _FakeImageModel:
         return [np.array([0.3, 0.2, 0.1], dtype=np.float32) for _ in urls]


+class _EmbeddingCacheMiss:
+    """Avoid Redis/module cache hits so contract tests exercise the encode path."""
+
+    redis_client = None
+
+    def get(self, key):
+        return None
+
+    def set(self, key, value):
+        return True
+
+
 @pytest.fixture
 def embedding_module():
     import embeddings.server as emb_server
@@ -556,17 +576,31 @@ def embedding_module():
     emb_server.app.router.on_startup.clear()
     emb_server._text_model = _FakeTextModel()
     emb_server._image_model = _FakeImageModel()
+    emb_server._text_backend_name = "tei"
+    emb_server._text_cache = _EmbeddingCacheMiss()
+    emb_server._image_cache = _EmbeddingCacheMiss()
     yield emb_server


 def test_embedding_text_contract(embedding_module):
-    data = embedding_module.embed_text(["hello", "world"])
+    """Contract via HTTP like production; route handlers require Request/Response."""
+    from fastapi.testclient import TestClient
+
+    with TestClient(embedding_module.app) as client:
+        resp = client.post("/embed/text", json=["hello", "world"])
+        assert resp.status_code == 200
+        data = resp.json()
     assert len(data) == 2
     assert len(data[0]) == 3


 def test_embedding_image_contract(embedding_module):
-    data = embedding_module.embed_image(["https://example.com/a.jpg"])
+    from fastapi.testclient import TestClient
+
+    with TestClient(embedding_module.app) as client:
+        resp = client.post("/embed/image", json=["https://example.com/a.jpg"])
+        assert resp.status_code == 200
+        data = resp.json()
     assert len(data[0]) == 3


tests/test_embedding_pipeline.py
@@ -63,7 +63,11 @@ class _FakeTranslator:


 class _FakeQueryEncoder:
+    def __init__(self):
+        self.calls = []
+
     def encode(self, sentences, **kwargs):
+        self.calls.append({"sentences": sentences, "kwargs": dict(kwargs)})
         if isinstance(sentences, str):
             sentences = [sentences]
         return np.array([np.array([0.11, 0.22, 0.33], dtype=np.float32) for _ in sentences], dtype=object)
@@ -98,9 +102,7 @@ def _build_test_config() -> SearchConfig:
         rerank=RerankConfig(),
         spu_config=SPUConfig(enabled=True, spu_field="spu_id", inner_hits_size=3),
         es_index_name="test_products",
-        tenant_config={},
         es_settings={},
-        services={},
     )


@@ -111,6 +113,7 @@ def test_text_embedding_encoder_response_alignment(monkeypatch): @@ -111,6 +113,7 @@ def test_text_embedding_encoder_response_alignment(monkeypatch):
111 def _fake_post(url, json, timeout, **kwargs): 113 def _fake_post(url, json, timeout, **kwargs):
112 assert url.endswith("/embed/text") 114 assert url.endswith("/embed/text")
113 assert json == ["hello", "world"] 115 assert json == ["hello", "world"]
  116 + assert kwargs["params"]["priority"] == 0
114 return _FakeResponse([[0.1, 0.2], [0.3, 0.4]]) 117 return _FakeResponse([[0.1, 0.2], [0.3, 0.4]])
115 118
116 monkeypatch.setattr("embeddings.text_encoder.requests.post", _fake_post) 119 monkeypatch.setattr("embeddings.text_encoder.requests.post", _fake_post)
@@ -172,6 +175,7 @@ def test_image_embedding_encoder_cache_hit(monkeypatch): @@ -172,6 +175,7 @@ def test_image_embedding_encoder_cache_hit(monkeypatch):
172 175
173 def _fake_post(url, params, json, timeout, **kwargs): 176 def _fake_post(url, params, json, timeout, **kwargs):
174 calls["count"] += 1 177 calls["count"] += 1
  178 + assert params["priority"] == 0
175 return _FakeResponse([[0.1, 0.2]]) 179 return _FakeResponse([[0.1, 0.2]])
176 180
177 monkeypatch.setattr("embeddings.image_encoder.requests.post", _fake_post) 181 monkeypatch.setattr("embeddings.image_encoder.requests.post", _fake_post)
@@ -184,16 +188,35 @@ def test_image_embedding_encoder_cache_hit(monkeypatch): @@ -184,16 +188,35 @@ def test_image_embedding_encoder_cache_hit(monkeypatch):
184 assert np.allclose(out[1], np.array([0.1, 0.2], dtype=np.float32)) 188 assert np.allclose(out[1], np.array([0.1, 0.2], dtype=np.float32))
185 189
186 190
  191 +def test_image_embedding_encoder_passes_priority(monkeypatch):
  192 + fake_cache = _FakeEmbeddingCache()
  193 + monkeypatch.setattr("embeddings.image_encoder.RedisEmbeddingCache", lambda **kwargs: fake_cache)
  194 +
  195 + def _fake_post(url, params, json, timeout, **kwargs):
  196 + assert params["priority"] == 1
  197 + return _FakeResponse([[0.1, 0.2]])
  198 +
  199 + monkeypatch.setattr("embeddings.image_encoder.requests.post", _fake_post)
  200 +
  201 + encoder = CLIPImageEncoder(service_url="http://127.0.0.1:6008")
  202 + out = encoder.encode_batch(["https://example.com/a.jpg"], priority=1)
  203 + assert len(out) == 1
  204 + assert np.allclose(out[0], np.array([0.1, 0.2], dtype=np.float32))
  205 +
  206 +
187 def test_query_parser_generates_query_vector_with_encoder(): 207 def test_query_parser_generates_query_vector_with_encoder():
  208 + encoder = _FakeQueryEncoder()
188 parser = QueryParser( 209 parser = QueryParser(
189 config=_build_test_config(), 210 config=_build_test_config(),
190 - text_encoder=_FakeQueryEncoder(), 211 + text_encoder=encoder,
191 translator=_FakeTranslator(), 212 translator=_FakeTranslator(),
192 ) 213 )
193 214
194 parsed = parser.parse("red dress", tenant_id="162", generate_vector=True) 215 parsed = parser.parse("red dress", tenant_id="162", generate_vector=True)
195 assert parsed.query_vector is not None 216 assert parsed.query_vector is not None
196 assert parsed.query_vector.shape == (3,) 217 assert parsed.query_vector.shape == (3,)
  218 + assert encoder.calls
  219 + assert encoder.calls[0]["kwargs"]["priority"] == 1
197 220
198 221
199 def test_query_parser_skips_query_vector_when_disabled(): 222 def test_query_parser_skips_query_vector_when_disabled():
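The client-side change these assertions pin down is small: the encoder forwards `priority` to the embedding service as a query parameter, defaulting to `0` for offline callers. A minimal sketch of that pattern (hypothetical helper with illustrative names, not the actual `text_encoder` code; the injectable `post` callable mirrors how the tests monkeypatch `requests.post`):

```python
from typing import Callable, List, Optional

def embed_text(
    texts: List[str],
    priority: int = 0,
    post: Optional[Callable] = None,
    service_url: str = "http://127.0.0.1:6006",
    timeout: float = 10.0,
) -> List[List[float]]:
    """Call the text-embedding endpoint, forwarding priority as a query param.

    priority=0 (default) keeps offline callers subject to the server's
    inflight admission limit; priority=1 asks the server to bypass it.
    """
    if post is None:
        import requests  # only needed when no client is injected
        post = requests.post
    resp = post(
        f"{service_url}/embed/text",
        params={"priority": priority},  # read server-side for admission control
        json=texts,
        timeout=timeout,
    )
    resp.raise_for_status()
    return resp.json()
```

This is the shape the tests above verify from the outside: they substitute a fake `post` and assert on `kwargs["params"]["priority"]` without any network I/O.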
tests/test_embedding_service_limits.py
@@ -69,6 +69,8 @@ def test_health_exposes_limit_stats(monkeypatch): @@ -69,6 +69,8 @@ def test_health_exposes_limit_stats(monkeypatch):
69 69
70 70
71 def test_embed_image_rejects_when_image_lane_is_full(monkeypatch): 71 def test_embed_image_rejects_when_image_lane_is_full(monkeypatch):
  72 + # Ensure no cache hit (module-level Redis cache may contain this URL from other tests).
  73 + monkeypatch.setattr(embedding_server, "_image_cache", _FakeCache({}))
72 limiter = embedding_server._InflightLimiter("image", 1) 74 limiter = embedding_server._InflightLimiter("image", 1)
73 acquired, _ = limiter.try_acquire() 75 acquired, _ = limiter.try_acquire()
74 assert acquired is True 76 assert acquired is True
tests/test_embedding_service_priority.py 0 → 100644
@@ -0,0 +1,81 @@ @@ -0,0 +1,81 @@
  1 +import threading
  2 +
  3 +import embeddings.server as emb_server
  4 +
  5 +
  6 +def test_text_inflight_limiter_priority_bypass():
  7 + limiter = emb_server._InflightLimiter(name="text", limit=1)
  8 +
  9 + accepted, active = limiter.try_acquire()
  10 + assert accepted is True
  11 + assert active == 1
  12 +
  13 + accepted, active = limiter.try_acquire()
  14 + assert accepted is False
  15 + assert active == 1
  16 +
  17 + accepted, active = limiter.try_acquire(bypass_limit=True)
  18 + assert accepted is True
  19 + assert active == 2
  20 +
  21 + snapshot = limiter.snapshot()
  22 + assert snapshot["priority_bypass_total"] == 1
  23 +
  24 + limiter.release(success=True)
  25 + limiter.release(success=True)
  26 +
  27 +
  28 +def test_text_dispatch_prefers_high_priority_queue():
  29 + high_task = emb_server._TextDispatchTask(
  30 + normalized=["online"],
  31 + effective_normalize=True,
  32 + request_id="high",
  33 + priority=1,
  34 + created_at=0.0,
  35 + done=threading.Event(),
  36 + )
  37 + normal_task = emb_server._TextDispatchTask(
  38 + normalized=["offline"],
  39 + effective_normalize=True,
  40 + request_id="normal",
  41 + priority=0,
  42 + created_at=0.0,
  43 + done=threading.Event(),
  44 + )
  45 +
  46 + with emb_server._text_dispatch_cv:
  47 + emb_server._text_dispatch_high_queue.clear()
  48 + emb_server._text_dispatch_normal_queue.clear()
  49 + emb_server._text_dispatch_normal_queue.append(normal_task)
  50 + emb_server._text_dispatch_high_queue.append(high_task)
  51 +
  52 + first = emb_server._pop_text_dispatch_task_locked()
  53 + second = emb_server._pop_text_dispatch_task_locked()
  54 +
  55 + emb_server._text_dispatch_high_queue.clear()
  56 + emb_server._text_dispatch_normal_queue.clear()
  57 +
  58 + assert first is high_task
  59 + assert second is normal_task
  60 +
  61 +
  62 +def test_image_inflight_limiter_priority_bypass():
  63 + limiter = emb_server._InflightLimiter(name="image", limit=1)
  64 +
  65 + accepted, active = limiter.try_acquire()
  66 + assert accepted is True
  67 + assert active == 1
  68 +
  69 + accepted, active = limiter.try_acquire()
  70 + assert accepted is False
  71 + assert active == 1
  72 +
  73 + accepted, active = limiter.try_acquire(bypass_limit=True)
  74 + assert accepted is True
  75 + assert active == 2
  76 +
  77 + snapshot = limiter.snapshot()
  78 + assert snapshot["priority_bypass_total"] == 1
  79 +
  80 + limiter.release(success=True)
  81 + limiter.release(success=True)
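The behaviour this new test file pins down can be sketched as follows (an illustrative reimplementation under assumptions drawn from the tests, not the actual `embeddings/server.py` code): `try_acquire(bypass_limit=True)` always admits but still increments the active count, over-limit bypasses are tallied in `priority_bypass_total`, and the dispatcher drains the high-priority deque before the normal one.

```python
import threading
from collections import deque

class InflightLimiter:
    """Admission control: reject above `limit` unless the caller bypasses."""

    def __init__(self, name: str, limit: int):
        self.name = name
        self.limit = limit
        self._lock = threading.Lock()
        self._active = 0
        self._priority_bypass_total = 0

    def try_acquire(self, bypass_limit: bool = False):
        with self._lock:
            if self._active >= self.limit and not bypass_limit:
                # Over limit and no priority: caller maps this to an
                # overload status code (e.g. 429/503).
                return False, self._active
            self._active += 1
            if bypass_limit and self._active > self.limit:
                # Counted only when admission succeeded *because* of the
                # bypass (assumption; the tests only exercise this case).
                self._priority_bypass_total += 1
            return True, self._active

    def release(self, success: bool = True):
        with self._lock:
            self._active -= 1

    def snapshot(self):
        with self._lock:
            return {
                "active": self._active,
                "limit": self.limit,
                "priority_bypass_total": self._priority_bypass_total,
            }

def pop_dispatch_task(high: deque, normal: deque):
    """Drain priority>0 tasks first; FIFO within each lane."""
    if high:
        return high.popleft()
    if normal:
        return normal.popleft()
    return None
```

Note the design consequence the tests encode: a priority request is never rejected, but it still occupies inflight capacity, so a burst of online traffic naturally starves out priority-0 offline indexing requests at admission time.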