diff --git a/docs/issue-2026-03-29-索引修改-done-0330.md b/docs/issue-2026-03-29-索引修改-done-0330.md new file mode 100644 index 0000000..f6e33a1 --- /dev/null +++ b/docs/issue-2026-03-29-索引修改-done-0330.md @@ -0,0 +1,43 @@ + + + + + +工程(金伟)配合修改: + + +一、tags字段改支持多语言: +spu表tags字段,跟title走一样的翻译逻辑,填入原始语言、zh、en。 + +检查以下字段,都跟title一样走翻译逻辑 +title +keywords +tags +brief +description +vendor +category_path +category_name_text + + +二、/indexer/enrich-content接口的修改 +1. 请求参数,把language去掉,因为我返回的内容直接对应索引结构,不用你做处理了,因此不需要指定语言,降低耦合。 +2. 返回 enriched_attributes enriched_tags qanchors三个字段,按原始内容填入。 +3. enriched_tags是本次新增的,注意区别于tags字段。tags字段来源于mysql spu表,enriched_tags是本接口返回的。 + + +三、specifications的value,需要翻译,也是需要填中英文: +{ + "specifications": [ + { + "sku_id": "sku-red-s", + "name": "color", + "value_keyword": "красный", + "value_text": { + "zh": "红色", + "en": "red" + } + } + ] +} + diff --git a/docs/issue-2026-03-30-query分析性能优化-done-0331.md b/docs/issue-2026-03-30-query分析性能优化-done-0331.md new file mode 100644 index 0000000..0a3f6f4 --- /dev/null +++ b/docs/issue-2026-03-30-query分析性能优化-done-0331.md @@ -0,0 +1,264 @@ + +总体的目的是: +1)要对原始query进行翻译(通常是en/zh的一种或者两种) +2)对原始query要有关键词提取(关键词提取可能依赖分词;中英文分词要不一样,各自寻求最优性能的方法。zh的可以保持不变,en的可以优化) +3)其他的一些任务可能依赖分词 +4)获取text embedding/clip embedding + + +英文关键词提取:走spacy进行关键词提取(即主干分析)。提取出query中的核心词,用于搜索时候的term求交,其余的词不参与求交、只用于权重计算。 + + +实现: + +# Query 模块说明 + +本目录实现搜索请求侧的**查询理解与解析**:在不做 Elasticsearch 语言计划拼装的前提下,产出可供检索层、重排层与调试界面消费的**结构化事实**(规范化文本、检测语言、可选翻译、文本与 CLIP 向量、分词与关键词、可选的样式意图与标题排除配置等)。下面按**当前实现**说明策略与数据流,便于与 `search/`、`context/`、`frontend/` 对照阅读。 + +--- + +## 包内文件与职责 + +| 文件 | 作用 | +|------|------| +| `query_parser.py` | 入口 `QueryParser`:编排规范化、改写、语言检测、异步翻译与向量、分词、关键词、意图与排除检测;定义 `ParsedQuery`。 | +| `tokenization.py` | 轻量分词、文本规范化、`TokenizedText` 与按请求复用的 `QueryTextAnalysisCache`(模型分词与语言提示、粗细分词策略)。 | +| `keyword_extractor.py` | `KeywordExtractor`:中文走 HanLP 分词 + 词性名词串;英文走 spaCy 核心词;`collect_keywords_queries` 汇总 `base` 与各翻译语种。 | +| `english_keyword_extractor.py` | 
`EnglishKeywordExtractor`:`en_core_web_sm` + 依存/名词块规则,产出短字符串供检索侧关键词子句使用。 | +| `language_detector.py` | 脚本优先 + Lingua 的通用语言检测(与 `QueryParser` 的英文 ASCII 快路径配合使用)。 | +| `query_rewriter.py` | 基于配置词典的查询改写与规范化。 | +| `style_intent.py` | 从配置加载样式意图词表,对查询变体做候选匹配,产出 `StyleIntentProfile`。 | +| `product_title_exclusion.py` | 从配置加载标题排除规则,对多路查询文本做触发词匹配,产出 `ProductTitleExclusionProfile`。 | + +公开符号见 `query/__init__.py`(`QueryParser`、`ParsedQuery`、`KEYWORDS_QUERY_BASE_KEY` 等)。 + +--- + +## 解析产物:`ParsedQuery` + +`ParsedQuery` 是单次 `parse()` 的权威结果容器,字段含义与下游约定如下。 + +- **`original_query` / `query_normalized` / `rewritten_query`**:分别为原始输入、规范化后、词典改写后的主查询文本;后续翻译、向量、默认分词与 `base` 关键词均以**改写后的 `rewritten_query`(在代码变量中常名为 `query_text`)**为基准。 +- **`detected_language`**:解析时认定的源语言代码;若检测为 `unknown` 或空,则回退到 `SearchConfig.query_config.default_language`。 +- **`translations`**:键为**目标语言代码**(如 `zh`、`en`),值为翻译服务返回的字符串;仅包含本次请求实际需要的目标语种(见下文翻译目标推导)。 +- **`query_vector` / `image_query_vector`**:分别为 BGE 类文本向量与 CLIP 文本向量(维度由各自编码服务决定);未生成或未在超时内完成则为 `None`。 +- **`query_tokens`**:对**改写后主查询**做分词后的字符串列表,供例如 KNN 参数按 token 数分支等逻辑使用;分词路径由 `QueryTextAnalysisCache` 决定(纯拉丁英文可走轻量分词,含汉字则走 HanLP)。 +- **`keywords_queries`**:与「主查询 + 各翻译变体」平行的**关键词子查询**映射:键 `base`(常量 `KEYWORDS_QUERY_BASE_KEY`)对应源语言侧关键词串,其它键与 `translations` 的语种键一致。空串或无法提取的条目**不会写入**字典。 +- **`style_intent_profile` / `product_title_exclusion_profile`**:可选的理解结果;是否生效完全由 `config.yaml` 中 `query_config` 的对应开关与词表/规则决定。 +- **`_text_analysis_cache`**:单次解析内的分词与语言提示缓存,**不参与序列化**,仅供同一次 `parse` 内各检测器复用,避免对同一文本重复调用 HanLP。 + +与重排相关的文本选择由独立函数 `rerank_query_text()` 完成:检测为 `zh` 或 `en` 时始终用原始查询;其它语言优先英译再中译,见 `query_parser.py` 中实现。 + +--- + +## `QueryParser.parse()` 的执行顺序与策略 + +解析主流程在 `QueryParser.parse()` 中实现。整体目标是:在**共享等待预算**下并行完成翻译与向量请求,同时尽量减少主线程上重复、昂贵的分词与 NLP 调用,并把结果写入可选的 `context`(请求上下文)供日志与 `debug_info` 使用。 + +### 1. 
规范化与改写 + +- 使用 `QueryNormalizer` 得到 `query_normalized` 并可选写入 `context.store_intermediate_result('query_normalized', ...)`。 +- 若配置了改写词典,则用 `QueryRewriter` 可能更新 `query_text`;改写成功时记录 `rewritten_query` 与告警。 + +### 2. 语言检测:通用路径与英文 ASCII 快路径 + +- **快路径**:当「活跃语种集合」仅为 `en` 与 `zh` 的子集时(活跃集合取 `target_languages` 归一化结果,若为空则回退到 `query_config.supported_languages`),且当前查询为**纯 ASCII、含字母、不含汉字**,则**直接判定为 `en`**,不再调用 `LanguageDetector`(避免 Lingua 等开销)。逻辑见 `_detect_query_language()` 与 `_is_ascii_latin_query()`。 + +```303:317:query/query_parser.py + def _detect_query_language( + self, + query_text: str, + *, + target_languages: Optional[List[str]] = None, + ) -> str: + normalized_targets = self._normalize_language_codes(target_languages) + supported_languages = self._normalize_language_codes( + getattr(self.config.query_config, "supported_languages", None) + ) + active_languages = normalized_targets or supported_languages + if active_languages and set(active_languages).issubset({"en", "zh"}): + if self._is_ascii_latin_query(query_text): + return "en" + return self.language_detector.detect(query_text) +``` + +- **通用路径**:`LanguageDetector` 先按 Unicode 脚本返回明确语种(如汉字块即 `zh`),否则用 Lingua 在一大组语言中判别,见 `language_detector.py`。 + +检测最终结果写入 `context.store_intermediate_result('detected_language', ...)`(若提供 `context`)。 + +### 3. 按请求分词缓存与语言提示 + +每次 `parse` 会新建 `QueryTextAnalysisCache(tokenizer=self._tokenizer)`,并对**原始串、规范化串、改写后串**调用 `set_language_hint(..., detected_lang)`,使后续对同一文本的 `get_tokenizer_result` / `get_tokenized_text` 能按语言选择**是否调用 HanLP**。 + +### 4. HanLP 模型(与 `KeywordExtractor` 对齐) + +`QueryParser` 默认构建的 `self._tokenizer` 为 HanLP 预训练分词模型 **`FINE_ELECTRA_SMALL_ZH`**,并开启 `output_spans=True`,以便与关键词提取共用「带偏移的分词结果」。 + +```237:245:query/query_parser.py + def _build_tokenizer(self) -> Callable[[str], Any]: + """Build the tokenizer used by query parsing. 
No fallback path by design.""" + if hanlp is None: + raise RuntimeError("HanLP is required for QueryParser tokenization") + logger.info("Initializing HanLP tokenizer...") + tokenizer = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH) + tokenizer.config.output_spans = True + logger.info("HanLP tokenizer initialized") + return tokenizer +``` + +`KeywordExtractor` 在未注入自定义 `tokenizer` 时同样加载 **`FINE_ELECTRA_SMALL_ZH`**,并额外加载 **`CTB9_POS_ELECTRA_SMALL`** 做词性标注;二者在「中文路径」上语义一致,便于复用 `tokenizer_result`。 + +### 5. 异步富集:翻译、文本向量、CLIP 文本向量 + +- 翻译目标:`translation_targets = normalized_targets` 中**去掉与检测源语言相同**的代码后的列表(例如源为 `en` 且索引语言为 `["en","zh"]` 时只翻 `zh`)。 +- 翻译模型名:由 `_pick_query_translation_model()` 根据「源语言是否在索引语言内」及 `zh↔en` 等分支从 `QueryConfig` 选取。 +- 当 `generate_vector` 为真且配置开启文本嵌入时,向线程池提交 `text_encoder.encode([query_text], ...)`;当配置了 `image_embedding_field` 时提交 `image_encoder.encode_clip_text(query_text, ...)`。 +- 线程池:`ThreadPoolExecutor`,`max_workers` 为 `min(任务数, 4)` 与至少 1。 +- **提交顺序**:先尽可能提交所有异步任务,再在主线程上做「与异步重叠」的轻量工作(见下一节),最后 `concurrent.futures.wait(..., timeout=budget_sec)`。超时未完成的任务会记 warning,并 `shutdown(wait=False)` 不阻塞关闭线程池。 + +等待预算(毫秒)来自 `QueryConfig`: + +- 源语言在索引语言内:`translation_embedding_wait_budget_ms_source_in_index` +- 否则:`translation_embedding_wait_budget_ms_source_not_in_index` + +完成每个 future 后打 `Async enrichment task finished` 日志(含 `elapsed_ms`,为从提交到完成的大致墙钟时间)。 + +### 6. 主查询分词与「base」关键词(与异步重叠) + +在异步任务已提交之后、`wait()` 之前,当前实现会: + +1. 通过 `text_analysis_cache.get_tokenizer_result(query_text)` 得到分词结果,再 `extract_token_strings` 得到 **`query_tokens`**; +2. 调用 `KeywordExtractor.extract_keywords(query_text, language_hint=detected_lang, tokenizer_result=...)` 得到 **`keywords_base_query`**(若失败则日志告警,base 关键词可能为空)。 + +这样主线程在等翻译/向量时,已并行完成源侧分词与源侧关键词的大部分工作。 + +### 7. 
等待结束后的关键词汇总与检测器 + +`wait()` 返回后: + +- 若有翻译结果,写入 `context.store_intermediate_result("translations", translations)`,并对每条翻译 `text_analysis_cache.set_language_hint(result, lang)`,保证后续对该翻译串的分词/关键词走正确语言路径。 +- `collect_keywords_queries(...)` 合并 **`base`**(可传入已算好的 `base_keywords_query` 避免重复抽取)与各翻译语种的关键词,得到 **`keywords_queries`**;成功时 `context.store_intermediate_result("keywords_queries", keywords_queries)` 并打 `Keyword extraction completed` 日志。 +- 构造带 `_text_analysis_cache` 的 `ParsedQuery` 草稿,依次调用 `StyleIntentDetector.detect` 与 `ProductTitleExclusionDetector.detect`,再把完整 `ParsedQuery` 返回。 + +解析阶段会打聚合耗时日志 `Query parse stage timings`,字段含义为: + +- **`before_wait_ms`**:从解析开始计时点到进入 `wait()` 之前的主线程耗时(含规范化、改写、语言检测、提交异步任务、主查询分词、base 关键词等); +- **`async_wait_ms`**:`wait()` 阻塞时间; +- **`base_keywords_ms`**:base 关键词抽取耗时; +- **`keyword_tail_ms`**:`collect_keywords_queries` 及前后尾部逻辑中关键词相关部分的主要耗时; +- **`tail_sync_ms`**:`wait()` 之后整段同步尾巴(含关键词汇总、两检测器、写中间结果等)。 + +--- + +## 分词与 `QueryTextAnalysisCache` + +### `get_tokenizer_result`:何时走 HanLP,何时走轻量分词 + +- 若未配置模型 `tokenizer`,直接返回空列表路径的轻量结果(由上层避免依赖)。 +- 若根据**该文本的语言提示**与**是否含汉字**判断不需要模型:返回 `simple_tokenize_query` 的列表(字符串 token),**不调用 HanLP**。 +- 否则对该文本调用一次 `self.tokenizer(text)`(HanLP),结果按文本缓存,同一次 `parse` 内重复访问同一字符串不会重复推理。 + +核心判断在 `_should_use_model_tokenizer`:**语言提示为 `zh` 时,仅当文本含汉字才用模型**;非 `zh` 提示时,仅当文本含汉字才用模型。因此纯英文主查询在提示为 `en` 时走轻量分词;中文翻译串在 `set_language_hint(..., "zh")` 且含汉字时走 HanLP。 + +### `coarse_tokens` 与 `fine_tokens`:`TokenizedText` + +- **`fine_tokens`**:来自 `extract_token_strings(get_tokenizer_result(...))`,在中文路径上即 HanLP 分词后的词串(已按规范化键去重保序)。 +- **`coarse_tokens`**:由 `_build_coarse_tokens` 决定。若语言提示为 **`zh`**,或文本含汉字且已有 `tokenizer_tokens`,则 **粗粒度 token 与 fine 一致**(即采用模型分词粒度,而不用「整段 CJK 连成一项」的纯正则策略)。否则使用 `simple_tokenize_query`(适合拉丁词、数字、带连字符/撇号的英文词形)。 + +```92:103:query/tokenization.py +def _build_coarse_tokens( + text: str, + *, + language_hint: Optional[str], + tokenizer_tokens: Sequence[str], +) -> List[str]: + normalized_language = 
normalize_query_text(language_hint) + if normalized_language == "zh" or (contains_han_text(text) and tokenizer_tokens): + # Chinese coarse tokenization should follow the model tokenizer rather than a + # regex that collapses the whole sentence into one CJK span. + return list(_dedupe_preserve_order(tokenizer_tokens)) + return _dedupe_preserve_order(simple_tokenize_query(text)) +``` + +- **`candidates`**:在 fine、coarse、两类 n-gram 短语(上限由 `max_ngram` 控制)以及整句 `normalized_text` 上合并去重,供 `StyleIntentDetector`、`ProductTitleExclusionDetector` 等做子串/短语级匹配。 + +`tokenize_text()` 是对单次无缓存场景的薄封装:内部新建 `QueryTextAnalysisCache` 再 `get_tokenized_text`。 + +--- + +## 关键词提取:`KeywordExtractor` 与 `collect_keywords_queries` + +### 路由规则 + +`extract_keywords` 根据 `language_hint` 分支: + +- **`en`**:完全交给 `EnglishKeywordExtractor`(spaCy),**不使用** HanLP 分词结果做 POS 名词筛选(即使调用方传入 `tokenizer_result` 也会被忽略在该路径内)。 +- **`zh`**:使用 HanLP 分词结果(优先复用传入的 `tokenizer_result`),再对词序列跑 `CTB9_POS_ELECTRA_SMALL`,保留**长度 ≥ 2 且词性以 `N` 开头**的词;非连续名词之间插入空格拼接成一条字符串(与 ES 侧 `keywords_query` 的用法一致)。 +- **其它非空语言码**:当前实现返回空串,即**不为该语种生成关键词子句**(由调用方决定是否跳过)。 + +### `collect_keywords_queries` + +- 键 **`base`**:对应 `rewritten_query` 的关键词;若调用方已预先计算 `base_keywords_query` 则直接写入,避免重复抽取。 +- 其它键:与 `translations` 中每个非空语种一一对应,语言码归一化为小写。 +- 全程可传入 `text_analysis_cache`,以便 `get_tokenizer_result` 命中缓存并与检测器共享分词结果。 + +常量 `KEYWORDS_QUERY_BASE_KEY` 的值为字符串 **`"base"`**,与检索构建里读取的字段一致。 + +--- + +## 英文关键词:`EnglishKeywordExtractor` + +- 依赖 **spaCy** 模型 **`en_core_web_sm`**,加载时关闭 `ner`、`textcat` 以减轻开销;加载失败时记录 warning 并走基于 `simple_tokenize_query` 的回退策略。 +- 主路径用依存句法与名词块规则收集一小组「核心词」候选(如直接宾语名词、部分 ROOT 名词/专有名词、INTJ 结构下的宾语等),并处理价格/目的介词宾语降级、人口学名词(如 `women`)弱化、尺寸类 ROOT 与主语搭配等边界情况。 +- 使用 `_project_terms_to_query_tokens` 将 spaCy 词形映射回查询中的**表面分词**(例如复合词 `t-shirt`),避免在关键词串中出现被错误切断的片段。 + +最终返回**最多三个词**的空格连接字符串,用于检索侧第二层 `combined_fields` 的紧凑查询(见下节)。 + +--- + +## 与检索层的关系(消费方摘要) + +`ParsedQuery.keywords_queries` 由 `search/es_query_builder.py` 读取:在构建某一语言的 lexical 子句时,除主 
`combined_fields`(完整 `query`)外,若存在非空的 `keywords_query` 且与主查询不同,会追加第二个 `combined_fields`,使用单独的 `minimum_should_match`(由 builder 的 `keywords_minimum_should_match` 配置)和较低 boost,从而在**不替代全文查询**的前提下加强核心词匹配。 + +`query_tokens` 在同文件中间接影响例如带文本向量时的 KNN 分支参数(按 token 数量选用长查询的 k / num_candidates 等)。具体字段与 boost 以 `ESQueryBuilder` 当前实现为准。 + +--- + +## 样式意图与标题排除(简要) + +- **`StyleIntentRegistry` / `StyleIntentDetector`**:从 `QueryConfig.style_intent_terms` 等加载意图定义;`detect` 时按中英变体取查询文本,经 `tokenize_text` 或缓存得到 `TokenizedText`,在 `candidates` 上与配置同义词表匹配,输出 `StyleIntentProfile`(含 `query_variants` 与命中意图列表)。 +- **`ProductTitleExclusionRegistry` / `ProductTitleExclusionDetector`**:从 `QueryConfig.product_title_exclusion_rules` 加载规则;对 `original_query`、`query_normalized`、`rewritten_query` 及所有 `translations` 去重后分词匹配触发词,输出 `ProductTitleExclusionProfile`。 + +二者均依赖 `tokenization` 与可选的 HanLP,启用与否由配置项控制。 + +--- + +## 可观测性与调试 + +当 `QueryParser.parse(..., context=...)` 传入请求上下文时,典型中间结果包括: + +- `query_normalized`、`rewritten_query`、`detected_language`、`query_tokens` +- `translation_{lang}`、`translations` +- `keywords_queries` +- `query_vector_shape`、`image_query_vector_shape` +- `style_intent_profile`、`product_title_exclusion_profile` + +搜索主流程在 `search/searcher.py` 中会把解析结果写入 `QueryAnalysisResult`(含 **`keywords_queries`**),并在 `debug=true` 时把 `query_analysis` 挂到响应的 `debug_info`;前端调试页在 `frontend/static/js/app.js` 中展示 **Translations** 与 **Keywords Queries** 等块,便于与翻译结果并列查看。 + +--- + +## 依赖与环境提示 + +- **HanLP**:分词与中文词性标注;模型名以本文与源码为准(`FINE_ELECTRA_SMALL_ZH` + `CTB9_POS_ELECTRA_SMALL`)。 +- **spaCy**:英文关键词路径需要可导入的 **`en_core_web_sm`**(若缺失则英文关键词退化为轻量规则)。 +- **Lingua**:通用语言检测(在英文 ASCII 快路径不适用时参与拉丁语系判别)。 + +运行与测试时请使用项目约定的虚拟环境(见仓库根目录 `CLAUDE.md` / `activate.sh`),避免系统 Python 缺少上述依赖。 + +--- + +## 扩展与测试 + +- 单元测试中与解析、分词、意图相关的用例分布在 `tests/test_query_parser_mixed_language.py`、`tests/test_tokenization.py`、`tests/test_style_intent.py`、`tests/test_product_title_exclusion.py` 等文件中;修改分词或关键词策略时应同步更新或新增测试,以保持与本文描述一致。 + 
+若新增语种或改写语言检测策略,应同步审视:`QueryParser._detect_query_language`、`QueryTextAnalysisCache._should_use_model_tokenizer`、`KeywordExtractor.extract_keywords` 中非 `zh`/`en` 分支,以及 ES 侧是否应为新语种生成 `keywords_query`。 diff --git a/docs/搜索API对接指南-01-搜索接口.md b/docs/搜索API对接指南-01-搜索接口.md index e79c32f..44ab12e 100644 --- a/docs/搜索API对接指南-01-搜索接口.md +++ b/docs/搜索API对接指南-01-搜索接口.md @@ -24,6 +24,76 @@ headers = { response = requests.post(url, headers=headers, json={"query": "芭比娃娃"}) ``` +**cURL 示例(复制即可试)** + +将 `BASE_URL`、`TENANT_ID` 换成你的环境与租户。本地开发默认后端为 `http://localhost:6002`(见 `scripts/start_backend.sh`)。 + +```bash +export BASE_URL="${BASE_URL:-http://localhost:6002}" +export TENANT_ID="${TENANT_ID:-162}" # 改成真实租户 +``` + +**1)基础检索**:关键词、分页、`language` 控制返回字段语言。 + +```bash +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "芭比娃娃", + "size": 20, + "from": 0, + "language": "zh" + }' +``` + +**2)过滤 + 价格区间 + 分面 + 排序**:联调最常见组合(品牌 OR、`min_price` 区间、类目与颜色分面、按最低价升序)。 + +```bash +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 20, + "from": 0, + "language": "zh", + "filters": { + "vendor.zh.keyword": ["品牌A", "品牌B"] + }, + "range_filters": { + "min_price": {"gte": 50, "lte": 200} + }, + "facets": [ + {"field": "category1_name", "size": 15, "type": "terms"}, + {"field": "specifications.color", "size": 20, "type": "terms"} + ], + "sort_by": "min_price", + "sort_order": "asc" + }' +``` + +**3)规格过滤 + SKU 维度 + 调试**:`specifications` 嵌套过滤、`sku_filter_dimension` 每组只保留一个 SKU;`debug: true` 便于对照 ES 与解析结果。 + +```bash +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 10, + "from": 0, + "language": "zh", + "filters": { + "specifications": {"name": "color", "value": "white"} + }, + "sku_filter_dimension": ["color"], + "debug": true + }' +``` + +需要美化 JSON 
输出时,在命令末尾追加 ` | jq .`(需本机安装 [jq](https://jqlang.github.io/jq/))。 + ### 3.2 请求参数 #### 完整请求体结构 @@ -897,5 +967,19 @@ response = requests.post(url, headers=headers, json={"query": "芭比娃娃"}) } ``` +**cURL(第 2 页,每页 20 条)**: + +```bash +curl -sS "${BASE_URL:-http://localhost:6002}/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: ${TENANT_ID:-162}" \ + -d '{ + "query": "手机", + "size": 20, + "from": 20, + "language": "zh" + }' +``` + --- diff --git a/docs/搜索API对接指南-速查表.md b/docs/搜索API对接指南-速查表.md index 20a358e..eaef281 100644 --- a/docs/搜索API对接指南-速查表.md +++ b/docs/搜索API对接指南-速查表.md @@ -1,56 +1,91 @@ # API 快速参考 (v3.0) +## 环境与 cURL + +- **默认地址**:`http://localhost:6002`(与仓库 `scripts/start_backend.sh` 一致)。 +- **租户**:必须使用请求头 **`X-Tenant-ID`**(推荐);一般不要依赖把 `tenant_id` 放在 URL 上。 +- 下列命令假设已设置: + +```bash +export BASE_URL="${BASE_URL:-http://localhost:6002}" +export TENANT_ID="${TENANT_ID:-163}" # 改成你的租户ID +``` + ## 基础搜索 ```bash -POST /search/ -{ - "query": "芭比娃娃", - "size": 20 -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "芭比娃娃", + "size": 20, + "from": 0, + "language": "zh" + }' ``` +等同请求:`POST /search/`,请求体含 `query`、`size`、`from`、`language`。 + --- ## 精确匹配过滤 +在 `POST /search/` 的 JSON 里使用 `filters`(下面示例与「完整示例」可对照字段含义): + ```bash -{ - "filters": { - "category_name": "手机", // 单值 - "category1_name": "服装", // 一级类目 - "vendor.zh.keyword": ["奇乐", "品牌A"], // 多值(OR) - "tags": "手机", // 标签 - // specifications 嵌套过滤 - "specifications": { - "name": "color", - "value": "white" +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 10, + "language": "zh", + "filters": { + "category_name": "手机", + "category1_name": "服装", + "vendor.zh.keyword": ["奇乐", "品牌A"], + "tags": "手机", + "specifications": {"name": "color", "value": "white"} } - } -} + }' ``` ### Specifications 过滤 -**单个规格**: +**单个规格**: + ```bash -{ - "filters": { - 
"specifications": {"name": "color", "value": "white"} - } -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 10, + "language": "zh", + "filters": { + "specifications": {"name": "color", "value": "white"} + } + }' ``` -**多个规格(按维度分组)**: +**多个规格(不同 name 为 AND)**: + ```bash -{ - "filters": { - "specifications": [ - {"name": "color", "value": "white"}, - {"name": "size", "value": "256GB"} - ] - } -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 10, + "language": "zh", + "filters": { + "specifications": [ + {"name": "color", "value": "white"}, + {"name": "size", "value": "256GB"} + ] + } + }' ``` 说明:不同维度(不同name)是AND关系,相同维度(相同name)的多个值是OR关系。 @@ -59,14 +94,17 @@ POST /search/ ## 范围过滤 ```bash -{ - "range_filters": { - "min_price": { - "gte": 50, // >= - "lte": 200 // <= +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "玩具", + "size": 20, + "language": "zh", + "range_filters": { + "min_price": {"gte": 50, "lte": 200} } - } -} + }' ``` **操作符**: `gte` (>=), `gt` (>), `lte` (<=), `lt` (<) @@ -78,47 +116,73 @@ POST /search/ ### 简单模式 ```bash -{ - "facets": ["category1_name", "category2_name", "category3_name", "specifications"] -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "玩具", + "size": 20, + "language": "zh", + "facets": ["category1_name", "category2_name", "category3_name", "specifications"] + }' ``` ### Specifications 分面 -**所有规格名称**: +**所有规格名称**: + ```bash -{ - "facets": ["specifications"] // 返回所有name及其value列表 -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 10, + "language": "zh", + "facets": ["specifications"] + }' ``` -**指定规格名称**: +**指定规格名称**: 
+ ```bash -{ - "facets": ["specifications.color", "specifications.size", "specifications.material"] // 只返回指定name的value列表 -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 10, + "language": "zh", + "facets": ["specifications.color", "specifications.size", "specifications.material"] + }' ``` ### 高级模式 ```bash -{ - "facets": [ - {"field": "category1_name", "size": 15}, - { - "field": "min_price", - "type": "range", - "ranges": [ - {"key": "0-50", "to": 50}, - {"key": "50-100", "from": 50, "to": 100} - ] - }, - "specifications", // 所有规格名称 - "specifications.color", // 指定规格名称 - "specifications.size", - "specifications.material" - ] -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 20, + "language": "zh", + "facets": [ + {"field": "category1_name", "size": 15}, + { + "field": "min_price", + "type": "range", + "ranges": [ + {"key": "0-50", "to": 50}, + {"key": "50-100", "from": 50, "to": 100} + ] + }, + "specifications", + "specifications.color", + "specifications.size", + "specifications.material" + ] + }' ``` --- @@ -128,12 +192,19 @@ POST /search/ **功能**: 按指定维度对每个SPU下的SKU进行分组,每组只返回第一个SKU。 ```bash -{ - "query": "芭比娃娃", - "sku_filter_dimension": "color" // 按颜色筛选(假设option1_name="color") -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "芭比娃娃", + "size": 20, + "language": "zh", + "sku_filter_dimension": ["color"] + }' ``` +(`sku_filter_dimension` 为数组;按颜色筛选时需索引里 `option*_name` 与维度一致,见正文说明。) + **支持的维度值**: - `option1`, `option2`, `option3`: 直接使用选项字段 - 规格名称(如 `color`, `size`): 通过 `option1_name`、`option2_name`、`option3_name` 匹配 @@ -157,10 +228,16 @@ POST /search/ ## 排序 ```bash -{ - "sort_by": "min_price", - "sort_order": "asc" // asc 或 desc -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H 
"X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "玩具", + "size": 20, + "language": "zh", + "sort_by": "min_price", + "sort_order": "asc" + }' ``` --- @@ -168,46 +245,57 @@ POST /search/ ## 分页 ```bash -{ - "size": 20, // 每页数量 - "from": 0 // 偏移量(第1页=0,第2页=20) -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 20, + "from": 20, + "language": "zh" + }' ``` +(第 1 页 `from: 0`,第 2 页 `from: 20`,以此类推。) + --- ## 完整示例 +一键联调:过滤 + 区间 + 分面 + 排序 + SKU 维度(请按实际类目/规格改 `filters`)。 + ```bash -POST /search/ -Headers: X-Tenant-ID: 162 -{ - "query": "手机", - "size": 20, - "from": 0, - "language": "zh", - "filters": { - "category_name": "手机", - "category1_name": "电子产品", - "specifications": {"name": "color", "value": "white"} - }, - "range_filters": { - "min_price": {"gte": 50, "lte": 200} - }, - "facets": [ - {"field": "category1_name", "size": 15}, - {"field": "category2_name", "size": 15}, - "specifications.color", - "specifications.size" - ], - "sort_by": "min_price", - "sort_order": "asc", - "sku_filter_dimension": "color" // 可选:按颜色筛选SKU -} +curl -sS "$BASE_URL/search/" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "query": "手机", + "size": 20, + "from": 0, + "language": "zh", + "filters": { + "category_name": "手机", + "category1_name": "电子产品", + "specifications": {"name": "color", "value": "white"} + }, + "range_filters": { + "min_price": {"gte": 50, "lte": 200} + }, + "facets": [ + {"field": "category1_name", "size": 15}, + {"field": "category2_name", "size": 15}, + "specifications.color", + "specifications.size" + ], + "sort_by": "min_price", + "sort_order": "asc", + "sku_filter_dimension": ["color"] + }' ``` --- + ## 响应格式 ```json @@ -275,26 +363,60 @@ Headers: X-Tenant-ID: 162 ## 其他端点 +**图搜**(`POST /search/image`,与文本搜一样带 `X-Tenant-ID`): + ```bash -POST /search/image -{ - "image_url": "https://example.com/image.jpg", - "size": 20 -} +curl -sS 
"$BASE_URL/search/image" \ + -H "Content-Type: application/json" \ + -H "X-Tenant-ID: $TENANT_ID" \ + -d '{ + "image_url": "https://example.com/product.jpg", + "size": 20, + "range_filters": { + "min_price": {"gte": 10, "lte": 500} + } + }' +``` + +(图搜请求体仅支持 `image_url`、`size`、可选的 `filters` / `range_filters`;与文本搜的 `language` / `from` 无关。) + +**搜索建议**: + +```bash +curl -sS -G "$BASE_URL/search/suggestions" \ + --data-urlencode "q=芭" \ + --data-urlencode "size=5" \ + --data-urlencode "language=zh" \ + -H "X-Tenant-ID: $TENANT_ID" +``` -GET /search/suggestions?q=芭&size=5&language=zh +**即时搜索**(当前实现返回 501,勿用于生产): -GET /search/instant?q=玩具&size=5 # 当前返回 501 Not Implemented +```bash +curl -sS -G "$BASE_URL/search/instant" \ + --data-urlencode "q=玩具" \ + --data-urlencode "size=5" \ + -H "X-Tenant-ID: $TENANT_ID" +``` -GET /search/{doc_id} +**按文档 ID 取详情**(将 `YOUR_SPU_ID` 换成真实 `spu_id`): -GET /admin/health -GET /admin/config -GET /admin/stats # 需 X-Tenant-ID 或 tenant_id +```bash +curl -sS "$BASE_URL/search/YOUR_SPU_ID" \ + -H "X-Tenant-ID: $TENANT_ID" +``` + +**运维**: + +```bash +curl -sS "$BASE_URL/admin/health" +curl -sS "$BASE_URL/admin/config" +curl -sS "$BASE_URL/admin/stats" -H "X-Tenant-ID: $TENANT_ID" ``` --- + ## Python 快速示例 ```python diff --git a/query/README.md b/query/README.md new file mode 100644 index 0000000..e0dba73 --- /dev/null +++ b/query/README.md @@ -0,0 +1,251 @@ +# Query 模块说明 + +本目录实现搜索请求侧的**查询理解与解析**:在不做 Elasticsearch 语言计划拼装的前提下,产出可供检索层、重排层与调试界面消费的**结构化事实**(规范化文本、检测语言、可选翻译、文本与 CLIP 向量、分词与关键词、可选的样式意图与标题排除配置等)。下面按**当前实现**说明策略与数据流,便于与 `search/`、`context/`、`frontend/` 对照阅读。 + +--- + +## 包内文件与职责 + +| 文件 | 作用 | +|------|------| +| `query_parser.py` | 入口 `QueryParser`:编排规范化、改写、语言检测、异步翻译与向量、分词、关键词、意图与排除检测;定义 `ParsedQuery`。 | +| `tokenization.py` | 轻量分词、文本规范化、`TokenizedText` 与按请求复用的 `QueryTextAnalysisCache`(模型分词与语言提示、粗细分词策略)。 | +| `keyword_extractor.py` | `KeywordExtractor`:中文走 HanLP 分词 + 词性名词串;英文走 spaCy 核心词;`collect_keywords_queries` 汇总 `base` 与各翻译语种。 | +| 
`english_keyword_extractor.py` | `EnglishKeywordExtractor`:`en_core_web_sm` + 依存/名词块规则,产出短字符串供检索侧关键词子句使用。 | +| `language_detector.py` | 脚本优先 + Lingua 的通用语言检测(与 `QueryParser` 的英文 ASCII 快路径配合使用)。 | +| `query_rewriter.py` | 基于配置词典的查询改写与规范化。 | +| `style_intent.py` | 从配置加载样式意图词表,对查询变体做候选匹配,产出 `StyleIntentProfile`。 | +| `product_title_exclusion.py` | 从配置加载标题排除规则,对多路查询文本做触发词匹配,产出 `ProductTitleExclusionProfile`。 | + +公开符号见 `query/__init__.py`(`QueryParser`、`ParsedQuery`、`KEYWORDS_QUERY_BASE_KEY` 等)。 + +--- + +## 解析产物:`ParsedQuery` + +`ParsedQuery` 是单次 `parse()` 的权威结果容器,字段含义与下游约定如下。 + +- **`original_query` / `query_normalized` / `rewritten_query`**:分别为原始输入、规范化后、词典改写后的主查询文本;后续翻译、向量、默认分词与 `base` 关键词均以**改写后的 `rewritten_query`(在代码变量中常名为 `query_text`)**为基准。 +- **`detected_language`**:解析时认定的源语言代码;若检测为 `unknown` 或空,则回退到 `SearchConfig.query_config.default_language`。 +- **`translations`**:键为**目标语言代码**(如 `zh`、`en`),值为翻译服务返回的字符串;仅包含本次请求实际需要的目标语种(见下文翻译目标推导)。 +- **`query_vector` / `image_query_vector`**:分别为 BGE 类文本向量与 CLIP 文本向量(维度由各自编码服务决定);未生成或未在超时内完成则为 `None`。 +- **`query_tokens`**:对**改写后主查询**做分词后的字符串列表,供例如 KNN 参数按 token 数分支等逻辑使用;分词路径由 `QueryTextAnalysisCache` 决定(纯拉丁英文可走轻量分词,含汉字则走 HanLP)。 +- **`keywords_queries`**:与「主查询 + 各翻译变体」平行的**关键词子查询**映射:键 `base`(常量 `KEYWORDS_QUERY_BASE_KEY`)对应源语言侧关键词串,其它键与 `translations` 的语种键一致。空串或无法提取的条目**不会写入**字典。 +- **`style_intent_profile` / `product_title_exclusion_profile`**:可选的理解结果;是否生效完全由 `config.yaml` 中 `query_config` 的对应开关与词表/规则决定。 +- **`_text_analysis_cache`**:单次解析内的分词与语言提示缓存,**不参与序列化**,仅供同一次 `parse` 内各检测器复用,避免对同一文本重复调用 HanLP。 + +与重排相关的文本选择由独立函数 `rerank_query_text()` 完成:检测为 `zh` 或 `en` 时始终用原始查询;其它语言优先英译再中译,见 `query_parser.py` 中实现。 + +--- + +## `QueryParser.parse()` 的执行顺序与策略 + +解析主流程在 `QueryParser.parse()` 中实现。整体目标是:在**共享等待预算**下并行完成翻译与向量请求,同时尽量减少主线程上重复、昂贵的分词与 NLP 调用,并把结果写入可选的 `context`(请求上下文)供日志与 `debug_info` 使用。 + +### 1. 
规范化与改写 + +- 使用 `QueryNormalizer` 得到 `query_normalized` 并可选写入 `context.store_intermediate_result('query_normalized', ...)`。 +- 若配置了改写词典,则用 `QueryRewriter` 可能更新 `query_text`;改写成功时记录 `rewritten_query` 与告警。 + +### 2. 语言检测:通用路径与英文 ASCII 快路径 + +- **快路径**:当「活跃语种集合」仅为 `en` 与 `zh` 的子集时(活跃集合取 `target_languages` 归一化结果,若为空则回退到 `query_config.supported_languages`),且当前查询为**纯 ASCII、含字母、不含汉字**,则**直接判定为 `en`**,不再调用 `LanguageDetector`(避免 Lingua 等开销)。逻辑见 `_detect_query_language()` 与 `_is_ascii_latin_query()`。 + +```303:317:query/query_parser.py + def _detect_query_language( + self, + query_text: str, + *, + target_languages: Optional[List[str]] = None, + ) -> str: + normalized_targets = self._normalize_language_codes(target_languages) + supported_languages = self._normalize_language_codes( + getattr(self.config.query_config, "supported_languages", None) + ) + active_languages = normalized_targets or supported_languages + if active_languages and set(active_languages).issubset({"en", "zh"}): + if self._is_ascii_latin_query(query_text): + return "en" + return self.language_detector.detect(query_text) +``` + +- **通用路径**:`LanguageDetector` 先按 Unicode 脚本返回明确语种(如汉字块即 `zh`),否则用 Lingua 在一大组语言中判别,见 `language_detector.py`。 + +检测最终结果写入 `context.store_intermediate_result('detected_language', ...)`(若提供 `context`)。 + +### 3. 按请求分词缓存与语言提示 + +每次 `parse` 会新建 `QueryTextAnalysisCache(tokenizer=self._tokenizer)`,并对**原始串、规范化串、改写后串**调用 `set_language_hint(..., detected_lang)`,使后续对同一文本的 `get_tokenizer_result` / `get_tokenized_text` 能按语言选择**是否调用 HanLP**。 + +### 4. HanLP 模型(与 `KeywordExtractor` 对齐) + +`QueryParser` 默认构建的 `self._tokenizer` 为 HanLP 预训练分词模型 **`FINE_ELECTRA_SMALL_ZH`**,并开启 `output_spans=True`,以便与关键词提取共用「带偏移的分词结果」。 + +```237:245:query/query_parser.py + def _build_tokenizer(self) -> Callable[[str], Any]: + """Build the tokenizer used by query parsing. 
No fallback path by design."""
+    if hanlp is None:
+        raise RuntimeError("HanLP is required for QueryParser tokenization")
+    logger.info("Initializing HanLP tokenizer...")
+    tokenizer = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)
+    tokenizer.config.output_spans = True
+    logger.info("HanLP tokenizer initialized")
+    return tokenizer
+```
+
+`KeywordExtractor` likewise loads **`FINE_ELECTRA_SMALL_ZH`** when no custom `tokenizer` is injected, and additionally loads **`CTB9_POS_ELECTRA_SMALL`** for POS tagging; the two are semantically consistent on the Chinese path, which allows `tokenizer_result` to be reused.
+
+### 5. Async enrichment: translation, text embedding, CLIP text embedding
+
+- Translation targets: `translation_targets` is `normalized_targets` with the **detected source language removed** (e.g. if the source is `en` and the index languages are `["en","zh"]`, only `zh` is translated).
+- Translation model name: picked by `_pick_query_translation_model()` from `QueryConfig`, branching on whether the source language is among the index languages, on `zh↔en`, etc.
+- When `generate_vector` is true and text embedding is enabled in config, `text_encoder.encode([query_text], ...)` is submitted to the thread pool; when `image_embedding_field` is configured, `image_encoder.encode_clip_text(query_text, ...)` is submitted as well.
+- Thread pool: a `ThreadPoolExecutor` with `max_workers = min(task count, 4)`, and at least 1.
+- **Submission order**: submit all async tasks as early as possible, then run the lightweight main-thread work that overlaps with them (next section), and finally `concurrent.futures.wait(..., timeout=budget_sec)`. Tasks that miss the timeout get a warning log, and the pool is closed with a non-blocking `shutdown(wait=False)`.
+
+The wait budget (in milliseconds) comes from `QueryConfig`:
+
+- source language among the index languages: `translation_embedding_wait_budget_ms_source_in_index`
+- otherwise: `translation_embedding_wait_budget_ms_source_not_in_index`
+
+Each completed future emits an `Async enrichment task finished` log (with `elapsed_ms`, roughly the wall-clock time from submission to completion).
+
+### 6. Main-query tokenization and the `base` keywords (overlapped with async work)
+
+After the async tasks are submitted and before `wait()`, the current implementation:
+
+1. obtains the tokenizer result via `text_analysis_cache.get_tokenizer_result(query_text)`, then applies `extract_token_strings` to get **`query_tokens`**;
+2. calls `KeywordExtractor.extract_keywords(query_text, language_hint=detected_lang, tokenizer_result=...)` to get **`keywords_base_query`** (on failure a warning is logged and the base keywords may be empty).
+
+This way, while the main thread is waiting for translations/vectors, most of the source-side tokenization and keyword work has already run in parallel.
+
+### 7. Keyword aggregation and detectors after the wait
+
+After `wait()` returns:
+
+- If there are translation results, they are stored via `context.store_intermediate_result("translations", translations)`, and each translated string gets `text_analysis_cache.set_language_hint(result, lang)` so that subsequent tokenization/keyword extraction for that string takes the correct language path.
+- `collect_keywords_queries(...)` merges **`base`** (the precomputed `base_keywords_query` can be passed in to avoid re-extraction) with the keywords of each translated language, yielding **`keywords_queries`**; on success it stores `context.store_intermediate_result("keywords_queries", keywords_queries)` and logs `Keyword extraction completed`.
+- A draft `ParsedQuery` carrying `_text_analysis_cache` is built, `StyleIntentDetector.detect` and `ProductTitleExclusionDetector.detect` are invoked in turn, and the completed `ParsedQuery` is returned.
+
+The parse stage emits an aggregate timing log, `Query parse stage timings`, whose fields mean:
+
+- **`before_wait_ms`**: main-thread time from the start of parsing to just before entering `wait()` (normalization, rewriting, language detection, async submission, main-query tokenization, base keywords, etc.);
+- **`async_wait_ms`**: time blocked in `wait()`;
+- **`base_keywords_ms`**: time spent extracting the base keywords;
+- **`keyword_tail_ms`**: the keyword-related portion of `collect_keywords_queries` and the surrounding tail logic;
+- **`tail_sync_ms`**: the entire synchronous tail after `wait()` (keyword aggregation, both detectors, writing intermediate results, etc.).
+
+---
+
+## Tokenization and `QueryTextAnalysisCache`
+
+### `get_tokenizer_result`: when HanLP runs vs. the lightweight tokenizer
+
+- If no model `tokenizer` is configured, the lightweight empty-list path is returned directly (callers avoid depending on it).
+- If the **text's language hint** and **whether it contains Han characters** indicate the model is unnecessary, the list from `simple_tokenize_query` (string tokens) is returned **without calling HanLP**.
+- Otherwise `self.tokenizer(text)` (HanLP) is called once for that text; the result is cached per text, so repeated access to the same string within one `parse` does not re-run inference.
+
+The core check is `_should_use_model_tokenizer`: regardless of the language hint, **the model tokenizer is used only when the text contains Han characters**. A pure-English main query with an `en` hint therefore takes the lightweight path; a Chinese translation string goes through HanLP once `set_language_hint(..., "zh")` is set and the text contains Han characters.
+
+### `coarse_tokens` and `fine_tokens`: `TokenizedText`
+
+- **`fine_tokens`**: from `extract_token_strings(get_tokenizer_result(...))`; on the Chinese path this is the HanLP token list (deduplicated by normalized key, order preserved).
+- **`coarse_tokens`**: decided by `_build_coarse_tokens`. If the language hint is **`zh`**, or the text contains Han characters and `tokenizer_tokens` is non-empty, the **coarse tokens equal the fine tokens** (model granularity, rather than a pure-regex strategy that collapses a whole CJK run into a single item). Otherwise `simple_tokenize_query` is used (suited to Latin words, digits, and hyphenated/apostrophe English word forms).
+
+```92:103:query/tokenization.py
+def _build_coarse_tokens(
+    text: str,
+    *,
+    language_hint: Optional[str],
+    tokenizer_tokens: Sequence[str],
+) -> List[str]:
+    normalized_language = normalize_query_text(language_hint)
+    if normalized_language == "zh" or (contains_han_text(text) and tokenizer_tokens):
+        # Chinese coarse tokenization should follow the model tokenizer rather than a
+        # regex that collapses the whole sentence into one CJK span.
+        return list(_dedupe_preserve_order(tokenizer_tokens))
+    return _dedupe_preserve_order(simple_tokenize_query(text))
+```
+
+- **`candidates`**: merged and deduplicated over fine tokens, coarse tokens, both kinds of n-gram phrases (bounded by `max_ngram`), and the full `normalized_text`, for use by `StyleIntentDetector`, `ProductTitleExclusionDetector`, etc. in substring/phrase-level matching.
+
+`tokenize_text()` is a thin wrapper for one-off, cache-less use: it creates a fresh `QueryTextAnalysisCache` internally and calls `get_tokenized_text`.
+
+---
+
+## Keyword extraction: `KeywordExtractor` and `collect_keywords_queries`
+
+### Routing rules
+
+`extract_keywords` branches on `language_hint`:
+
+- **`en`**: fully delegated to `EnglishKeywordExtractor` (spaCy); HanLP tokens are **not** used for POS-based noun filtering (a `tokenizer_result` passed by the caller is ignored on this path).
+- **`zh`**: uses the HanLP tokens (preferring a passed-in `tokenizer_result`), then runs `CTB9_POS_ELECTRA_SMALL` over the token sequence, keeping words of **length ≥ 2 whose POS tag starts with `N`**; non-adjacent nouns are joined with spaces into a single string (matching how `keywords_query` is used on the ES side).
+- **Other non-empty language codes**: the current implementation returns an empty string, i.e. **no keyword sub-query is generated for that language** (the caller decides whether to skip it).
+
+### `collect_keywords_queries`
+
+- Key **`base`**: the keywords for `rewritten_query`; if the caller has precomputed `base_keywords_query`, it is written directly, avoiding re-extraction.
+- Other keys: one per non-empty language in `translations`, with language codes normalized to lowercase.
+- A `text_analysis_cache` can be passed throughout so that `get_tokenizer_result` hits the cache and tokenization is shared with the detectors.
+
+The constant `KEYWORDS_QUERY_BASE_KEY` has the string value **`"base"`**, matching the field read by the retrieval builder.
+
+---
+
+## English keywords: `EnglishKeywordExtractor`
+
+- Depends on the **spaCy** model **`en_core_web_sm`**, loaded with `ner` and `textcat` disabled to reduce overhead; if loading fails, a warning is logged and a fallback based on `simple_tokenize_query` is used.
+- The main path collects a small set of "core word" candidates via dependency parsing and noun-chunk rules (direct-object nouns, some ROOT nouns/proper nouns, objects under INTJ structures, etc.), and handles edge cases such as demoting price/purpose prepositional objects, down-weighting demographic nouns (e.g. `women`), and size-type ROOTs paired with subjects.
+- `_project_terms_to_query_tokens` maps spaCy lemmas back to the **surface tokens** of the query (e.g. the compound `t-shirt`), so the keyword string never contains incorrectly split fragments.
+
+The final result is a space-joined string of **at most three words**, used for the compact second-tier `combined_fields` query on the retrieval side (see the next section).
+
+---
+
+## Relationship to the retrieval layer (consumer summary)
+
+`ParsedQuery.keywords_queries` is read by `search/es_query_builder.py`: when building the lexical clause for a language, besides the main `combined_fields` (over the full `query`), a second `combined_fields` is appended if a non-empty `keywords_query` exists and differs from the main query. It uses its own `minimum_should_match` (the builder's `keywords_minimum_should_match` setting) and a lower boost, strengthening core-word matching **without replacing the full-text query**.
+
+`query_tokens` indirectly affects, in the same file, e.g. the KNN branch parameters when a text vector is present (choosing the long-query k / num_candidates by token count). For exact fields and boosts, the current `ESQueryBuilder` implementation is authoritative.
+
+---
+
+## Style intent and title exclusion (brief)
+
+- **`StyleIntentRegistry` / `StyleIntentDetector`**: loads intent definitions from `QueryConfig.style_intent_terms` etc.; `detect` takes the query text per zh/en variant, obtains a `TokenizedText` via `tokenize_text` or the cache, matches the configured synonym lists against `candidates`, and outputs a `StyleIntentProfile` (with `query_variants` and the list of matched intents).
+- **`ProductTitleExclusionRegistry` / `ProductTitleExclusionDetector`**: loads rules from `QueryConfig.product_title_exclusion_rules`; tokenizes and matches trigger words against the deduplicated set of `original_query`, `query_normalized`, `rewritten_query`, and all `translations`, and outputs a `ProductTitleExclusionProfile`.
+
+Both rely on `tokenization` and, optionally, HanLP; whether they run is controlled by configuration.
+
+---
+
+## Observability and debugging
+
+When `QueryParser.parse(..., context=...)` is given a request context, typical intermediate results include:
+
+- `query_normalized`, `rewritten_query`, `detected_language`, `query_tokens`
+- `translation_{lang}`, `translations`
+- `keywords_queries`
+- `query_vector_shape`, `image_query_vector_shape`
+- `style_intent_profile`, `product_title_exclusion_profile`
+
+The main search flow in `search/searcher.py` writes the parse result into `QueryAnalysisResult` (including **`keywords_queries`**) and, when `debug=true`, attaches `query_analysis` to the response's `debug_info`; the frontend debug page in `frontend/static/js/app.js` renders the **Translations** and **Keywords Queries** blocks alongside the translation results.
+
+---
+
+## Dependencies and environment notes
+
+- **HanLP**: tokenization and Chinese POS tagging; model names as given in this document and the source (`FINE_ELECTRA_SMALL_ZH` + `CTB9_POS_ELECTRA_SMALL`).
+- **spaCy**: the English keyword path needs an importable **`en_core_web_sm`** (if missing, English keywords degrade to the lightweight rules).
+- **Lingua**: general language detection (participates in Latin-script discrimination when the English ASCII fast path does not apply).
+
+Run and test inside the project's designated virtual environment (see `CLAUDE.md` / `activate.sh` in the repo root) to avoid a system Python that lacks the dependencies above.
+
+---
+
+## Extension and testing
+
+- Unit tests covering parsing, tokenization, and intent live in `tests/test_query_parser_mixed_language.py`, `tests/test_tokenization.py`, `tests/test_style_intent.py`, and `tests/test_product_title_exclusion.py`; when changing the tokenization or keyword strategy, update or add tests to keep them consistent with this document.
+ 
+If a new language is added or the language-detection strategy is rewritten, review in lockstep: `QueryParser._detect_query_language`, `QueryTextAnalysisCache._should_use_model_tokenizer`, the non-`zh`/`en` branch of `KeywordExtractor.extract_keywords`, and whether the ES side should generate a `keywords_query` for the new language.
diff --git a/scripts/es_debug_search.py b/scripts/es_debug_search.py
deleted file mode 100644
index e8e4989..0000000
--- a/scripts/es_debug_search.py
+++ /dev/null
@@ -1,543 +0,0 @@
-#!/usr/bin/env python3
-"""
-Interactive Elasticsearch debug search (standalone; not part of the main API).
-
-Flow: query → mode 1–5 → 选择显示列 (默认全选 title.zh/en, qanchors.zh/en, tags) → 条数 → 表格。
-
-文本检索 (1–3) 使用 ES highlight,终端内红色 (ANSI) 标记匹配片段。
-
-mode 5(image_embedding):图片 URL/本地路径走 POST /embed/image(6008);纯文本走 clip-as-service
-gRPC(与 `embedding.image_backends.clip_as_service` 一致),不下载本地 CN-CLIP。若仅配置
-local_cnclip,请改用 clip_as_service 或只输入图片 URL。
-
-Usage:
-    source activate.sh
-    python scripts/es_debug_search.py [--tenant-id ID] [--index NAME]
-"""
-
-from __future__ import annotations
-
-import argparse
-import curses
-import re
-import shutil
-import sys
-from pathlib import Path
-from typing import Any, Callable, Dict, List, Sequence, Tuple
-
-PROJECT_ROOT = Path(__file__).resolve().parents[1]
-if str(PROJECT_ROOT) not in sys.path:
-    sys.path.insert(0, str(PROJECT_ROOT))
-
-OPTIONS: Sequence[tuple[str, str]] = (
-    ("title", "title.zh / title.en"),
-    ("qanchors", "qanchors.zh / qanchors.en"),
-    ("tags", "tags (keyword)"),
-    ("title_embedding", "KNN: title_embedding (text service)"),
-    ("image_embedding", "KNN: image_embedding (HTTP 图 / grpc 文本)"),
-)
-
-# 列定义:(列 id, 表头短名)
-COLUMN_DEFS: Tuple[Tuple[str, str], ...] 
= ( - ("title.zh", "title.zh"), - ("title.en", "title.en"), - ("qanchors.zh", "qanchors.zh"), - ("qanchors.en", "qanchors.en"), - ("tags", "tags"), -) - -# 文本检索模式使用的 highlight 字段 -HIGHLIGHT_FIELDS_BY_MODE: Dict[int, List[str]] = { - 1: ["title.zh", "title.en"], - 2: ["qanchors.zh", "qanchors.en"], - 3: ["tags"], -} - -ANSI_RE = re.compile(r"\x1b\[[0-9;]*m") - - -def _strip_ansi(s: str) -> str: - return ANSI_RE.sub("", s) - - -def _visible_len(s: str) -> int: - return len(_strip_ansi(s)) - - -def _truncate(s: str, max_len: int) -> str: - if max_len <= 0: - return "" - if _visible_len(s) <= max_len: - return s - # 在纯文本长度上截断(忽略 ANSI 近似按字符截断) - plain = _strip_ansi(s) - if len(plain) <= max_len: - return s - return plain[: max_len - 1] + "…" - - -def _lang_field(source: Dict[str, Any], obj_key: str, lang: str) -> str: - obj = source.get(obj_key) - if isinstance(obj, dict): - return str(obj.get(lang) or "").strip() - if obj is None: - return "" - return str(obj).strip() - - -def _tags_str(source: Dict[str, Any]) -> str: - raw = source.get("tags") - if raw is None: - return "" - if isinstance(raw, list): - return ", ".join(str(x) for x in raw if x is not None) - return str(raw).strip() - - -def _cell_from_hit(hit: Dict[str, Any], field_id: str, source: Dict[str, Any]) -> str: - """优先使用 highlight,否则 _source。""" - hl = hit.get("highlight") or {} - if field_id in hl: - parts = hl[field_id] - if isinstance(parts, list): - if field_id == "tags": - return ", ".join(parts) - return parts[0] if parts else "" - return str(parts) - if field_id == "title.zh": - return _lang_field(source, "title", "zh") - if field_id == "title.en": - return _lang_field(source, "title", "en") - if field_id == "qanchors.zh": - return _lang_field(source, "qanchors", "zh") - if field_id == "qanchors.en": - return _lang_field(source, "qanchors", "en") - if field_id == "tags": - return _tags_str(source) - return "" - - -def _highlight_clause(field_names: Sequence[str]) -> Dict[str, Any]: - return { - 
"require_field_match": True, - "pre_tags": ["\x1b[31m"], - "post_tags": ["\x1b[0m"], - "fields": { - f: {"number_of_fragments": 0, "fragment_size": 8000} for f in field_names - }, - } - - -def _source_includes() -> List[str]: - return ["title", "qanchors", "tags", "spu_id"] - - -def _select_mode_curses() -> int: - labels = [f"{key} — {desc}" for key, desc in OPTIONS] - - def run(stdscr: Any) -> int: - curses.curs_set(0) - stdscr.keypad(True) - current = 0 - while True: - stdscr.erase() - stdscr.addstr( - 0, - 0, - "选择模式 (↑↓ 移动, Enter 确认; 默认第一项 title)", - curses.A_BOLD, - ) - for i, line in enumerate(labels): - attr = curses.A_REVERSE if i == current else curses.A_NORMAL - prefix = ">" if i == current else " " - stdscr.addstr(2 + i, 0, f"{prefix} {i + 1}. {line}", attr) - stdscr.refresh() - ch = stdscr.getch() - if ch in (curses.KEY_UP, ord("k")): - current = (current - 1) % len(labels) - elif ch in (curses.KEY_DOWN, ord("j")): - current = (current + 1) % len(labels) - elif ch in (10, 13): - return current + 1 - elif ch in (27,): - return 1 - - return int(curses.wrapper(run)) - - -def _select_mode_fallback() -> int: - print("选择模式 (直接回车 = 1 title):") - for i, (_k, desc) in enumerate(OPTIONS, 1): - print(f" {i}. 
{desc}") - raw = input("编号 [1]: ").strip() or "1" - try: - n = int(raw) - except ValueError: - n = 1 - return max(1, min(n, len(OPTIONS))) - - -def _select_mode() -> int: - if not sys.stdin.isatty(): - return _select_mode_fallback() - try: - return _select_mode_curses() - except Exception: - return _select_mode_fallback() - - -def _select_fields_curses() -> List[str]: - """返回选中的列 id 列表(顺序与 COLUMN_DEFS 一致)。""" - ids = [c[0] for c in COLUMN_DEFS] - labels = [c[1] for c in COLUMN_DEFS] - selected = [True] * len(ids) - - def run(stdscr: Any) -> List[str]: - curses.curs_set(0) - stdscr.keypad(True) - cur = 0 - while True: - stdscr.erase() - stdscr.addstr( - 0, - 0, - "选择显示列 (空格切换, Enter 确认; 默认全选)", - curses.A_BOLD, - ) - stdscr.addstr(1, 0, "a: 全选 / n: 全不选", curses.A_DIM) - row = 3 - for i, lab in enumerate(labels): - mark = "[x]" if selected[i] else "[ ]" - attr = curses.A_REVERSE if i == cur else curses.A_NORMAL - stdscr.addstr(row + i, 0, f"{mark} {lab}", attr) - stdscr.refresh() - ch = stdscr.getch() - if ch in (curses.KEY_UP, ord("k")): - cur = (cur - 1) % len(ids) - elif ch in (curses.KEY_DOWN, ord("j")): - cur = (cur + 1) % len(ids) - elif ch in (32,): # space - selected[cur] = not selected[cur] - elif ch in (ord("a"), ord("A")): - for j in range(len(selected)): - selected[j] = True - elif ch in (ord("n"), ord("N")): - for j in range(len(selected)): - selected[j] = False - elif ch in (10, 13): - if not any(selected): - for j in range(len(selected)): - selected[j] = True - return [ids[i] for i in range(len(ids)) if selected[i]] - elif ch in (27,): - return list(ids) - - return curses.wrapper(run) - - -def _select_fields_fallback() -> List[str]: - print("显示列 (编号 1-5 逗号分隔; 回车=全选):") - for i, (cid, lab) in enumerate(COLUMN_DEFS, 1): - print(f" {i}. 
{lab}") - raw = input("列 [1,2,3,4,5]: ").strip() - if not raw: - return [c[0] for c in COLUMN_DEFS] - out: List[str] = [] - for part in raw.replace(",", ",").split(","): - part = part.strip() - if not part: - continue - try: - n = int(part) - except ValueError: - continue - if 1 <= n <= len(COLUMN_DEFS): - cid = COLUMN_DEFS[n - 1][0] - if cid not in out: - out.append(cid) - return out if out else [c[0] for c in COLUMN_DEFS] - - -def _select_fields() -> List[str]: - if not sys.stdin.isatty(): - return _select_fields_fallback() - try: - return _select_fields_curses() - except Exception: - return _select_fields_fallback() - - -def _ordered_columns(selected: List[str]) -> List[str]: - """按 COLUMN_DEFS 顺序输出选中的列。""" - id_set = set(selected) - return [c[0] for c in COLUMN_DEFS if c[0] in id_set] - - -def _run_es( - es: Any, - index_name: str, - body: Dict[str, Any], - size: int, -) -> List[Dict[str, Any]]: - # Avoid passing size= alongside body= (deprecated in elasticsearch-py). - payload = {**body, "size": size} - resp = es.search(index=index_name, body=payload) - if hasattr(resp, "body"): - payload = resp.body - else: - payload = dict(resp) if not isinstance(resp, dict) else resp - hits = (payload.get("hits") or {}).get("hits") or [] - return hits - - -def _print_table( - hits: List[Dict[str, Any]], - columns: List[str], - *, - term_width: int, -) -> None: - """简单 Unicode 表格:#、doc_id、所选列。""" - if not columns: - columns = [c[0] for c in COLUMN_DEFS] - - headers = ["#", "doc_id"] + [next(h for cid, h in COLUMN_DEFS if cid == col) for col in columns] - - rows: List[List[str]] = [] - for i, hit in enumerate(hits, 1): - sid = str(hit.get("_id", "")) - src = hit.get("_source") or {} - cells = [str(i), sid] - for col in columns: - cells.append(_cell_from_hit(hit, col, src)) - rows.append(cells) - - # 列宽:总宽减去边框与分隔符 - ncols = len(headers) - inner = max(term_width - 3 * (ncols - 1) - 4, 40) - base = max(6, inner // ncols) - col_widths = [ - min(5, base) if j == 0 else (min(26, 
max(12, base)) if j == 1 else base) - for j in range(ncols) - ] - w_rem = max(0, inner - col_widths[0] - col_widths[1]) - rest = ncols - 2 - if rest > 0: - per = max(10, w_rem // rest) - for j in range(2, ncols): - col_widths[j] = per - - # 顶线 - top = "┌" + "┬".join("─" * (w + 2) for w in col_widths) + "┐" - mid = "├" + "┼".join("─" * (w + 2) for w in col_widths) + "┤" - bot = "└" + "┴".join("─" * (w + 2) for w in col_widths) + "┘" - - def fmt_row(cells: List[str]) -> str: - out = [] - for j, (cell, w) in enumerate(zip(cells, col_widths)): - t = _truncate(cell.replace("\n", " "), w) - pad = w - _visible_len(t) - if pad < 0: - pad = 0 - out.append(" " + t + " " * pad + " ") - return "│" + "│".join(out) + "│" - - print(top) - print(fmt_row(headers)) - print(mid) - for row in rows: - print(fmt_row(row)) - print(bot) - - -def _build_body_title(query: str) -> Dict[str, Any]: - return { - "query": { - "multi_match": { - "query": query, - "fields": ["title.zh", "title.en"], - "type": "best_fields", - } - }, - "_source": _source_includes(), - "highlight": _highlight_clause(HIGHLIGHT_FIELDS_BY_MODE[1]), - } - - -def _build_body_qanchors(query: str) -> Dict[str, Any]: - return { - "query": { - "multi_match": { - "query": query, - "fields": ["qanchors.zh", "qanchors.en"], - "type": "best_fields", - } - }, - "_source": _source_includes(), - "highlight": _highlight_clause(HIGHLIGHT_FIELDS_BY_MODE[2]), - } - - -def _build_body_tags(query: str) -> Dict[str, Any]: - return { - "query": { - "bool": { - "should": [ - {"term": {"tags": query}}, - { - "wildcard": { - "tags": {"value": f"*{query}*", "case_insensitive": True} - } - }, - ], - "minimum_should_match": 1, - } - }, - "_source": _source_includes(), - "highlight": _highlight_clause(HIGHLIGHT_FIELDS_BY_MODE[3]), - } - - -def _looks_like_image_ref(url: str) -> bool: - """HTTP(S) URL、// URL、或存在的本地文件路径。""" - import os - - s = url.strip() - if not s: - return False - sl = s.lower() - if sl.startswith(("http://", "https://", "//")): 
- return True - if os.path.isfile(s): - return True - return False - - -def _encode_clip_query_vector(query: str) -> List[float]: - """ - 与索引中 image_embedding 同空间:图走 ``POST /embed/image``;文本走 ``POST /embed/clip_text``(6008)。 - """ - import numpy as np - - q = (query or "").strip() - if not q: - raise ValueError("empty query") - - from embeddings.image_encoder import CLIPImageEncoder - - enc = CLIPImageEncoder() - if _looks_like_image_ref(q): - vec = enc.encode_image_from_url(q, normalize_embeddings=True, priority=1) - else: - vec = enc.encode_clip_text(q, normalize_embeddings=True, priority=1) - return vec.astype(np.float32).flatten().tolist() - - -def search_title_knn(es: Any, index_name: str, query: str, size: int) -> List[Dict[str, Any]]: - from embeddings.text_encoder import TextEmbeddingEncoder - - enc = TextEmbeddingEncoder() - arr = enc.encode(query, normalize_embeddings=True) - vec = arr[0] - if vec is None: - raise RuntimeError("text embedding service returned no vector") - qv = vec.astype("float32").flatten().tolist() - num_cand = max(size * 10, 100) - body: Dict[str, Any] = { - "knn": { - "field": "title_embedding", - "query_vector": qv, - "k": size, - "num_candidates": num_cand, - }, - "_source": _source_includes(), - } - return _run_es(es, index_name, body, size) - - -def search_image_knn(es: Any, index_name: str, query: str, size: int) -> List[Dict[str, Any]]: - qv = _encode_clip_query_vector(query) - num_cand = max(size * 10, 100) - field = "image_embedding.vector" - body: Dict[str, Any] = { - "knn": { - "field": field, - "query_vector": qv, - "k": size, - "num_candidates": num_cand, - }, - "_source": _source_includes(), - } - return _run_es(es, index_name, body, size) - - -def main() -> None: - parser = argparse.ArgumentParser(description="Interactive ES debug search") - parser.add_argument( - "--tenant-id", - default=None, - help="Tenant id for index name search_products_tenant_{id} (default: env TENANT_ID or 170)", - ) - parser.add_argument( - 
"--index", - default=None, - help="Override full index name (skips tenant-based naming)", - ) - args = parser.parse_args() - - tenant = args.tenant_id or __import__("os").environ.get("TENANT_ID") or "170" - - from indexer.mapping_generator import get_tenant_index_name - from utils.es_client import get_es_client_from_env - - index_name = args.index or get_tenant_index_name(str(tenant)) - es = get_es_client_from_env().client - - dispatch: Dict[int, Callable[..., List[Dict[str, Any]]]] = { - 1: lambda e, idx, q, s: _run_es(e, idx, _build_body_title(q), s), - 2: lambda e, idx, q, s: _run_es(e, idx, _build_body_qanchors(q), s), - 3: lambda e, idx, q, s: _run_es(e, idx, _build_body_tags(q), s), - 4: search_title_knn, - 5: search_image_knn, - } - - term_w = shutil.get_terminal_size((100, 24)).columns - - print(f"索引: {index_name} (Ctrl+D 退出)\n") - while True: - try: - query = input("query> ").strip() - except EOFError: - print() - break - if not query: - continue - - mode = _select_mode() - fn = dispatch.get(mode, dispatch[1]) - - cols = _select_fields() - cols = _ordered_columns(cols) - - try: - raw_size = input("条数 [20]: ").strip() or "20" - size = max(1, int(raw_size)) - except EOFError: - print() - break - except ValueError: - size = 20 - - term_w = shutil.get_terminal_size((100, 24)).columns - print(f"--- mode={mode} ({OPTIONS[mode - 1][0]}) columns={','.join(cols)} size={size} ---") - try: - hits = fn(es, index_name, query, size) - if not hits: - print("(无命中)") - else: - _print_table(hits, cols, term_width=term_w) - except Exception as e: - print(f"错误: {e}", file=sys.stderr) - - -if __name__ == "__main__": - main() diff --git a/scripts/eval_search_quality.py b/scripts/eval_search_quality.py deleted file mode 100644 index 217776d..0000000 --- a/scripts/eval_search_quality.py +++ /dev/null @@ -1,235 +0,0 @@ -#!/usr/bin/env python3 -""" -Run search quality evaluation against real tenant indexes and emit JSON/Markdown reports. 
- -Usage: - source activate.sh - python scripts/eval_search_quality.py -""" - -from __future__ import annotations - -import json -import sys -from dataclasses import asdict, dataclass -from datetime import datetime, timezone -from pathlib import Path -from typing import Any, Dict, List - -PROJECT_ROOT = Path(__file__).resolve().parents[1] -if str(PROJECT_ROOT) not in sys.path: - sys.path.insert(0, str(PROJECT_ROOT)) - -from api.app import get_searcher, init_service -from context import create_request_context - - -DEFAULT_QUERIES_BY_TENANT: Dict[str, List[str]] = { - "0": [ - "连衣裙", - "dress", - "dress 连衣裙", - "maxi dress 长裙", - "波西米亚连衣裙", - "T恤", - "graphic tee 图案T恤", - "shirt", - "礼服衬衫", - "hoodie 卫衣", - "连帽卫衣", - "sweatshirt", - "牛仔裤", - "jeans", - "阔腿牛仔裤", - "毛衣 sweater", - "cardigan 开衫", - "jacket 外套", - "puffer jacket 羽绒服", - "飞行员夹克", - ], - "162": [ - "连衣裙", - "dress", - "dress 连衣裙", - "T恤", - "shirt", - "hoodie 卫衣", - "牛仔裤", - "jeans", - "毛衣 sweater", - "jacket 外套", - "娃娃衣服", - "芭比裙子", - "连衣短裙芭比", - "公主大裙", - "晚礼服芭比", - "毛衣熊", - "服饰饰品", - "鞋子", - "军人套", - "陆军套", - ], -} - - -@dataclass -class RankedItem: - rank: int - spu_id: str - title: str - vendor: str - es_score: float | None - rerank_score: float | None - text_score: float | None - text_source_score: float | None - text_translation_score: float | None - text_primary_score: float | None - text_support_score: float | None - knn_score: float | None - fused_score: float | None - matched_queries: Any - - -def _pick_text(value: Any, language: str = "zh") -> str: - if value is None: - return "" - if isinstance(value, dict): - return str(value.get(language) or value.get("zh") or value.get("en") or "").strip() - return str(value).strip() - - -def _to_float(value: Any) -> float | None: - try: - if value is None: - return None - return float(value) - except (TypeError, ValueError): - return None - - -def _evaluate_query(searcher, tenant_id: str, query: str) -> Dict[str, Any]: - context = create_request_context( - 
reqid=f"eval-{tenant_id}-{abs(hash(query)) % 1000000}", - uid="codex", - ) - result = searcher.search( - query=query, - tenant_id=tenant_id, - size=20, - from_=0, - context=context, - debug=True, - language="zh", - enable_rerank=True, - ) - - per_result_debug = ((result.debug_info or {}).get("per_result") or []) - debug_by_spu_id = { - str(item.get("spu_id")): item - for item in per_result_debug - if isinstance(item, dict) and item.get("spu_id") is not None - } - - ranked_items: List[RankedItem] = [] - for rank, spu in enumerate(result.results[:20], 1): - spu_id = str(getattr(spu, "spu_id", "")) - debug_item = debug_by_spu_id.get(spu_id, {}) - ranked_items.append( - RankedItem( - rank=rank, - spu_id=spu_id, - title=_pick_text(getattr(spu, "title", None), language="zh"), - vendor=_pick_text(getattr(spu, "vendor", None), language="zh"), - es_score=_to_float(debug_item.get("es_score")), - rerank_score=_to_float(debug_item.get("rerank_score")), - text_score=_to_float(debug_item.get("text_score")), - text_source_score=_to_float(debug_item.get("text_source_score")), - text_translation_score=_to_float(debug_item.get("text_translation_score")), - text_primary_score=_to_float(debug_item.get("text_primary_score")), - text_support_score=_to_float(debug_item.get("text_support_score")), - knn_score=_to_float(debug_item.get("knn_score")), - fused_score=_to_float(debug_item.get("fused_score")), - matched_queries=debug_item.get("matched_queries"), - ) - ) - - return { - "query": query, - "tenant_id": tenant_id, - "total": result.total, - "max_score": result.max_score, - "took_ms": result.took_ms, - "query_analysis": ((result.debug_info or {}).get("query_analysis") or {}), - "stage_timings": ((result.debug_info or {}).get("stage_timings") or {}), - "top20": [asdict(item) for item in ranked_items], - } - - -def _render_markdown(report: Dict[str, Any]) -> str: - lines: List[str] = [] - lines.append(f"# Search Quality Evaluation") - lines.append("") - lines.append(f"- Generated at: 
{report['generated_at']}") - lines.append(f"- Queries per tenant: {report['queries_per_tenant']}") - lines.append("") - for tenant_id, entries in report["tenants"].items(): - lines.append(f"## Tenant {tenant_id}") - lines.append("") - for entry in entries: - qa = entry.get("query_analysis") or {} - lines.append(f"### Query: {entry['query']}") - lines.append("") - lines.append( - f"- total={entry['total']} max_score={entry['max_score']:.6f} took_ms={entry['took_ms']}" - ) - lines.append( - f"- detected_language={qa.get('detected_language')} translations={qa.get('translations')}" - ) - lines.append("") - lines.append("| rank | spu_id | title | fused | rerank | text | text_src | text_trans | knn | es | matched_queries |") - lines.append("| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |") - for item in entry.get("top20", []): - title = str(item.get("title", "")).replace("|", "/") - matched = json.dumps(item.get("matched_queries"), ensure_ascii=False) - matched = matched.replace("|", "/") - lines.append( - f"| {item.get('rank')} | {item.get('spu_id')} | {title} | " - f"{item.get('fused_score')} | {item.get('rerank_score')} | {item.get('text_score')} | " - f"{item.get('text_source_score')} | {item.get('text_translation_score')} | " - f"{item.get('knn_score')} | {item.get('es_score')} | {matched} |" - ) - lines.append("") - return "\n".join(lines) - - -def main() -> None: - init_service("http://localhost:9200") - searcher = get_searcher() - - tenants_report: Dict[str, List[Dict[str, Any]]] = {} - for tenant_id, queries in DEFAULT_QUERIES_BY_TENANT.items(): - tenant_entries: List[Dict[str, Any]] = [] - for query in queries: - print(f"[eval] tenant={tenant_id} query={query}") - tenant_entries.append(_evaluate_query(searcher, tenant_id, query)) - tenants_report[tenant_id] = tenant_entries - - report = { - "generated_at": datetime.now(timezone.utc).isoformat(), - "queries_per_tenant": {tenant: len(queries) for tenant, queries in 
DEFAULT_QUERIES_BY_TENANT.items()},
-        "tenants": tenants_report,
-    }
-
-    out_dir = Path("artifacts/search_eval")
-    out_dir.mkdir(parents=True, exist_ok=True)
-    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
-    json_path = out_dir / f"search_eval_{timestamp}.json"
-    md_path = out_dir / f"search_eval_{timestamp}.md"
-    json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
-    md_path.write_text(_render_markdown(report), encoding="utf-8")
-    print(f"[done] json={json_path}")
-    print(f"[done] md={md_path}")
-
-
-if __name__ == "__main__":
-    main()
diff --git a/scripts/evaluation/es_debug_search.py b/scripts/evaluation/es_debug_search.py
new file mode 100644
index 0000000..e8e4989
--- /dev/null
+++ b/scripts/evaluation/es_debug_search.py
@@ -0,0 +1,543 @@
+#!/usr/bin/env python3
+"""
+Interactive Elasticsearch debug search (standalone; not part of the main API).
+
+Flow: query → mode 1–5 → 选择显示列 (默认全选 title.zh/en, qanchors.zh/en, tags) → 条数 → 表格。
+
+文本检索 (1–3) 使用 ES highlight,终端内红色 (ANSI) 标记匹配片段。
+
+mode 5(image_embedding):图片 URL/本地路径走 POST /embed/image(6008);纯文本走 clip-as-service
+gRPC(与 `embedding.image_backends.clip_as_service` 一致),不下载本地 CN-CLIP。若仅配置
+local_cnclip,请改用 clip_as_service 或只输入图片 URL。
+
+Usage:
+    source activate.sh
+    python scripts/evaluation/es_debug_search.py [--tenant-id ID] [--index NAME]
+"""
+
+from __future__ import annotations
+
+import argparse
+import curses
+import re
+import shutil
+import sys
+from pathlib import Path
+from typing import Any, Callable, Dict, List, Sequence, Tuple
+
+PROJECT_ROOT = Path(__file__).resolve().parents[2]
+if str(PROJECT_ROOT) not in sys.path:
+    sys.path.insert(0, str(PROJECT_ROOT))
+
+OPTIONS: Sequence[tuple[str, str]] = (
+    ("title", "title.zh / title.en"),
+    ("qanchors", "qanchors.zh / qanchors.en"),
+    ("tags", "tags (keyword)"),
+    ("title_embedding", "KNN: title_embedding (text service)"),
+    ("image_embedding", "KNN: image_embedding (HTTP 图 / grpc 文本)"),
+)
+
+# 列定义:(列 id, 
表头短名) +COLUMN_DEFS: Tuple[Tuple[str, str], ...] = ( + ("title.zh", "title.zh"), + ("title.en", "title.en"), + ("qanchors.zh", "qanchors.zh"), + ("qanchors.en", "qanchors.en"), + ("tags", "tags"), +) + +# 文本检索模式使用的 highlight 字段 +HIGHLIGHT_FIELDS_BY_MODE: Dict[int, List[str]] = { + 1: ["title.zh", "title.en"], + 2: ["qanchors.zh", "qanchors.en"], + 3: ["tags"], +} + +ANSI_RE = re.compile(r"\x1b\[[0-9;]*m") + + +def _strip_ansi(s: str) -> str: + return ANSI_RE.sub("", s) + + +def _visible_len(s: str) -> int: + return len(_strip_ansi(s)) + + +def _truncate(s: str, max_len: int) -> str: + if max_len <= 0: + return "" + if _visible_len(s) <= max_len: + return s + # 在纯文本长度上截断(忽略 ANSI 近似按字符截断) + plain = _strip_ansi(s) + if len(plain) <= max_len: + return s + return plain[: max_len - 1] + "…" + + +def _lang_field(source: Dict[str, Any], obj_key: str, lang: str) -> str: + obj = source.get(obj_key) + if isinstance(obj, dict): + return str(obj.get(lang) or "").strip() + if obj is None: + return "" + return str(obj).strip() + + +def _tags_str(source: Dict[str, Any]) -> str: + raw = source.get("tags") + if raw is None: + return "" + if isinstance(raw, list): + return ", ".join(str(x) for x in raw if x is not None) + return str(raw).strip() + + +def _cell_from_hit(hit: Dict[str, Any], field_id: str, source: Dict[str, Any]) -> str: + """优先使用 highlight,否则 _source。""" + hl = hit.get("highlight") or {} + if field_id in hl: + parts = hl[field_id] + if isinstance(parts, list): + if field_id == "tags": + return ", ".join(parts) + return parts[0] if parts else "" + return str(parts) + if field_id == "title.zh": + return _lang_field(source, "title", "zh") + if field_id == "title.en": + return _lang_field(source, "title", "en") + if field_id == "qanchors.zh": + return _lang_field(source, "qanchors", "zh") + if field_id == "qanchors.en": + return _lang_field(source, "qanchors", "en") + if field_id == "tags": + return _tags_str(source) + return "" + + +def _highlight_clause(field_names: 
Sequence[str]) -> Dict[str, Any]: + return { + "require_field_match": True, + "pre_tags": ["\x1b[31m"], + "post_tags": ["\x1b[0m"], + "fields": { + f: {"number_of_fragments": 0, "fragment_size": 8000} for f in field_names + }, + } + + +def _source_includes() -> List[str]: + return ["title", "qanchors", "tags", "spu_id"] + + +def _select_mode_curses() -> int: + labels = [f"{key} — {desc}" for key, desc in OPTIONS] + + def run(stdscr: Any) -> int: + curses.curs_set(0) + stdscr.keypad(True) + current = 0 + while True: + stdscr.erase() + stdscr.addstr( + 0, + 0, + "选择模式 (↑↓ 移动, Enter 确认; 默认第一项 title)", + curses.A_BOLD, + ) + for i, line in enumerate(labels): + attr = curses.A_REVERSE if i == current else curses.A_NORMAL + prefix = ">" if i == current else " " + stdscr.addstr(2 + i, 0, f"{prefix} {i + 1}. {line}", attr) + stdscr.refresh() + ch = stdscr.getch() + if ch in (curses.KEY_UP, ord("k")): + current = (current - 1) % len(labels) + elif ch in (curses.KEY_DOWN, ord("j")): + current = (current + 1) % len(labels) + elif ch in (10, 13): + return current + 1 + elif ch in (27,): + return 1 + + return int(curses.wrapper(run)) + + +def _select_mode_fallback() -> int: + print("选择模式 (直接回车 = 1 title):") + for i, (_k, desc) in enumerate(OPTIONS, 1): + print(f" {i}. 
{desc}") + raw = input("编号 [1]: ").strip() or "1" + try: + n = int(raw) + except ValueError: + n = 1 + return max(1, min(n, len(OPTIONS))) + + +def _select_mode() -> int: + if not sys.stdin.isatty(): + return _select_mode_fallback() + try: + return _select_mode_curses() + except Exception: + return _select_mode_fallback() + + +def _select_fields_curses() -> List[str]: + """返回选中的列 id 列表(顺序与 COLUMN_DEFS 一致)。""" + ids = [c[0] for c in COLUMN_DEFS] + labels = [c[1] for c in COLUMN_DEFS] + selected = [True] * len(ids) + + def run(stdscr: Any) -> List[str]: + curses.curs_set(0) + stdscr.keypad(True) + cur = 0 + while True: + stdscr.erase() + stdscr.addstr( + 0, + 0, + "选择显示列 (空格切换, Enter 确认; 默认全选)", + curses.A_BOLD, + ) + stdscr.addstr(1, 0, "a: 全选 / n: 全不选", curses.A_DIM) + row = 3 + for i, lab in enumerate(labels): + mark = "[x]" if selected[i] else "[ ]" + attr = curses.A_REVERSE if i == cur else curses.A_NORMAL + stdscr.addstr(row + i, 0, f"{mark} {lab}", attr) + stdscr.refresh() + ch = stdscr.getch() + if ch in (curses.KEY_UP, ord("k")): + cur = (cur - 1) % len(ids) + elif ch in (curses.KEY_DOWN, ord("j")): + cur = (cur + 1) % len(ids) + elif ch in (32,): # space + selected[cur] = not selected[cur] + elif ch in (ord("a"), ord("A")): + for j in range(len(selected)): + selected[j] = True + elif ch in (ord("n"), ord("N")): + for j in range(len(selected)): + selected[j] = False + elif ch in (10, 13): + if not any(selected): + for j in range(len(selected)): + selected[j] = True + return [ids[i] for i in range(len(ids)) if selected[i]] + elif ch in (27,): + return list(ids) + + return curses.wrapper(run) + + +def _select_fields_fallback() -> List[str]: + print("显示列 (编号 1-5 逗号分隔; 回车=全选):") + for i, (cid, lab) in enumerate(COLUMN_DEFS, 1): + print(f" {i}. 
{lab}") + raw = input("列 [1,2,3,4,5]: ").strip() + if not raw: + return [c[0] for c in COLUMN_DEFS] + out: List[str] = [] + for part in raw.replace(",", ",").split(","): + part = part.strip() + if not part: + continue + try: + n = int(part) + except ValueError: + continue + if 1 <= n <= len(COLUMN_DEFS): + cid = COLUMN_DEFS[n - 1][0] + if cid not in out: + out.append(cid) + return out if out else [c[0] for c in COLUMN_DEFS] + + +def _select_fields() -> List[str]: + if not sys.stdin.isatty(): + return _select_fields_fallback() + try: + return _select_fields_curses() + except Exception: + return _select_fields_fallback() + + +def _ordered_columns(selected: List[str]) -> List[str]: + """按 COLUMN_DEFS 顺序输出选中的列。""" + id_set = set(selected) + return [c[0] for c in COLUMN_DEFS if c[0] in id_set] + + +def _run_es( + es: Any, + index_name: str, + body: Dict[str, Any], + size: int, +) -> List[Dict[str, Any]]: + # Avoid passing size= alongside body= (deprecated in elasticsearch-py). + payload = {**body, "size": size} + resp = es.search(index=index_name, body=payload) + if hasattr(resp, "body"): + payload = resp.body + else: + payload = dict(resp) if not isinstance(resp, dict) else resp + hits = (payload.get("hits") or {}).get("hits") or [] + return hits + + +def _print_table( + hits: List[Dict[str, Any]], + columns: List[str], + *, + term_width: int, +) -> None: + """简单 Unicode 表格:#、doc_id、所选列。""" + if not columns: + columns = [c[0] for c in COLUMN_DEFS] + + headers = ["#", "doc_id"] + [next(h for cid, h in COLUMN_DEFS if cid == col) for col in columns] + + rows: List[List[str]] = [] + for i, hit in enumerate(hits, 1): + sid = str(hit.get("_id", "")) + src = hit.get("_source") or {} + cells = [str(i), sid] + for col in columns: + cells.append(_cell_from_hit(hit, col, src)) + rows.append(cells) + + # 列宽:总宽减去边框与分隔符 + ncols = len(headers) + inner = max(term_width - 3 * (ncols - 1) - 4, 40) + base = max(6, inner // ncols) + col_widths = [ + min(5, base) if j == 0 else (min(26, 
max(12, base)) if j == 1 else base) + for j in range(ncols) + ] + w_rem = max(0, inner - col_widths[0] - col_widths[1]) + rest = ncols - 2 + if rest > 0: + per = max(10, w_rem // rest) + for j in range(2, ncols): + col_widths[j] = per + + # 顶线 + top = "┌" + "┬".join("─" * (w + 2) for w in col_widths) + "┐" + mid = "├" + "┼".join("─" * (w + 2) for w in col_widths) + "┤" + bot = "└" + "┴".join("─" * (w + 2) for w in col_widths) + "┘" + + def fmt_row(cells: List[str]) -> str: + out = [] + for j, (cell, w) in enumerate(zip(cells, col_widths)): + t = _truncate(cell.replace("\n", " "), w) + pad = w - _visible_len(t) + if pad < 0: + pad = 0 + out.append(" " + t + " " * pad + " ") + return "│" + "│".join(out) + "│" + + print(top) + print(fmt_row(headers)) + print(mid) + for row in rows: + print(fmt_row(row)) + print(bot) + + +def _build_body_title(query: str) -> Dict[str, Any]: + return { + "query": { + "multi_match": { + "query": query, + "fields": ["title.zh", "title.en"], + "type": "best_fields", + } + }, + "_source": _source_includes(), + "highlight": _highlight_clause(HIGHLIGHT_FIELDS_BY_MODE[1]), + } + + +def _build_body_qanchors(query: str) -> Dict[str, Any]: + return { + "query": { + "multi_match": { + "query": query, + "fields": ["qanchors.zh", "qanchors.en"], + "type": "best_fields", + } + }, + "_source": _source_includes(), + "highlight": _highlight_clause(HIGHLIGHT_FIELDS_BY_MODE[2]), + } + + +def _build_body_tags(query: str) -> Dict[str, Any]: + return { + "query": { + "bool": { + "should": [ + {"term": {"tags": query}}, + { + "wildcard": { + "tags": {"value": f"*{query}*", "case_insensitive": True} + } + }, + ], + "minimum_should_match": 1, + } + }, + "_source": _source_includes(), + "highlight": _highlight_clause(HIGHLIGHT_FIELDS_BY_MODE[3]), + } + + +def _looks_like_image_ref(url: str) -> bool: + """HTTP(S) URL、// URL、或存在的本地文件路径。""" + import os + + s = url.strip() + if not s: + return False + sl = s.lower() + if sl.startswith(("http://", "https://", "//")): 
+ return True + if os.path.isfile(s): + return True + return False + + +def _encode_clip_query_vector(query: str) -> List[float]: + """ + 与索引中 image_embedding 同空间:图走 ``POST /embed/image``;文本走 ``POST /embed/clip_text``(6008)。 + """ + import numpy as np + + q = (query or "").strip() + if not q: + raise ValueError("empty query") + + from embeddings.image_encoder import CLIPImageEncoder + + enc = CLIPImageEncoder() + if _looks_like_image_ref(q): + vec = enc.encode_image_from_url(q, normalize_embeddings=True, priority=1) + else: + vec = enc.encode_clip_text(q, normalize_embeddings=True, priority=1) + return vec.astype(np.float32).flatten().tolist() + + +def search_title_knn(es: Any, index_name: str, query: str, size: int) -> List[Dict[str, Any]]: + from embeddings.text_encoder import TextEmbeddingEncoder + + enc = TextEmbeddingEncoder() + arr = enc.encode(query, normalize_embeddings=True) + vec = arr[0] + if vec is None: + raise RuntimeError("text embedding service returned no vector") + qv = vec.astype("float32").flatten().tolist() + num_cand = max(size * 10, 100) + body: Dict[str, Any] = { + "knn": { + "field": "title_embedding", + "query_vector": qv, + "k": size, + "num_candidates": num_cand, + }, + "_source": _source_includes(), + } + return _run_es(es, index_name, body, size) + + +def search_image_knn(es: Any, index_name: str, query: str, size: int) -> List[Dict[str, Any]]: + qv = _encode_clip_query_vector(query) + num_cand = max(size * 10, 100) + field = "image_embedding.vector" + body: Dict[str, Any] = { + "knn": { + "field": field, + "query_vector": qv, + "k": size, + "num_candidates": num_cand, + }, + "_source": _source_includes(), + } + return _run_es(es, index_name, body, size) + + +def main() -> None: + parser = argparse.ArgumentParser(description="Interactive ES debug search") + parser.add_argument( + "--tenant-id", + default=None, + help="Tenant id for index name search_products_tenant_{id} (default: env TENANT_ID or 170)", + ) + parser.add_argument( + 
"--index",
+        default=None,
+        help="Override full index name (skips tenant-based naming)",
+    )
+    args = parser.parse_args()
+
+    import os
+
+    tenant = args.tenant_id or os.environ.get("TENANT_ID") or "170"
+
+    from indexer.mapping_generator import get_tenant_index_name
+    from utils.es_client import get_es_client_from_env
+
+    index_name = args.index or get_tenant_index_name(str(tenant))
+    es = get_es_client_from_env().client
+
+    dispatch: Dict[int, Callable[..., List[Dict[str, Any]]]] = {
+        1: lambda e, idx, q, s: _run_es(e, idx, _build_body_title(q), s),
+        2: lambda e, idx, q, s: _run_es(e, idx, _build_body_qanchors(q), s),
+        3: lambda e, idx, q, s: _run_es(e, idx, _build_body_tags(q), s),
+        4: search_title_knn,
+        5: search_image_knn,
+    }
+
+    print(f"索引: {index_name} (Ctrl+D 退出)\n")
+    while True:
+        try:
+            query = input("query> ").strip()
+        except EOFError:
+            print()
+            break
+        if not query:
+            continue
+
+        mode = _select_mode()
+        fn = dispatch.get(mode, dispatch[1])
+
+        cols = _select_fields()
+        cols = _ordered_columns(cols)
+
+        try:
+            raw_size = input("条数 [20]: ").strip() or "20"
+            size = max(1, int(raw_size))
+        except EOFError:
+            print()
+            break
+        except ValueError:
+            size = 20
+
+        term_w = shutil.get_terminal_size((100, 24)).columns
+        print(f"--- mode={mode} ({OPTIONS[mode - 1][0]}) columns={','.join(cols)} size={size} ---")
+        try:
+            hits = fn(es, index_name, query, size)
+            if not hits:
+                print("(无命中)")
+            else:
+                _print_table(hits, cols, term_width=term_w)
+        except Exception as e:
+            print(f"错误: {e}", file=sys.stderr)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/evaluation/eval_search_quality.py b/scripts/evaluation/eval_search_quality.py
new file mode 100644
index 0000000..217776d
--- /dev/null
+++ b/scripts/evaluation/eval_search_quality.py
@@ -0,0 +1,235 @@
+#!/usr/bin/env python3
+"""
+Run search quality evaluation against real tenant indexes and emit JSON/Markdown reports.
+
+
+Usage:
+    source activate.sh
+    python scripts/evaluation/eval_search_quality.py
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from dataclasses import asdict, dataclass
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any, Dict, List
+
+# This file lives in scripts/evaluation/, so the project root is two levels up.
+PROJECT_ROOT = Path(__file__).resolve().parents[2]
+if str(PROJECT_ROOT) not in sys.path:
+    sys.path.insert(0, str(PROJECT_ROOT))
+
+from api.app import get_searcher, init_service
+from context import create_request_context
+
+
+DEFAULT_QUERIES_BY_TENANT: Dict[str, List[str]] = {
+    "0": [
+        "连衣裙",
+        "dress",
+        "dress 连衣裙",
+        "maxi dress 长裙",
+        "波西米亚连衣裙",
+        "T恤",
+        "graphic tee 图案T恤",
+        "shirt",
+        "礼服衬衫",
+        "hoodie 卫衣",
+        "连帽卫衣",
+        "sweatshirt",
+        "牛仔裤",
+        "jeans",
+        "阔腿牛仔裤",
+        "毛衣 sweater",
+        "cardigan 开衫",
+        "jacket 外套",
+        "puffer jacket 羽绒服",
+        "飞行员夹克",
+    ],
+    "162": [
+        "连衣裙",
+        "dress",
+        "dress 连衣裙",
+        "T恤",
+        "shirt",
+        "hoodie 卫衣",
+        "牛仔裤",
+        "jeans",
+        "毛衣 sweater",
+        "jacket 外套",
+        "娃娃衣服",
+        "芭比裙子",
+        "连衣短裙芭比",
+        "公主大裙",
+        "晚礼服芭比",
+        "毛衣熊",
+        "服饰饰品",
+        "鞋子",
+        "军人套",
+        "陆军套",
+    ],
+}
+
+
+@dataclass
+class RankedItem:
+    rank: int
+    spu_id: str
+    title: str
+    vendor: str
+    es_score: float | None
+    rerank_score: float | None
+    text_score: float | None
+    text_source_score: float | None
+    text_translation_score: float | None
+    text_primary_score: float | None
+    text_support_score: float | None
+    knn_score: float | None
+    fused_score: float | None
+    matched_queries: Any
+
+
+def _pick_text(value: Any, language: str = "zh") -> str:
+    if value is None:
+        return ""
+    if isinstance(value, dict):
+        return str(value.get(language) or value.get("zh") or value.get("en") or "").strip()
+    return str(value).strip()
+
+
+def _to_float(value: Any) -> float | None:
+    try:
+        if value is None:
+            return None
+        return float(value)
+    except (TypeError, ValueError):
+        return None
+
+
+def _evaluate_query(searcher, tenant_id: str, query: str) -> Dict[str, Any]:
+    context = create_request_context(
+        
reqid=f"eval-{tenant_id}-{abs(hash(query)) % 1000000}", + uid="codex", + ) + result = searcher.search( + query=query, + tenant_id=tenant_id, + size=20, + from_=0, + context=context, + debug=True, + language="zh", + enable_rerank=True, + ) + + per_result_debug = ((result.debug_info or {}).get("per_result") or []) + debug_by_spu_id = { + str(item.get("spu_id")): item + for item in per_result_debug + if isinstance(item, dict) and item.get("spu_id") is not None + } + + ranked_items: List[RankedItem] = [] + for rank, spu in enumerate(result.results[:20], 1): + spu_id = str(getattr(spu, "spu_id", "")) + debug_item = debug_by_spu_id.get(spu_id, {}) + ranked_items.append( + RankedItem( + rank=rank, + spu_id=spu_id, + title=_pick_text(getattr(spu, "title", None), language="zh"), + vendor=_pick_text(getattr(spu, "vendor", None), language="zh"), + es_score=_to_float(debug_item.get("es_score")), + rerank_score=_to_float(debug_item.get("rerank_score")), + text_score=_to_float(debug_item.get("text_score")), + text_source_score=_to_float(debug_item.get("text_source_score")), + text_translation_score=_to_float(debug_item.get("text_translation_score")), + text_primary_score=_to_float(debug_item.get("text_primary_score")), + text_support_score=_to_float(debug_item.get("text_support_score")), + knn_score=_to_float(debug_item.get("knn_score")), + fused_score=_to_float(debug_item.get("fused_score")), + matched_queries=debug_item.get("matched_queries"), + ) + ) + + return { + "query": query, + "tenant_id": tenant_id, + "total": result.total, + "max_score": result.max_score, + "took_ms": result.took_ms, + "query_analysis": ((result.debug_info or {}).get("query_analysis") or {}), + "stage_timings": ((result.debug_info or {}).get("stage_timings") or {}), + "top20": [asdict(item) for item in ranked_items], + } + + +def _render_markdown(report: Dict[str, Any]) -> str: + lines: List[str] = [] + lines.append(f"# Search Quality Evaluation") + lines.append("") + lines.append(f"- Generated at: 
{report['generated_at']}")
+    lines.append(f"- Queries per tenant: {report['queries_per_tenant']}")
+    lines.append("")
+    for tenant_id, entries in report["tenants"].items():
+        lines.append(f"## Tenant {tenant_id}")
+        lines.append("")
+        for entry in entries:
+            qa = entry.get("query_analysis") or {}
+            lines.append(f"### Query: {entry['query']}")
+            lines.append("")
+            # max_score is None when a query returns no hits; guard before float-formatting.
+            max_score = entry["max_score"]
+            max_score_text = f"{max_score:.6f}" if max_score is not None else "n/a"
+            lines.append(
+                f"- total={entry['total']} max_score={max_score_text} took_ms={entry['took_ms']}"
+            )
+            lines.append(
+                f"- detected_language={qa.get('detected_language')} translations={qa.get('translations')}"
+            )
+            lines.append("")
+            lines.append("| rank | spu_id | title | fused | rerank | text | text_src | text_trans | knn | es | matched_queries |")
+            lines.append("| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |")
+            for item in entry.get("top20", []):
+                title = str(item.get("title", "")).replace("|", "/")
+                matched = json.dumps(item.get("matched_queries"), ensure_ascii=False)
+                matched = matched.replace("|", "/")
+                lines.append(
+                    f"| {item.get('rank')} | {item.get('spu_id')} | {title} | "
+                    f"{item.get('fused_score')} | {item.get('rerank_score')} | {item.get('text_score')} | "
+                    f"{item.get('text_source_score')} | {item.get('text_translation_score')} | "
+                    f"{item.get('knn_score')} | {item.get('es_score')} | {matched} |"
+                )
+            lines.append("")
+    return "\n".join(lines)
+
+
+def main() -> None:
+    init_service("http://localhost:9200")
+    searcher = get_searcher()
+
+    tenants_report: Dict[str, List[Dict[str, Any]]] = {}
+    for tenant_id, queries in DEFAULT_QUERIES_BY_TENANT.items():
+        tenant_entries: List[Dict[str, Any]] = []
+        for query in queries:
+            print(f"[eval] tenant={tenant_id} query={query}")
+            tenant_entries.append(_evaluate_query(searcher, tenant_id, query))
+        tenants_report[tenant_id] = tenant_entries
+
+    report = {
+        "generated_at": datetime.now(timezone.utc).isoformat(),
+        "queries_per_tenant": {tenant: len(queries) for tenant, queries in 
DEFAULT_QUERIES_BY_TENANT.items()}, + "tenants": tenants_report, + } + + out_dir = Path("artifacts/search_eval") + out_dir.mkdir(parents=True, exist_ok=True) + timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ") + json_path = out_dir / f"search_eval_{timestamp}.json" + md_path = out_dir / f"search_eval_{timestamp}.md" + json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8") + md_path.write_text(_render_markdown(report), encoding="utf-8") + print(f"[done] json={json_path}") + print(f"[done] md={md_path}") + + +if __name__ == "__main__": + main() diff --git a/scripts/evaluation/ff.md b/scripts/evaluation/ff.md new file mode 100644 index 0000000..bdda782 --- /dev/null +++ b/scripts/evaluation/ff.md @@ -0,0 +1,22 @@ + + + +- R3 完全相关:该结果的核心意图被满足,标题/副标题/类目/属性不违背意图。 +- R2 部分相关:同品类或相近用途,但规格/材质/年龄段/场景等维度的要求有偏差。 +- R1 不相关:品类/用途不符,或明显错误/违禁/空结果。 + + +## 指标说明 + +- **相关性**:1=低,2=中,3=高。 +- **「仅 3 相关」**:只把打分 3 视为相关;**「2_3 相关」**:把 2 和 3 都视为相关。 + +| 指标 | 含义 | +|------|------| +| **P@5, P@10, P@20, P@50** | 前 K 个结果中「仅 3 相关」的精确率 | +| **P@5_2_3 ~ P@50_2_3** | 前 K 个结果中「2 和 3 都算相关」的精确率 | +| **MAP_3** | 仅 3 相关时的 Average Precision(单 query) | +| **MAP_2_3** | 2 和 3 都相关时的 Average Precision | +| **AUC_3** | 仅 3 相关、1/2 不相关时,随机相关项排在随机不相关项前的概率 | +| **AUC_2_3** | 2 和 3 相关、1 不相关时的同上 AUC | + diff --git a/scripts/evaluation/queries/fashion_quries__high_quality.txt.v2.uniq b/scripts/evaluation/queries/fashion_quries__high_quality.txt.v2.uniq new file mode 100644 index 0000000..0906ef4 --- /dev/null +++ b/scripts/evaluation/queries/fashion_quries__high_quality.txt.v2.uniq @@ -0,0 +1,1046 @@ +abstract print top +acrylic beanie hat +acrylic knit hat +airport outfit +a-line dress +a-line skirt +all-season travel pants +all-season vest lightweight +ankle boots +ankle length dress +ankle length jeans +ankle length maxi dress +ankle pants +ankle pants wrinkle-free +ankle socks +ankle socks white +anorak jacket +anti-chafing body glide stick +anti-chafing gym 
shorts +anti-chafing pants +anti-chafing running shorts +anti-chafing seamless yoga shorts +anti-chafing women's running shorts +arm sleeves +asymmetrical top +athleisure compression leggings +athleisure sports bra +athleisure track suit +athletic fit compression shorts +athletic fit running shorts +athletic gear +athletic shorts +athletic socks +athletic socks merino wool +athletic tank top +athletic tee +autumn blazer tweed +autumn trench coat +baby blanket +baby onesie +backpack +baggy jeans +balloon sleeve +balloon sleeve blouse +bamboo socks +basic cotton t-shirt +basic tank top ribbed +bath robe +bathrobe +bathrobe women +bath towel +beach cover up +beach towel +beach umbrella +beach vacation cover-up +beach vacation outfit +beach vacation swimsuit +beach wedding dress +beanie hat +bed sheets +beige trench coat +bell bottoms +belt +belted coat +belted jumpsuit +belted jumpsuit wide leg +belted jumpsuit with pockets +belted kimono robe +belted linen blend maxi dress +belted moto jacket +belted trench coat with pockets +belted waist trainer +bikini bottom +bishop sleeve +bishop sleeve top +black boots +black mini dress +black tie dress +blazer +blue denim jacket +boat neck sweater +bodycon dress +bodysuit +bohemian maxi dress +bohemian tassel maxi skirt +bohemian tunic cotton +bootcut corduroy pants +bootcut denim pants for riding boots +bootcut jeans +bootcut jeans retro +bootcut pants +bootcut pants for cowboy boots +bootcut pants for winter boots +bootcut velvet pants +bootcut yoga pants +boxer briefs +boyfriend jeans +bracelet set +breathable shorts +bridesmaid dress +brown leather boots +bucket hat +burgundy dress +burgundy holiday dress +burnt orange pants +business casual blazer +business casual chinos +business casual loafers +business casual women +business suit +business travel suit +butterfly hair clip +butterfly top +button down shirt +button-down shirt +cable knit sweater +camisole +cami top +canvas tote +cap +cap sleeve +cap sleeve top +cardigan 
+cargo pants +cargo pants hidden pocket +cargo skirt +cashmere blend cardigan +cashmere cardigan +cashmere scarf +cashmere sweater +cashmere sweater beige +casual day romper +casual day tank top +casual friday polo shirt +casual pants +casual polo shirt +casual weekend outfit +cat bed +charcoal gray coat +checkered skirt +chelsea boots +chiffon blouse +chiffon blouse sheer +chiffon dress +chiffon scarf lightweight +chinos +chocolate brown boots +chunky knit sweater +chunky platform boots +classic fit blazer +classic fit polo shirt +classic trench coat tall +classic wool coat +cleaning solution +closed toe heels +clutch bag +coachella outfit +coat +cobalt blue dress +cocktail dress +cold-shoulder blouse +cold weather insulated boots +cold weather wool socks +colorblock hoodie +commuter outfits +compression anti-chafing running leggings +compression arm sleeves +compression athletic gear +compression bra +compression gym workout running shorts +compression knee brace +compression leggings +compression shorts +compression stockings +compression tights +compression tights winter +compression top +Compression Top Spandex +compression yoga sports bra +concert outfit +convertible backpack travel bag +convertible dress +convertible hiking pants +convertible laptop bag +convertible neck pillow blanket +convertible pants +convertible sleeves shirt +convertible sleeves UPF 50 shirt +convertible zip-off hiking pants +corduroy blazer +corduroy fall pants +corduroy jacket +corduroy pants +corduroy skirt +corduroy skirt mini +corset top +corset top black velvet +cottagecore floral maxi dress +cottagecore knit cardigan +cotton flannel shirt +cotton-padded jacket +cotton pajamas +cotton tank top +cotton t-shirt +cotton t-shirt white +cotton tunic +couple outfits +cover-up +cowboy boots +cream cardigan +cream knit sweater +cream sweater +crew neck sweatshirt +crew neck tee +crochet top +crop hoodie +crop length hoodie +crop length sports bra +crop top +crossbody leather handbag 
+cruise outfits +curvy fit belt +curvy fit high-waisted jeans +curvy fit maxi dress +curvy fit skinny jeans +curvy fit straight leg jeans +curvy fit stretch jeans +curvy fit stretch pencil skirt +curvy fit trousers +curvy fit wide leg jeans +cushion insert +cut-out dress +cycling shorts +dark academia blazer +dark academia plaid trousers +dark academia wool coat +date night bodycon dress +date night dress +date night mini dress +Dating outfit +denim jacket +denim jacket distressed +denim jacket oversized +denim jeans +denim pants +denim shirt +denim shorts +denim skirt +denim skirt midi +distressed denim jacket +distressed jeans +ditsy floral dress +dog blanket +dog coat +dog crate mat +dog harness +dog leash +dog towel +dog toy +dog treat +double breasted coat +down jacket +down parka +down vest +dress +dress sandals +dress shirt +duffle bag +dusty rose dress +dusty rose silk slip dress +dusty rose velvet cocktail dress +dusty rose velvet ribbon +dusty rose velvet scrunchie +dusty rose velvet silk dress +dusty rose velvet slip skirt +early autumn corduroy jacket +earthy tones olive +earthy tones outfit +electric blue mini skirt +embroidered blouse +emerald green top +ergonomic office chair +evening formal gown +evening formal heels +evening gown +face mask fashion +fall cardigan cozy +fall denim jacket +fall transitional trench coat +fanny pack +fathers day shirt +faux fur coat +faux leather clutch bag +faux leather jacket +faux leather jacket cropped +faux leather jacket moto style +faux leather leggings +faux leather leggings matte +faux leather moto jacket +faux leather moto jacket belted +faux leather pants +faux leather skirt +faux leather wallet +festival concert crop top +festival concert jumpsuit +festival outfit +field jacket +first date outfit +fisherman knit +fisherman knit sweater +fishing shirt +fishing shirt long sleeve +fishnet top +flannel shirt +flare blouse +flare jeans +flare pants +flare pants velvet +flare sleeved blouse +fleece jacket +fleece 
jacket winter +fleece lined leggings +fleece-lined leggings +fleece mid layer +fleece pullover +floral blouse +floral jacket +floral maxi dress +flowy chiffon dress +flutter sleeve +flutter sleeve top +formal attire +formal dress +formal gown +four-way stretch black yoga pants +four-way stretch denim jeans +four-way stretch elastic band +four-way stretch maternity leggings +four-way stretch tall slim fit pants +four-way stretch yoga pants +fringe bag +geometric pattern sweater +gingham dress +gingham pattern dress +gingham picnic dress +gingham summer dress +glasses frames +gold jewelry +gold necklace +gold sequin top +golf shirt +golf shirt uv protection +gothic black boots +gothic lace-up corset top +gothic velvet dress +graduation ceremony tailored suit +graduation dress +graphic hoodie +graphic tee +graphic t-shirt +graphic t-shirt vintage wash +gray hoodie +green cargo pants +grunge distressed jeans +grunge flannel shirt +gym bag +gym leggings with pocket +gym outfit +gym towel +gym workout compression top +gym workout running shorts +gym workout sports bra +hair towel +halloween costume ideas +halter neck top +hanger space saving +heels +hidden pocket fanny pack +hidden pocket infinity scarf +hidden pocket passport holder vest +hidden pocket travel belt +hidden pocket travel vest +hidden pocket women's travel vest +high heel compatible cocktail dress +high heel compatible dress sandals +high heel compatible formal dress +high heel compatible formal jumpsuit +high heel compatible shoe inserts +high neck dress +high rise leggings +high rise pants +high rise wide leg pants +high-waisted denim jeans +high-waisted denim shorts +high waisted jeans +high-waisted jeans +hiking boots +hiking outfit +hiking pants +hiking pants women +hiking shirt convertible sleeves +hiking shoes +hiking socks +hiking trail backpack +hiking trail raincoat +hippie bell bottoms +hippie tie-dye dress +holiday season formal gown +holiday season velvet dress +homecoming dress +homewear 
+hooded jacket +hoodie +houndstooth coat +instagram outfit +insulated base layer +insulated boots +insulated gloves +interview clothes +invisible socks +ivory dress +ivory wedding dress +jacket +jeans +jean shorts +jeggings +jewel tones dress +jewel tones emerald +jogger pants +jumpsuit +kawaii crop top +kawaii mini dress +keyhole top +khaki green backpack +khaki green canvas tote +khaki green cargo pants utility +khaki green military jacket +khaki green ripstop convertible pants +khaki green ripstop fabric trousers +kids cotton pajamas set +kids' cotton pajamas set +kimono robe +knee brace +knee high boots +knit cardigan +knit hat +knit scarf +knit sweater +knitwear +lace bralette +lace dress +lantern sleeve +lantern sleeve top +laptop backpack +laptop bag +layering hoodie +leather bag +leather handbag +leather jacket +leather sandals +leather shoes +leather skirt +leggings +leopard print blouse +light jacket +lightweight jacket +lightweight scarf +linen blend maxi dress +linen blend trousers +linen dress +linen fabric breathable +linen pants +linen shorts +linen short set +linen shorts summer +linen trousers +litter box +loafers +long blazer +long coat +long sleeve top +long tunic top +loose fit boyfriend jeans +loose fit cap +loose fit cardigan cashmere +loose fit graphic hoodie streetwear +loose fit graphic print t-shirt +loose fit graphic t-shirt +loose fit knit sweater +loose fit summer pure cotton t-shirt +loose fit tunic +lounge wear bodysuit +lounge wear pajamas set +low rise bikini bottom +low rise jeans +low rise shorts +maternity casual friday polo shirt +maternity dress +maternity nursing bra +maternity pillow +maternity robe silk +maternity support band leggings +maternity support leggings +maternity support leggings black +maternity swimsuit +maternity tall trousers +maternity trousers +maternity wear +maternity wide leg denim jeans +maternity yoga Pants +mauve pink cardigan +maxi dress +maxi skirt +men's athletic running shorts +men's dress shirt 
+men's running shorts +men's suit vest +merino wool base layer +merino wool glove liners +merino wool hiking socks +merino wool ski socks +merino wool socks +merino wool socks hiking +merino wool thermal base layer top +merino wool thermal leggings base layer +merino wool thermal underwear +mesh athletic shorts +mesh athletic tee +mesh shirt +mesh shoes +metallic leggings +Microfiber Robe +Microfiber Towel Quick-Dry +midi cocktail dress +midi dress +mid rise bootcut jeans +mid rise skirts +mid-season waterproof shell jacket +military field Jacket +military green jacket +military jacket +mini dress +minimalist high rise pants +minimalist linen dress +minimalist linen trousers +minimalist wallet +mini skirt +mock neck shirt +modest long sleeve top +modest swimsuit one-piece +moisture-wicking cycling jersey +moisture-wicking golf shirt +moisture-wicking golf shirt men's +moisture-wicking gym towel +moisture-wicking hiking trail t-shirt +mom outfit +monochrome outfit +mother bride dress +mothers day gift +moto jacket +mustard yellow top +nautical pea coat +nautical striped sweater +navy blazer +navy blue cashmere beanie +navy blue cashmere blanket +navy blue cashmere blend scarf +navy blue cashmere scarf +navy blue cashmere winter coat +neon pink crop top +neutral tones beige +neutral tones clothing +new mom clothes +new years eve dress +no show socks +notch lapel blazer +nurse scrubs +nursing bra +nursing dress +nylon raincoat waterproof +nylon running shorts +nylon shorts +odor control athletic headbands +odor control athletic socks +odor control men's athletic socks +odor control running shoes insoles +odor control shoe spray +odor control trash can +office wear +office wear pencil skirt +office wear sheath dress +off-shoulder dress +off shoulder top +off white sneakers +olive green jacket +olive jacket +olive utility jacket +one-piece swimsuit +one-shoulder top +organic cotton baby blanket +organic cotton baby wear +organic cotton crew neck tee +organic cotton crew 
neck tee vintage +organic cotton crew neck t-shirt +organic cotton t-shirt +organic cotton underwear +organic cotton washcloth +organic cotton white t-shirt +organza top +oversized cardigan +oversized hoodie +oversized knit sweater +oversized puffer vest +oversized pullover hoodie +oversized scarf +oversized t-shirt +oversized turtleneck sweater +oversized zip up hoodie +paisley print +paisley print boho +pajamas +pajamas set +palazzo pants +pants +parent-child matching outfits +party dress +pastel purple sweater +patterned jumpsuit +patterned scarf +pea coat +peak lapel formal +pencil skirt +pencil skirt stretch +pet camera +peter pan collar +petite a-line skirt +petite ankle boots +petite jeans +petite summer dress +petite summer linen shorts +petite tailored business suit +petite tailored suit jacket +petite tailored trousers wrinkle-free +petite wedding guest midi dress +photoshoot outfit +pink sweater +pink sweater aesthetic +plaid pants +plaid skirt +plaid trousers +platform boots +platform sneakers +play clothes +pleated tennis skirt +plus size blouse +plus size denim jeans +plus size petite jeans +plus size swim dress +plus size tunic tops +plus size tunic tops for leggings +plus size winter fleece lined leggings +plus size winter gloves +plus size winter hiking boots +plus size winter parka jacket +plus-size women's clothing +polarized sunglasses +polka dot dress +polo shirt +polyester fleece jacket +polyester running Shorts +polyester shorts +postpartum outfit +pregnancy pillow +preppy plaid skirt +preppy plaid tennis skirt +preppy sweater vest +prom dress +puffer jacket +puffer vest +puff sleeve blouse +puff sleeve dress +pullover hoodie +punk faux leather skirt +punk studded belt +pure cotton t-shirt +pure linen bath towel +pure linen short set +pure linen short set beach vacation +pure linen summer dress maxi +pure linen tablecloth +pure linen wide leg pants +quick-dry shirt +quick-dry towel +quilted jacket +rain boots +raincoat +rain jacket +rainy 
season anorak jacket +rainy season waterproof boots +rayon blouse +rayon blouse print +rayon jumpsuit +rayon jumpsuit summer +Recycled Fabric Hoodie +recycled fabric sneakers +recycled polyester bag +recycled polyester gym bag +recycled polyester laptop sleeve +recycled polyester reusable shopping bag +recycled polyester running jacket +recycled polyester swim shorts +recycled polyester waterproof shell jacket +red silk blouse +relaxed fit denim jacket +relaxed fit denim shorts +relaxed fit sweater +relaxed fit sweatpants +resort wear +retro flare jeans +retro patterned jumpsuit +ribbed tank top +riding boots +ripstop cargo trousers +robe +romantic chiffon blouse +romantic puff sleeve blouse +romantic silk blouse +romper +rose gold jewelry +round sunglasses +ruffle skirt +ruffle sleeve +ruffle sleeve top +running jacket +running leggings +running shorts +running shorts with lining +rust colored sweater +rust midi skirt +sage green lounge set +sandals +satin cami top +satin robe +scarf +scented candle +school uniform +school uniform blazer +school uniform polo shirt +scoop neck tank +scrunchie +seamless yoga leggings +semi-formal evening gown +sequin cocktail dress +shaving razor +shawl collar cardigan +sheath dress +sheer organza top +shell jacket +sherpa jacket +shimmering metallic crop top +shimmering metallic leggings +shimmering metallic leggings yoga +shimmering metallic party dress +shimmering metallic phone case +shirt +shoe inserts +shoe laces +shorts +short set +silk blouse +silk dress +silk pajamas +silk pajamas set +silk robe +silk slip dress +silver metallic heels +skiing fleece mid layer +skiing thermal underwear +skiing trip insulated base layer +skinny fit jeggings +skinny fit pants +skinny jeans +skinny pants +ski outfit +skirt +skirt suit +ski socks +sleeveless dress +sleeveless summer dress +slim fit ankle pants +slim fit blazer wool +slim fit button-down shirt +slim fit trousers +slip dress +slip on sneakers +smart casual men +snake print boots 
+sneaker matching ankle socks +sneaker matching crew socks +sneaker matching invisible socks +sneaker matching low cut socks +sneaker matching no show socks +sneaker matching shoe laces +sneakers +sock boots +sofa bed +soft robe +spaghetti strap cami +spaghetti strap dress +Spandex Compression Bra +spandex cycling shorts +sports bra +sports bra high impact +sports top +sportswear outfit +spring blouse floral +spring lightweight floral jacket +spring trench coat +square neck top +stain-resistant apron cooking +stain-resistant boys play pants +stain-resistant kids' play clothes +stain-resistant kids school uniform +stain-resistant placemats +stain-resistant white t-shirt +stiletto boots +stiletto heels +stiletto protectors +straight leg ankle pants +straight leg ankle pants office wear +straight leg ankle pants stretch +straight leg anti-chafing pants +straight leg jeans +straight leg pants +straight leg trousers work +straight leg trousers workwear +strapless dress +straw hat +streetwear graphic hoodie +streetwear hoodie men +streetwear oversized hoodie +streetwear zip-up hoodie +stretch jeans +stretch jeans straight leg +striped shirt +striped shirt navy +striped sweater +studded belt +suede ankle boots +suede boots +suede loafers +suede loafers casual +suit +suit jacket +suit men +summer quick-dry UPF 50 shirt +summer strapless dress +summer sundress floral +sundress +sun hat +sun hat wide brim +sun protection clothing +sun protective clothing +support belt +sweater +sweater dress +sweater vest +sweatpants +sweatshirt +sweatsuit +sweetheart neckline dress +swim dress +swim shorts +swimsuit +swim trunks +tailored shirt +tailored suit +tailored suit jacket +tailored suit slim fit +tall slim fit business casual blazer +tall slim fit men's dress shirt +tall slim fit men's linen shirt +tall slim fit men's suit vest +tall slim fit office wear blazer +tall slim fit tie +tall slim fit trousers +tall straight leg pants +tank top +tassel maxi skirt +teacher clothes +teal 
blouse +teal satin blouse +temperature regulating mattress pad +temperature regulating pajamas +temperature regulating pillow +temperature regulating silk pajamas +temperature regulating sleepwear +temperature regulating winter socks +tennis skirt +terracotta linen pants +terracotta pants +terry cloth bathrobe +terry cloth shorts +terry cloth sweatshirt +thermal base layer +thermal base layer top +thermal long sleeve top +thermal skiing jacket +thermal underwear +thigh shorts anti-chafing +tie +tie-dye bucket hat +tie-dye dress +tie-dye oversized knit sweater +tie-dye oversized sweatshirt +tie-dye socks pastel colors +tie-dye summer maxi dress +tie-dye water bottle +tie-waist pants +tiktok outfit +toddler crib +tote bag +track suit +track suit quick-dry +transitional light jacket +transitional sweater vest +Travel attire +travel blazer +travel blazer lightweight +travel clothes +travel lightweight scarf +travel pants +travel suit +travel vest +travel wrinkle-free pants +trench coat +tropical climate breathable shorts +tropical climate sandals +Tropical Climate Sandals Leather +tropical print shirt +trousers +trumpet sleeve +trumpet sleeve top +t-shirt +tube top +tulle overlay skirt +tulle skirt +tulle skirt party +tunic +tunic top +tunic tops +tunic tops loose fit +turtleneck sweater +tweed blazer +two piece set +ugly christmas sweater +umbrella +underwear +utility cargo pants +utility cargo skirt +UV protection face mask +UV protection fishing shirt +UV protection gardening hat +UV protection sun hat +UV protection women's sun gloves +vacation outfits +valentines day outfit +velvet cocktail party dress +velvet dress +velvet heels +velvet heels formal +velvet holiday dress +velvet midi skirt +velvet skirt +vest +vintage denim jacket +vintage distressed denim jacket +vintage leather skirt +viscose cami top +viscose scarf patterned +v-neck t-shirt +waffle knit henley +waist trainer +wallet +waterproof boots +waterproof hooded jacket +waterproof raincoat +waterproof 
shell jacket +waterproof spray +Wearing small clothes +wedding guest cocktail dress +wedding guest dress +wedding guest formal attire +wedding guest midi dress +wedding guest slip dress +western cowboy boots +western denim shirt +white cotton t-shirt +white linen pants +wide brim hat +wide leg jeans +wide leg linen pants +wide leg palazzo trousers +wide leg pants +wide leg trousers +window film +winter fleece-lined leggings +winter gloves +Winter outfits +winter parka +winter parka down +winter puffer jacket +women's summer maxi dress +wool blazer +wool blend long coat +wool coat +wool coat tall +wool pea coat +wool socks +wool winter coat +work from home +workout clothes +workwear trousers +wrap dress +wrinkle-free business shirt +wrinkle-free business travel suit +wrinkle-free fabric spray +wrinkle-free pants +wrinkle-free travel blazer +wrinkle-free travel dress +wrinkle-free travel pants +wrinkle release spray +y2k low-rise baggy jeans +y2k low-rise cargo pants +y2k low rise jeans +y2k tube top +yoga leggings +yoga mat +yoga outfit +yoga pants +yoga pants high-waisted +yoga shorts +zebra print pants +zipper boots +zip-up hoodie +zoom shirt diff --git a/scripts/evaluation/queries/fashion_quries__high_quality.txt.v2.uniq.trans b/scripts/evaluation/queries/fashion_quries__high_quality.txt.v2.uniq.trans new file mode 100644 index 0000000..4c164cd --- /dev/null +++ b/scripts/evaluation/queries/fashion_quries__high_quality.txt.v2.uniq.trans @@ -0,0 +1,1232 @@ +抽象打印顶部 +丙烯酸无檐便帽 +亚克力豆豆帽 +丙烯酸针织帽 +亚克力针织帽 +机场装备 +A字裙 +a字裙 +全季旅行裤 +四季旅行裤 +全季轻便背心 +踝靴 +及踝连衣裙 +及踝牛仔裤 +及膝牛仔裤 +及踝长裙 +齐踝长裙 +脚踝裤 +脚踝裤不起皱 +脚踝长裤无皱纹 +短袜 +白色脚踝袜 +防寒茄克衫 +防擦车身滑杆 +防擦伤运动短裤 +防擦伤裤 +防擦伤长裤 +防擦伤跑步短裤 +防摩擦无缝瑜伽短裤 +防擦伤女式跑步短裤 +臂袖 +不对称陀螺 +运动休闲紧身裤 +运动紧身裤 +运动休闲文胸 +运动休闲运动内衣 +运动休闲运动套装 +运动型紧身短裤 +运动型跑步短裤 +运动装备 +运动短裤 +运动袜 +美利奴羊毛运动袜 +运动背心 +运动T恤 +秋季花呢上衣 +秋季运动夹克 +秋季风衣 +婴儿毯 +婴儿连体衣 +背包 +宽松牛仔裤 +布袋牛仔裤 +气球袖 +气球袖衬衫 +竹制袜子 +竹袜 +基本棉t恤 +基本棉T恤 +基本油箱顶部带肋 +基本款背心罗纹 +浴袍 +女式浴袍 +浴巾 +海滩掩护 +海滩遮盖 +沙滩巾 +沙滩伞 +海滩度假掩盖 +海滩度假装 +海滩度假套装 +海滩度假泳装 +海滩度假泳衣 +沙滩婚纱 
+毛线帽 +床单 +米色风衣 +喇叭裤 +腰带 +系带大衣 +皮带外套 +系带连身裤 +阔腿系带连身裤 +带口袋的系带连身裤 +腰带和服长袍 +系带亚麻混纺长裙 +系带摩托车夹克 +带腰带的摩托车夹克 +带口袋的系带风衣 +束腰训练器 +比基尼海滩 +主教袖 +主教袖上衣 +黑色靴子 +黑色迷你连衣裙 +黑色领带裙 +西装外套 +蓝色牛仔夹克 +船领毛衣 +紧身连衣裙 +连体衣 +波西米亚风格的长裙 +波西米亚长裙 +波西米亚流苏长裙 +波西米亚棉质束腰外衣 +波西米亚束腰棉 +短靴灯芯绒长裤 +马靴用短靴牛仔裤 +短靴牛仔裤 +复古短靴牛仔裤 +短靴裤 +开襟长裤 +牛仔靴开叉裤 +冬季靴子的短靴裤 +天鹅绒短靴裤 +短靴瑜伽裤 +四角裤 +平角内裤 +男朋友牛仔裤 +手镯套装 +透气短裤 +伴娘裙 +棕色皮靴 +渔夫帽 +酒红色连衣裙 +酒红色节日连衣裙 +勃艮第节日连衣裙 +焦橙色裤子 +烧焦的橙色裤子 +商务休闲夹克 +商务休闲斜纹棉布裤 +商务休闲奇诺 +商务休闲乐福鞋 +商务休闲女性 +商务套装 +商务旅行套装 +蝴蝶发夹 +蝶形上衣 +蝴蝶上衣 +纽扣衬衫 +针织毛衣 +女士背心 +吊带背心 +卡米上衣 +帆布手提包 +帽子 +帽 +帽袖 +帽套顶部 +帽袖上衣 +开襟毛衣 +工装裤 +工装裤隐藏口袋 +载货裙 +羊绒混纺开衫 +羊绒开衫 +羊绒围巾 +羊绒衫 +米色羊绒衫 +米色羊绒毛衣 +休闲日间连身裤 +休闲日服 +休闲日间背心 +休闲日背心 +休闲星期五马球衫 +休闲裤 +休闲马球衫 +休闲Polo衫 +休闲周末装 +猫床 +炭灰色大衣 +炭灰色涂层 +格子裙 +切尔西靴 +雪纺衬衫 +雪纺衬衫透明 +雪纺裙 +雪纺围巾轻盈 +斜纹棉布裤 +巧克力棕色靴子 +厚实针织毛衣 +厚底靴 +经典合身上衣 +经典合身运动夹克 +经典合身马球衫 +经典合身Polo衫 +经典风衣高 +经典羊毛大衣 +清洁溶液 +闭趾高跟鞋 +手拿包 +科切拉服装 +外套 +钴蓝色连衣裙 +鸡尾酒裙 +露肩上衣 +冷肩衬衫 +防寒保暖靴 +寒冷天气隔热靴 +寒冷天气羊毛袜 +拼色连帽衫 +通勤者服装 +紧身防擦伤跑步紧身裤 +压缩臂袖 +压缩式运动装备 +压缩运动装备 +压缩胸罩 +压缩健身房锻炼跑步短裤 +膝关节加压支架 +紧身裤 +紧身短裤 +压力袜 +紧身紧身裤 +冬季紧身裤 +压缩式上衣 +紧身上衣氨纶 +压缩瑜伽运动胸罩 +演唱会服装 +可转换背包旅行包 +敞篷连衣裙 +敞篷登山裤 +敞篷徒步裤 +可转换笔记本电脑包 +可转换颈枕毯 +敞篷裤 +可转换袖衬衫 +可转换袖UPF 50衬衫 +可转换拉链登山裤 +灯芯绒运动夹克 +灯芯绒秋裤 +灯芯绒秋季长裤 +条绒夹克 +灯芯绒长裤 +灯芯绒裙子 +灯芯绒迷你裙 +束身衣上衣 +黑色天鹅绒紧身胸衣上衣 +黑色天鹅绒紧身胸衣 +棉芯印花长裙 +Cottagecore印花长裙 +棉芯针织开衫 +Cottagecore针织开衫 +棉绒布衬衫 +棉袄 +棉睡衣 +棉质背心 +棉背心 +棉质t恤 +白色棉质t恤 +白色棉质T恤 +棉质上衣 +情侣装 +掩盖 +牛仔靴 +奶油色开衫 +奶油色针织毛衣 +奶油针织毛衣 +奶油色毛衣 +圆领运动衫 +圆领T恤 +钩针上衣 +露脐连帽衫 +露脐运动文胸 +及膝运动内衣 +露脐上衣 +斜身皮包 +斜挎包 +游轮服装 +曲线贴合腰带 +曲线合身高腰牛仔裤 +曲线合身的长裙 +修身长裙 +曲线合身的紧身牛仔裤 +修身牛仔裤 +曲线合身的直筒牛仔裤 +曲线合身的弹力牛仔裤 +曲线合身的弹力铅笔裙 +曲线合身的裤子 +修身长裤 +曲线合身阔腿牛仔裤 +修身宽腿牛仔裤 +衬垫衬垫 +镂空连衣裙 +骑行短裤 +深色学术运动夹克 +黑暗学院开拓者 +深色学术格子裤 +深色学院格子长裤 +深色学术羊毛大衣 +约会之夜紧身连衣裙 +约会晚礼服 +约会之夜迷你裙 +约会之夜迷你连衣裙 +约会装 +牛仔夹克 +破旧牛仔夹克 +牛仔夹克破旧 +大号牛仔夹克 +牛仔夹克超大 +牛仔牛仔裤 +牛仔裤 +牛仔衬衫 +牛仔短裤 +牛仔裙 +牛仔裙midi +牛仔裙Midi +破旧牛仔夹克 +破旧牛仔裤 +碎花连衣裙 +狗毯 +狗大衣 +狗笼垫 +狗用背带 +狗绳 +狗毛巾 +狗玩具 +狗食 +双排扣大衣 +双排扣外套 +羽绒服 +羽绒背心 +连衣裙 +连衣裙凉鞋 +正装凉鞋 +衬衫 +行李袋 +尘土飞扬的玫瑰色连衣裙 +玫瑰粉连衣裙 +尘土飞扬的玫瑰色丝绸吊带裙 +尘土飞扬的玫瑰色天鹅绒鸡尾酒会礼服 +尘土飞扬的玫瑰天鹅绒缎带 +尘土飞扬的玫瑰天鹅绒发髻 +尘土飞扬的玫瑰色天鹅绒丝绸连衣裙 +尘土飞扬的玫瑰色天鹅绒吊带裙 +初秋灯芯绒夹克 +泥土色调橄榄色 +土色调服装 
+土色调套装 +电蓝色迷你裙 +绣花上衣 +祖母绿上衣 +翡翠绿上衣 +符合人体工程学的办公椅 +晚礼服 +晚礼服高跟鞋 +晚礼服 +口罩时尚 +秋季羊毛衫舒适 +秋季舒适开衫 +秋季牛仔夹克 +秋季过渡风衣 +腰包 +父亲节衬衫 +人造毛皮大衣 +人造革手包 +人造皮夹克 +短款人造皮夹克 +摩托风格人造皮夹克 +人造皮革紧身裤 +哑光人造皮革紧身裤 +仿皮哑光紧身裤 +人造皮革摩托车夹克 +人造皮革摩托车夹克,带腰带 +人造皮裤 +人造革长裤 +人造革裙子 +人造皮革钱包 +节日音乐会露顶 +节日音乐会Crop Top +节日音乐会连身裤 +节日音乐会连体衣 +节日服装 +野战夹克 +初次约会服装 +渔夫针织 +渔夫针织毛衣 +钓鱼衬衫 +长袖钓鱼衫 +钓鱼衫长袖 +渔网上衣 +法兰绒衬衫 +喇叭形上衣 +喇叭牛仔裤 +喇叭裤 +天鹅绒喇叭裤 +丝绒喇叭裤 +喇叭袖衬衫 +抓绒夹克 +冬季羊毛夹克 +羊毛夹克冬季 +羊毛衬里紧身裤 +羊毛中层 +羊毛套头衫 +花衬衫 +花夹克 +印花长裙 +花朵长裙 +飘逸雪纺连衣裙 +颤动袖 +颤振袖顶 +飘逸袖上衣 +礼服 +正装 +礼服 +四向弹力黑色瑜伽裤 +四向弹力牛仔牛仔裤 +四向拉伸弹性带 +四向弹力孕妇紧身裤 +四向弹力高修身长裤 +四向弹力瑜伽裤 +流苏包 +流苏袋 +几何图案毛衣 +条纹连衣裙 +金黄色连衣裙 +条纹野餐裙 +Gingham野餐裙 +条纹夏装 +金黄色连衣裙 +眼镜架 +黄金首饰 +黄金珠宝 +金项链 +金色亮片上衣 +高尔夫球衫 +高尔夫球衫紫外线防护 +哥特式黑色靴子 +哥特式系带紧身胸衣 +哥特式天鹅绒连衣裙 +毕业典礼定制西装 +毕业礼服 +图案连帽衫 +图案T恤 +图案t恤 +复古水洗图案t恤 +图案T恤复古水洗 +灰色连帽衫 +绿色工装裤 +破旧牛仔裤 +破烂牛仔裤 +脏兮兮的法兰绒衬衫 +粗面法兰绒衬衫 +健身包 +带口袋的健身紧身裤 +带口袋的健身房紧身裤 +运动服 +健身巾 +健身房毛巾 +健身房运动紧身上衣 +健身房运动短裤 +健身房锻炼跑步短裤 +健身房运动文胸 +健身房运动内衣 +发巾 +万圣节服装创意 +露背领上衣 +Halter领上衣 +节省衣架空间 +高跟鞋 +隐藏式口袋腰包 +隐藏式口袋无限围巾 +隐藏式口袋护照夹背心 +隐藏式口袋旅行带 +隐藏口袋旅行背心 +隐藏口袋女式旅行背心 +高跟兼容鸡尾酒会礼服 +高跟兼容连衣裙凉鞋 +高跟兼容的正式连衣裙 +与高跟鞋兼容的正式连身裤 +与高跟鞋兼容的鞋垫 +高领连衣裙 +高腰紧身裤 +高腰裤 +高腰阔腿裤 +高腰阔腿长裤 +高腰牛仔牛仔裤 +高腰牛仔短裤 +高腰牛仔裤 +登山靴 +徒步旅行装备 +登山裤 +徒步裤女士 +徒步女裤 +登山衫可转换袖 +徒步旅行衬衫可转换袖 +旅游鞋 +登山袜 +徒步旅行背包 +远足径雨衣 +嬉皮士喇叭裤 +嬉皮士扎染连衣裙 +节日礼服 +假日季正式礼服 +节日天鹅绒连衣裙 +假日天鹅绒连衣裙 +返校节礼服 +家居服 +外套 +连帽衫 +千鸟皮大衣 +Houndstooth大衣 +instagram套装 +隔热基层 +保温基层 +绝缘靴 +绝缘手套 +隔热手套 +面试服装 +隐形袜子 +象牙色连衣裙 +象牙色婚纱 +夹克 +牛仔裤 +牛仔短裤 +牛仔样式打底裤 +宝石色连衣裙 +珠宝色调连衣裙 +宝石色调祖母绿 +慢跑裤 +束脚裤 +连体衣 +kawaii作物顶部 +Kawaii作物顶部 +kawaii迷你连衣裙 +Kawaii迷你连衣裙 +钥匙孔顶部 +卡其色绿色背包 +卡其色绿色帆布手提包 +卡其色绿色工装裤实用 +卡其色绿色军用夹克 +卡其绿色防刮敞篷裤 +卡其绿色耐磨面料长裤 +儿童棉睡衣套装 +和服 +护膝 +过膝长靴 +针织开衫 +针织帽 +针织围巾 +针织毛衣 +针织品 +蕾丝文胸 +蕾丝连衣裙 +灯笼袖 +灯笼袖上衣 +笔记本电脑背包 +笔记本电脑包 +分层连帽衫 +皮革包 +皮包 +皮夹克 +皮凉鞋 +皮鞋 +皮裙 +紧身裤 +豹纹衬衫 +薄夹克衫 +轻便夹克 +轻便围巾 +亚麻混纺长裙 +亚麻混纺长裤 +亚麻连衣裙 +亚麻织物透气 +亚麻裤 +亚麻短裤 +亚麻短套装 +夏季亚麻短裤 +亚麻短裤夏季 +亚麻裤 +猫砂盆 +乐福鞋 +长款运动夹克 +长大衣 +长袖上衣 +长款束腰上衣 +宽松男朋友牛仔裤 +宽松的帽子 +宽松开衫羊绒 +宽松开襟羊毛衫 +宽松图案连帽衫街头装 +宽松图案印花t恤 +宽松图案t恤 +宽松针织毛衣 +宽松夏季纯棉t恤 +宽松束腰外衣 +休闲服连体衣 +休闲睡衣套装 +休闲服睡衣套装 +低腰比基尼下装 +低腰牛仔裤 +低腰短裤 +孕妇休闲周五马球衫 +孕妇装 +哺乳文胸 +孕妇枕头 +孕妇枕 +真丝孕妇袍 
+丝绸孕妇服 +孕妇支撑带紧身裤 +孕妇支撑紧身裤 +黑色孕妇支撑紧身裤 +孕妇泳装 +孕妇长裤 +孕妇高长裤 +孕妇裤 +孕妇装 +孕妇阔腿牛仔牛仔裤 +孕妇瑜伽裤 +紫粉色开衫 +长裙 +超长连衣裙 +超长裙 +男式运动跑步短裤 +男士正装衬衫 +男士跑步短裤 +男式西装背心 +男士西装背心 +美利奴羊毛基层 +美利奴羊毛手套衬里 +美利奴羊毛登山袜 +美利奴羊毛徒步袜 +美利奴羊毛滑雪袜 +美利奴羊毛袜 +美利奴羊毛袜徒步旅行 +美利奴羊毛保暖基层顶层 +美利奴羊毛保暖紧身裤底层 +美利奴羊毛保暖内衣 +网眼运动短裤 +网眼运动T恤 +网眼衬衫 +网眼鞋 +金属紧身裤 +超细纤维长袍 +超细纤维毛巾快干 +midi鸡尾酒会连衣裙 +Midi鸡尾酒会礼服 +中长裙 +中腰短靴牛仔裤 +中号开襟牛仔裤 +中腰裙 +中季防水外套 +军用野战夹克 +军绿色夹克 +军用夹克 +迷你裙 +极简主义的高腰裤 +极简主义高帮长裤 +极简主义亚麻连衣裙 +极简主义亚麻长裤 +极简主义钱包 +迷你裙 +立领衬衫 +适中的长袖上衣 +朴素的连体泳衣 +一件式泳衣 +吸湿排汗的自行车运动衫 +吸湿高尔夫衬衫 +男式吸湿高尔夫衬衫 +吸湿运动毛巾 +吸湿排汗徒步旅行t恤 +妈妈装 +单色服装 +母亲新娘礼服 +母亲节礼物 +摩托车夹克 +芥末黄色上衣 +芥末黄上衣 +航海豌豆大衣 +航海泥炭大衣 +航海条纹毛衣 +海军蓝上衣 +海军运动夹克 +海军蓝羊绒无檐便帽 +海军蓝羊绒毯 +海军蓝羊绒混纺围巾 +海军蓝羊绒围巾 +海军蓝羊绒冬季大衣 +霓虹粉色露脐上衣 +中性色调米色 +中性色调服装 +新妈妈的衣服 +除夕夜礼服 +隐形袜 +无显示袜子 +凹口翻领上衣 +凹口翻领运动夹克 +护士擦洗 +哺乳胸罩 +护理服 +尼龙雨衣防水 +尼龙跑步短裤 +尼龙短裤 +气味控制运动头带 +气味控制运动袜 +气味控制男式运动袜 +气味控制跑鞋鞋垫 +气味控制鞋喷雾 +气味控制垃圾桶 +办公室服装 +办公室穿铅笔裙 +办公室穿紧身连衣裙 +露肩连衣裙 +露肩上衣 +米白色运动鞋 +橄榄绿夹克 +橄榄色夹克 +橄榄色多功能夹克 +橄榄色夹克 +连体式泳衣 +单肩上衣 +有机棉婴儿毯 +有机棉婴儿服 +有机棉圆领T恤 +有机棉圆领复古T恤 +有机棉圆领t恤 +有机棉t恤 +有机棉内衣 +有机棉毛巾 +有机棉白色t恤 +透明硬纱上衣 +超大开衫 +超大连帽衫 +超大针织毛衣 +超大羽绒背心 +超大套头衫连帽衫 +超大围巾 +超大t恤 +超大T恤 +超大高领毛衣 +超大拉链连帽衫 +佩斯利花纹 +佩斯利印花波西米亚风格 +睡衣 +睡衣套装 +阔腿裤 +裤子 +亲子配套服装 +派对礼服 +淡紫色毛衣 +图案连身衣 +图案围巾 +双排扣大衣 +翻领正装 +铅笔裙 +铅笔裙弹力 +宠物照相机 +小圆领 +娇小的a字裙 +娇小A字裙 +小脚靴 +小脚踝靴 +小脚牛仔裤 +娇小的夏装 +娇小的夏季亚麻短裤 +小巧的定制商务套装 +小巧的定制西装外套 +修身长裤,不起皱 +娇小的婚礼宾客中长裙 +摄影服装 +粉红色毛衣 +粉红色毛衣美学 +格子裤 +格子裙 +格子长裤 +厚底靴 +厚底运动鞋 +玩衣服 +打褶网球裙 +褶皱网球裙 +加大码衬衫 +加大码牛仔牛仔裤 +加大码小牛仔裤 +加大号泳装 +加大码泳装 +加大码束腰上衣 +加大码束腰外衣上衣搭配紧身裤 +加大码冬季羊毛衬里紧身裤 +加大码冬季手套 +加大码冬季登山靴 +加大码冬季大衣 +加大码女装 +偏光太阳镜 +圆点花纹服 +Polo衫 +涤纶羊毛夹克 +涤纶跑步短裤 +涤纶短裤 +产后装 +孕期枕头 +学院派格子裙 +Preppy格子裙 +预科生格子网裙 +预科生毛衣背心 +Preppy毛衣背心 +舞会礼服 +羽绒服 +羽绒背心 +蓬松袖衬衫 +泡泡袖衬衫 +蓬松袖连衣裙 +泡泡袖连衣裙 +连帽卫衣 +朋克仿皮裙 +朋克假皮革裙 +朋克风格的腰带 +朋克钉带 +纯棉t恤 +纯棉T恤 +纯亚麻浴巾 +纯亚麻短套装 +纯亚麻短套装海滩度假 +纯亚麻夏装长裙 +纯亚麻桌布 +纯亚麻阔腿裤 +快干衬衫 +快干毛巾 +绗缝夹克 +棉服 +雨靴 +雨衣 +防雨夹克 +雨季风衣 +雨季Anorak夹克 +雨季防水靴 +人造丝上衣 +人造丝衬衫印花 +人造丝连身衣 +夏季人造丝连身裤 +夏季人造丝连体衣 +再生纤维连帽衫 +再生纤维运动鞋 +再生聚酯袋 +再生聚酯健身袋 +再生聚酯笔记本电脑套 +再生聚酯可重复使用购物袋 +再生聚酯跑步夹克 +再生聚酯游泳短裤 +再生聚酯防水外套 +红色丝绸衬衫 +宽松的牛仔夹克 +宽松的牛仔短裤 +宽松版型牛仔短裤 +宽松版型毛衣 +宽松运动裤 +宽松版型运动裤 +度假服装 +复古喇叭牛仔裤 
+复古喇叭裤 +复古图案连身裤 +肋状罐顶 +罗纹背心 +马靴 +防撕裂工装裤 +长袍 +浪漫雪纺衬衫 +浪漫的泡泡袖上衣 +浪漫的丝绸衬衫 +浪漫丝绸衬衫 +连体衣 +玫瑰金首饰 +圆形太阳镜 +皱边裙 +泡泡袖 +褶边袖上衣 +运动夹克 +跑步夹克 +跑步紧身裤 +运动短裤 +带衬里的运动短裤 +带衬里的跑步短裤 +铁锈色毛衣 +锈色中裙 +锈米裙 +鼠尾草绿色休息室套装 +Sage绿色休息室套装 +凉鞋 +缎面卡米上衣 +缎面背心 +缎面长袍 +围巾 +香薰蜡烛 +校服 +校服上衣 +校服运动夹克 +校服马球衫 +校服Polo衫 +勺颈油箱 +大圆领背心 +发圈 +无缝瑜伽紧身裤 +半正式晚礼服 +亮片鸡尾酒会连衣裙 +亮片鸡尾酒会礼服 +剃须刀 +青果领开衫 +紧身连衣裙 +透明硬纱上衣 +短外套 +夏尔巴夹克 +夏尔巴人夹克 +闪闪发光的金属露脐上衣 +闪闪发光的金属紧身裤 +闪闪发光的金属紧身裤瑜伽 +闪闪发光的金属色派对裙 +闪闪发光的金属手机壳 +衬衫 +鞋嵌件 +鞋套 +鞋带 +短裤 +短集 +丝绸衬衫 +丝绸连衣裙 +丝绸睡衣 +真丝睡衣 +丝绸睡衣套装 +丝绸长袍 +丝绸吊带裙 +银色金属高跟鞋 +滑雪羊毛中层 +滑雪保暖内衣 +滑雪旅行保温基层 +紧身牛仔裤 +紧身裤 +紧身长裤 +紧身牛仔裤 +紧身裤 +滑雪用具 +裙子 +裙装 +滑雪袜 +滑雪短袜 +无袖连衣裙 +无袖夏装 +修身及踝长裤 +修身脚踝长裤 +修身运动夹克羊毛 +修身运动衫羊毛 +修身纽扣衬衫 +修身长裤 +吊带裙 +穿上运动鞋 +防滑运动鞋 +时髦休闲男士 +蛇纹靴 +运动鞋搭配脚踝袜 +运动鞋搭配工装裤 +运动鞋搭配隐形袜子 +运动鞋搭配低帮袜子 +运动鞋搭配未露面的袜子 +运动鞋配套鞋带 +运动鞋 +袜子靴 +沙发床 +柔软的长袍 +意大利面条表带卡米 +意大利面条吊带背心 +细肩带连衣裙 +氨纶压缩文胸 +氨纶自行车短裤 +氨纶骑行短裤 +运动文胸 +运动文胸高冲击力 +运动内衣高冲击力 +运动上衣 +运动装 +春季花衬衫 +春季花朵衬衫 +春季轻质花夹克 +春季风衣少女 +春季风衣 +方领上衣 +防污围裙烹饪 +防污男孩运动裤 +防污儿童游戏服 +防污儿童校服 +防污餐垫 +防污白色t恤 +细高跟靴 +细高跟鞋 +细高跟鞋护具 +直筒裤 +直筒及踝长裤办公服 +直筒及踝弹力裤 +直筒防擦伤裤 +直筒牛仔裤 +直筒裤 +直筒裤工作 +直筒裤工作服 +无肩带连衣裙 +草帽 +街头服饰图案连帽衫 +街头风衣男子连帽衫 +街头装连帽衫男式 +街头装超大连帽衫 +街头装拉链连帽衫 +弹力牛仔裤 +直筒弹力牛仔裤 +弹力牛仔裤直筒 +条纹衬衫 +深蓝色条纹衬衫 +条纹衬衫海军蓝 +条纹毛衣 +镶钉皮带 +绒面短靴 +绒面脚踝靴 +绒面靴 +绒面乐福鞋 +绒面休闲乐福鞋 +绒面乐福鞋休闲 +套装 +西装外套 +西装男 +男士西装 +夏季快干UPF 50衬衫 +夏季无肩带连衣裙 +夏季太阳花连衣裙 +夏季连衣裙花朵 +太阳裙 +遮阳帽 +宽边太阳帽 +防晒服 +支撑带 +毛衣 +毛衣连衣裙 +毛衣背心 +运动裤 +运动衫 +运动服 +甜心领口连衣裙 +泳装 +游泳短裤 +泳衣 +泳裤 +定制衬衫 +女式西服 +定制西装夹克 +定制西装修身 +定制修身西装 +高挑修身商务休闲夹克 +高挑修身男式衬衫 +高挑修身男式亚麻衬衫 +高挑修身男式西装背心 +高挑修身西装上衣 +高挑修身领带 +高挑修身长裤 +高腰修身长裤 +高直筒裤 +背心 +流苏长裙 +教师服装 +青色衬衫 +青色缎面衬衫 +温度调节床垫 +调温睡衣 +调温枕 +调温丝绸睡衣 +调温睡衣 +调温冬袜 +网球裙 +赤土亚麻长裤 +赤陶亚麻长裤 +赤土裤 +毛圈布浴袍 +毛巾布浴袍 +毛圈布短裤 +毛圈布运动衫 +Terry布运动衫 +热基层 +热基层顶部 +保暖长袖上衣 +保暖滑雪夹克 +保暖内衣 +防擦伤大腿短裤 +大腿短裤防擦伤 +系 +领带 +扎染桶帽 +扎染连衣裙 +扎染超大针织毛衣 +扎染大号运动衫 +扎染超大运动衫 +扎染袜子颜色柔和 +扎染夏季长裙 +扎染水瓶 +扎腰裤 +系带长裤 +tiktok服装 +幼儿床 +托特包 +运动服 +运动服快干 +过渡轻便夹克 +过渡性毛衣背心 +过渡款毛衣背心 +旅行服装 +旅行夹克 +轻便旅行夹克 +旅行服装 +旅行轻便围巾 +旅行裤 +旅行套装 +旅行背心 +旅行防皱裤 +旅行无皱纹长裤 +风衣 +热带气候透气短裤 +热带气候凉鞋 +热带气候凉鞋皮革 +热带印花衬衫 +裤子 +喇叭袖 +喇叭袖上衣 +小号袖上衣 +T恤 +抹胸上衣 +薄纱覆面裙 +薄纱裙 +薄纱裙派对 +束腰外衣 +束腰上衣 +宽松上衣 +束腰上衣宽松版型 
+高领毛衣 +粗花呢西装外套 +两件套 +难看的圣诞毛衣 +伞 +内衣 +多功能工装裤 +通用工装裤 +实用货物裙 +多功能货物裙 +紫外线防护口罩 +防紫外线钓鱼衫 +防紫外线园艺帽 +防紫外线太阳帽 +防紫外线女式防晒手套 +度假服装 +情人节服装 +天鹅绒鸡尾酒会礼服 +天鹅绒连衣裙 +天鹅绒高跟鞋 +天鹅绒高跟鞋正式 +天鹅绒高跟正式 +天鹅绒节日连衣裙 +天鹅绒中长裙 +丝绒迷笛裙 +天鹅绒裙子 +背心 +复古牛仔夹克 +复古旧牛仔夹克 +复古皮裙 +粘胶卡米上衣 +粘胶纤维背心 +粘胶围巾图案 +V领T恤 +华夫饼针织亨利 +华夫格针织亨利 +束腰 +钱包 +防水靴 +防水连帽夹克 +防水雨衣 +防水外壳 +防水外壳外套 +防水喷雾 +穿着小衣服 +婚礼宾客鸡尾酒会礼服 +婚礼宾客礼服 +婚礼宾客正装 +婚礼宾客中长裙 +婚礼宾客迷笛裙 +婚礼宾客吊带裙 +西部牛仔靴 +西式牛仔衬衫 +白色棉质t恤 +白色亚麻长裤 +宽边帽 +售宽裤筒牛仔裤 +阔腿亚麻裤 +宽腿亚麻长裤 +阔腿宫殿长裤 +阔腿Palazzo长裤 +阔腿裤 +窗膜 +冬季羊毛衬里紧身裤 +冬季手套 +冬季服装 +冬季派克大衣 +冬季公园 +冬季大衣羽绒服 +冬季帕卡羽绒服 +冬季羽绒服 +女式夏季长裙 +羊毛外套 +羊毛混纺长外套 +羊毛大衣 +羊毛大衣高 +羊毛豌豆大衣 +羊毛袜 +羊毛冬衣 +羊毛冬季大衣 +居家办公 +运动服 +工作裤 +工作服长裤 +裹身裙 +抗皱商务衬衫 +抗皱商务旅行套装 +抗皱织物喷雾 +抗皱长裤 +抗皱旅行夹克 +抗皱旅行裙 +抗皱旅行裤 +抗皱喷雾 +y2k低腰宽松牛仔裤 +y2k低腰工装裤 +y2k低腰牛仔裤 +Y2K低腰牛仔裤 +y2k管顶 +Y2K管顶 +瑜伽紧身裤 +瑜伽垫 +瑜伽服 +瑜伽裤 +高腰瑜伽裤 +瑜伽短裤 +斑马印花长裤 +拉链靴 +拉链连帽衫 +变焦衬衫 diff --git a/scripts/evaluation/queries/queries.txt b/scripts/evaluation/queries/queries.txt new file mode 100644 index 0000000..a91aad6 --- /dev/null +++ b/scripts/evaluation/queries/queries.txt @@ -0,0 +1,43 @@ +白色oversized T-shirt +falda negra oficina +red fitted tee +黒いミディ丈スカート +黑色中长半身裙 +فستان أسود متوسط الطول +чёрное летнее платье +修身牛仔裤 +date night dress +vacation outfit dress +minimalist top +streetwear t-shirt +office casual blouse +街头风T恤 +宽松T恤 +复古印花T恤 +Y2K上衣 +情侣T恤 +美式复古T恤 +重磅棉T恤 +修身打底衫 +辣妹风短袖 +纯欲上衣 +正肩白T恤 +波西米亚花朵衬衫 +泡泡袖短袖 +扎染字母T恤 +T-shirt Dress +Crop Top +Lace Undershirt +Leopard Print Ripped T-shirt +Breton Stripe T-shirt +V-Neck Cotton T-shirt +Sweet & Cool Bow T-shirt +Vacation Style T-shirt +Commuter Casual Top +Minimalist Solid T-shirt +Band T-shirt +Athletic Gym T-shirt +Plus Size Loose T-shirt +Korean Style Slim T-shirt +Basic Layering Top + diff --git a/tests/queries.txt b/tests/queries.txt deleted file mode 100644 index a91aad6..0000000 --- a/tests/queries.txt +++ /dev/null @@ -1,43 +0,0 @@ -白色oversized T-shirt -falda negra oficina -red fitted tee -黒いミディ丈スカート -黑色中长半身裙 -فستان أسود متوسط الطول -чёрное летнее платье -修身牛仔裤 -date night dress -vacation outfit 
dress -minimalist top -streetwear t-shirt -office casual blouse -街头风T恤 -宽松T恤 -复古印花T恤 -Y2K上衣 -情侣T恤 -美式复古T恤 -重磅棉T恤 -修身打底衫 -辣妹风短袖 -纯欲上衣 -正肩白T恤 -波西米亚花朵衬衫 -泡泡袖短袖 -扎染字母T恤 -T-shirt Dress -Crop Top -Lace Undershirt -Leopard Print Ripped T-shirt -Breton Stripe T-shirt -V-Neck Cotton T-shirt -Sweet & Cool Bow T-shirt -Vacation Style T-shirt -Commuter Casual Top -Minimalist Solid T-shirt -Band T-shirt -Athletic Gym T-shirt -Plus Size Loose T-shirt -Korean Style Slim T-shirt -Basic Layering Top - diff --git a/前端分面配置说明.md b/前端分面配置说明.md deleted file mode 100644 index 02f7c23..0000000 --- a/前端分面配置说明.md +++ /dev/null @@ -1,176 +0,0 @@ -# 前端分面配置说明 - -## 问题描述 - -tenant_id=170 的分面返回为空,原因是: -1. `category1_name` 字段在数据中为 None(这是数据问题) -2. `specifications.name` 字段在数据中使用首字母大写(如 "Color"、"Size"),而前端查询时使用小写("color"、"size"),导致 ES term 查询匹配失败 - -## 解决方案 - -采用前端配置方案,根据不同的 `tenant_id` 配置不同的分面字段。配置包括: -- **字段名**(field):ES 中的实际字段名,如 `specifications.Color` -- **显示标签**(label):前端显示的名称,如 "颜色"、"尺寸" -- **容器ID**(containerId):HTML 中用于显示分面的容器 ID,如 `colorTags` -- **查询参数**:size、type、disjunctive 等 - -## 配置文件 - -配置文件位置:`frontend/static/js/tenant_facets_config.js` - -### 配置结构 - -```javascript -const TENANT_FACETS_CONFIG = { - "租户ID": { - specificationFields: [ - { - field: "specifications.字段名", // ES字段名(必须与实际数据匹配,包括大小写) - label: "显示标签", // 前端显示名称 - containerId: "容器ID", // HTML容器ID - size: 20, // 返回的分面值数量 - type: "terms", // 分面类型 - disjunctive: true // 是否支持多选 - } - ] - } -}; -``` - -### 示例配置 - -#### tenant_id=162(使用小写) -```javascript -"162": { - specificationFields: [ - { - field: "specifications.color", - label: "Color", - containerId: "colorTags", - size: 20, - type: "terms", - disjunctive: true - }, - { - field: "specifications.size", - label: "Size", - containerId: "sizeTags", - size: 15, - type: "terms", - disjunctive: true - }, - { - field: "specifications.material", - label: "Material", - containerId: "materialTags", - size: 10, - type: "terms", - disjunctive: true - } - ] -} -``` - -#### 
tenant_id=170(使用首字母大写,没有material) -```javascript -"170": { - specificationFields: [ - { - field: "specifications.Color", // 注意:首字母大写 - label: "Color", - containerId: "colorTags", - size: 20, - type: "terms", - disjunctive: true - }, - { - field: "specifications.Size", // 注意:首字母大写 - label: "Size", - containerId: "sizeTags", - size: 15, - type: "terms", - disjunctive: true - } - // 注意:170 没有 material 分面 - ] -} -``` - -#### 示例:添加新租户(包含其他规格字段,如重量、包装方式) -```javascript -"新租户ID": { - specificationFields: [ - { - field: "specifications.Weight", // 重量 - label: "Weight", - containerId: "weightTags", // 需要在HTML中添加此容器 - size: 15, - type: "terms", - disjunctive: true - }, - { - field: "specifications.PackageType", // 包装方式 - label: "Package Type", - containerId: "packageTags", // 需要在HTML中添加此容器 - size: 10, - type: "terms", - disjunctive: true - } - ] -} -``` - -## 添加新租户配置步骤 - -1. **确定 ES 数据中的实际字段名** - - 检查 ES 中 `specifications.name` 的实际值(注意大小写) - - 例如:`"Color"` 或 `"color"` 是不同的字段 - -2. **在配置文件中添加配置** - ```javascript - "新租户ID": { - specificationFields: [ - { - field: "specifications.实际字段名", - label: "显示名称", - containerId: "容器ID", - size: 20, - type: "terms", - disjunctive: true - } - ] - } - ``` - -3. **在 HTML 中添加容器**(如果需要新的容器) - 在 `frontend/index.html` 的 Filter Section 中添加: - ```html -