From 267920e5f244a8f20a9aa8838b13d1619ba4db32 Mon Sep 17 00:00:00 2001
From: tangwang
Date: Tue, 31 Mar 2026 13:49:20 +0800
Subject: [PATCH] eval docs

---
 scripts/evaluation/README.md    | 145 ++++++++++++++++++++-----------------
 scripts/evaluation/README_zh.md | 109 ++++++++++++++++++++++++++++
 2 files changed, 172 insertions(+), 82 deletions(-)
 create mode 100644 scripts/evaluation/README_zh.md

diff --git a/scripts/evaluation/README.md b/scripts/evaluation/README.md
index 21aa5ce..50efc48 100644
--- a/scripts/evaluation/README.md
+++ b/scripts/evaluation/README.md
@@ -1,124 +1,105 @@
-参考资料:
+**Reference Materials:**
 
-1. 搜索接口:
+1. Search Interface:
 
 ```bash
 export BASE_URL="${BASE_URL:-http://localhost:6002}"
-export TENANT_ID="${TENANT_ID:-163}" # 改成你的租户ID
+export TENANT_ID="${TENANT_ID:-163}" # Change to your tenant ID
 ```
 ```bash
 curl -sS "$BASE_URL/search/" \
   -H "Content-Type: application/json" \
   -H "X-Tenant-ID: $TENANT_ID" \
   -d '{
-    "query": "芭比娃娃",
+    "query": "Barbie doll",
     "size": 20,
     "from": 0,
     "language": "zh"
   }'
 ```
 
-response:
+Response (truncated):
 {
   "results": [
     {
       "spu_id": "12345",
-      "title": "芭比时尚娃娃",
-      "brief": "高品质芭比娃娃",
-      "description": "详细描述...",
-      "vendor": "美泰",
-      "category": "玩具",
-      "category_path": "玩具/娃娃/时尚",
-      "category_name": "时尚",
-      "category_id": "cat_001",
-      "category_level": 3,
-      "category1_name": "玩具",
-      "category2_name": "娃娃",
-      "category3_name": "时尚",
-      "tags": ["娃娃", "玩具", "女孩"],
-      "price": 89.99,
-      "compare_at_price": 129.99,
-      "currency": "USD",
+      "title": "Barbie Fashion Doll",
       "image_url": "https://example.com/image.jpg",
-      "in_stock": true,
-      "sku_prices": [89.99, 99.99, 109.99],
-      "sku_weights": [100, 150, 200],
-      "sku_weight_units": ["g", "g", "g"],
-      "total_inventory": 500,
-      "option1_name": "color",
-      "option2_name": "size",
-      "option3_name": null,
-
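The search interface above can also be driven programmatically. The following Python sketch mirrors the curl example (same `/search/` endpoint, `X-Tenant-ID` header, and `query`/`size`/`from`/`language` body fields) and pages through results so larger sets can be collected, as the test-set construction task below requires 1k results per query. The helper names and the page size of 100 are illustrative choices, not part of the documented API.

```python
import json
import urllib.request

BASE_URL = "http://localhost:6002"  # matches the curl example above
TENANT_ID = "163"

def page_params(total, page_size=100):
    """Yield (from, size) pairs that cover `total` results page by page."""
    for offset in range(0, total, page_size):
        yield offset, min(page_size, total - offset)

def search_page(query, offset, size, language="zh"):
    """One POST to the /search/ endpoint; returns the `results` list."""
    req = urllib.request.Request(
        f"{BASE_URL}/search/",
        data=json.dumps({"query": query, "size": size,
                         "from": offset, "language": language}).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Tenant-ID": TENANT_ID},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("results", [])

def fetch_results(query, total=1000):
    """Page through the API until `total` results or the result set is exhausted."""
    hits = []
    for offset, size in page_params(total):
        batch = search_page(query, offset, size)
        hits.extend(batch)
        if len(batch) < size:  # short page: no more results
            break
    return hits
```

Paging with `from`/`size` keeps each request small; the loop stops early on a short page, so queries with fewer than 1k matches do not issue useless requests.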
-2. 重排服务:
+      "specifications":[],
+      "skus":[{"sku_id":"...
+...
+
+2. Reranking Service:
+```bash
 curl -X POST "http://localhost:6007/rerank" \
   -H "Content-Type: application/json" \
   -d '{
-    "query": "玩具 芭比",
-    "docs": ["12PCS 6 Types of Dolls with Bottles", "纯棉T恤 短袖"],
+    "query": "toy Barbie",
+    "docs": ["12PCS 6 Types of Dolls with Bottles", "Cotton T-shirt Short Sleeve"],
     "top_n": 386,
     "normalize": true
   }'
+```
 
-3. 基于指定字段查询:es_debug_search.py
+3. Query by Specific Fields: `es_debug_search.py`
 
-主要任务:
-1. 评估工具的建立:
-注意判断结果好坏,要用统一的评估工具,不要对每个query设定关键词匹配的规则来判断是否符合要求,这样不可扩展,这种方式且容易有误判还是复杂,并且不好扩展到其他搜索词。
-因此要做一个搜索结果评估工具、多个结果对比的工具,供后面的标注集合构建工具调用。工具内部实现可以是调用大模型来判断,说清楚什么叫高相关、基本相关、不相关:
-
-prompt:
-```bash
-你是一个电商搜索结果相关性评估助手。请根据用户查询(query)和每个商品的信息,输出该商品的相关性等级。
-
-## 相关性等级标准
-Exact 完全相关 — 完全匹配用户搜索需求。
-Partial 部分相关 — 主意图满足(同品类或相近用途),但次要属性(如颜色、风格、尺码等)有偏差或无法确认。
-Irrelevant 不相关 — 品类或用途不符,主诉求未满足。
-
-1. {title1} | {option1_value1}, {option2_value1}, {option3_value1}
-2. {title2} | {option1_value2}, {option2_value2}, {option3_value2}
-...
-50. {title50} | {option1_value50}, {option2_value50}, {option3_value50}
-
-## 输出格式
-严格输出 {input_nums} 行,每行仅Exact / Partial / Irrelevant三者之一。按顺序对应上述 50 个商品。不要输出任何其他任何信息
-```
+**Main Tasks:**
+
+1. **Establish Evaluation Tooling:**
+   To judge result quality, use one unified evaluation tool. Do not write per-query keyword-matching rules to decide whether results are acceptable: that approach does not scale, is prone to misjudgment, adds complexity, and cannot be extended to other search terms.
+   Therefore, build a search result evaluation tool and a multi-result comparison tool, both callable by the annotation set construction tool described below. Internally, the tools may call an LLM to judge relevance, with clear definitions of highly relevant, partially relevant, and irrelevant.
+   Prompt:
+   ```
+   You are an e-commerce search result relevance evaluation assistant. Based on the user query and each product's information, output the relevance level for each product.
+
+   ## Relevance Level Criteria
+   Exact — Fully matches the user's search intent.
+   Partial — Primary intent satisfied (same category or similar use, broadly in line with the search intent), but secondary attributes (e.g., color, style, size) deviate from the user's needs or cannot be confirmed.
+   Irrelevant — Category or use case does not match; the primary need is not satisfied.
+
+   ## Query
+   {query}
+
+   1. {title1} {option1_value1} {option2_value1} {option3_value1}
+   2. {title2} {option1_value2} {option2_value2} {option3_value2}
+   ...
+   50. {title50} {option1_value50} {option2_value50} {option3_value50}
+
+   ## Output Format
+   Output strictly {input_nums} lines, each containing only one of Exact / Partial / Irrelevant, corresponding in order to the 50 products above. Do not output anything else.
+   ```
+
-2. 测试集(结果标注)建立:
-@queries/queries.txt
-
-对其中每一个query:
-1. 召回:
-1)参考搜索接口 召回1k结果。
-2)遍历全库,得到每个spu的title,请求重排模型,进行全排序,得到top1w结果。注意重排模型打分一定要做缓存(本地文件缓存即可。query+title->rerank_score)。
-2. 对以上结果,拆分batch请求llm,进行结果标注。
-3. 请你思考如何存储结果、并利于以后的对比、使用、展示。
+2. **Test Set (Result Annotation) Construction:**
+   Source queries: `@queries/queries.txt`
+
+   For each query:
+   1. Retrieval:
+      - Use the search interface to retrieve the top 1k results.
+      - Traverse the entire product database, collect the title of every SPU, call the reranking model to rank them all, and keep the top 10k results. Reranking scores must be cached (a local file cache is sufficient; key = query + title -> rerank_score).
+   2. Split the above results into batches and call the LLM to annotate them.
+   3. Consider how to store the results so they are easy to compare, reuse, and display later.
+
-3. 评估工具页面:
-请你设计一个搜索评估交互页面。端口6010。
-页面主题:上方是搜索框,如果发起搜索,那么下方给出本次结果的总体指标以及top100结果(允许翻页)
-
-总体指标:
-| 指标 | 含义 |
-|------|------|
-| **P@5, P@10, P@20, P@50** | 前 K 个结果中「仅 3 相关」的精确率 |
-| **P@5_2_3 ~ P@50_2_3** | 前 K 个结果中「2 和 3 都算相关」的精确率 |
-| **MAP_3** | 仅 3 相关时的 Average Precision(单 query) |
-| **MAP_2_3** | 2 和 3 都相关时的 Average Precision |
-
-结果列表:
-按行列下来,每行左侧给每个结果找到标注值(三个等级。对结果也可以颜色标记),展示图片,title.en+title.en+首个sku的option1/2/3_value(分三行展示,这三行和左侧的图片并列)
-
-评测页面最左侧:
-queries默认是queries/queries.txt,填入左侧列表框,点击其中任何一个发起搜索。
+3. **Evaluation Tool Web Page:**
+   Design an interactive search evaluation page on port 6010.
+   Layout: a search box at the top. When a search is issued, the area below shows the overall metrics for this result set and the top 100 results (with pagination).
+
+   Overall Metrics:
+   | Metric | Meaning |
+   |--------|---------|
+   | **P@5, P@10, P@20, P@50** | Precision in the top K results, counting only Exact (level 3) as relevant |
+   | **P@5_2_3 ~ P@50_2_3** | Precision in the top K results, counting both Partial (level 2) and Exact (level 3) as relevant |
+   | **MAP_3** | Average Precision of the single query, counting only Exact (level 3) as relevant |
+   | **MAP_2_3** | Average Precision of the single query, counting both Partial (level 2) and Exact (level 3) as relevant |
+
+   Results List:
+   One result per row. On the left of each row, show the result's annotation level (one of the three levels, optionally color-coded) and its image; beside the image, show the title (en) and the first SKU's option1/2/3 values on three lines.
+
+   Leftmost panel of the evaluation page:
+   Queries default to `queries/queries.txt` and populate a list box on the left. Clicking any query triggers its search.
+
-4. 批量评估工具脚本
-给一个批量执行脚本,跑完所有query,进行各维度结果的汇总,生成报告,报告名称带上时间标记和一些关键信息。
+4. **Batch Evaluation Tool:**
+   Provide a batch execution script that runs all queries and aggregates the results along each dimension.
+   Additionally, create a batch evaluation page. Clicking a "Batch Evaluation" button searches all queries in turn, then aggregates the overall metrics and generates a report. The report name should carry a timestamp and some key identifying information. Also record the main search program's `config.yaml` as it was at evaluation time.
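The precision and AP metrics defined in the table above reduce to a few lines of code. A sketch, assuming annotation labels are mapped to 3 = Exact, 2 = Partial, 1 = Irrelevant, and that the batch-level MAP is the mean of per-query AP values:

```python
def p_at_k(labels, k, relevant=(3,)):
    """P@K: fraction of the top-K results whose label counts as relevant."""
    return sum(1 for label in labels[:k] if label in relevant) / k

def average_precision(labels, relevant=(3,)):
    """AP for one query: mean of P@i over the ranks i holding a relevant result."""
    hits, total = 0, 0.0
    for i, label in enumerate(labels, start=1):
        if label in relevant:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def mean_average_precision(runs, relevant=(3,)):
    """MAP over the batch: average the per-query AP across all annotated queries."""
    return sum(average_precision(r, relevant) for r in runs) / len(runs)

# The _2_3 variants simply widen the relevant set to Partial + Exact:
labels = [3, 1, 2, 3, 1]
p5_exact = p_at_k(labels, 5)                   # counts only Exact
p5_2_3 = p_at_k(labels, 5, relevant=(2, 3))    # counts Partial and Exact
```

Keeping `relevant` as a parameter means one implementation covers all eight table entries: P@K vs P@K_2_3 and MAP_3 vs MAP_2_3 differ only in the label set passed in.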
+   Carefully design how to switch between the two modes (single-query evaluation vs batch evaluation) on the same port, so that one service can host the two different interactive views.
+   Batch evaluation focuses on the aggregated metrics across all search terms.
+   Record the evaluation timestamp, the configuration file in effect at that time, and the corresponding results. Keep every historical evaluation record, and make it possible to look up, for each run, its configuration file and the associated metrics.
+
+The above is my overall design, but it may have gaps. Understand my requirements at a higher level: you have sufficient freedom to adjust the design where appropriate, drawing on best practices from automated search evaluation frameworks, to produce a better design and implementation.
\ No newline at end of file
diff --git a/scripts/evaluation/README_zh.md b/scripts/evaluation/README_zh.md
new file mode 100644
index 0000000..02cee86
--- /dev/null
+++ b/scripts/evaluation/README_zh.md
@@ -0,0 +1,109 @@
+参考资料:
+
+1. 搜索接口:
+
+```bash
+export BASE_URL="${BASE_URL:-http://localhost:6002}"
+export TENANT_ID="${TENANT_ID:-163}" # 改成你的租户ID
+```
+```bash
+curl -sS "$BASE_URL/search/" \
+  -H "Content-Type: application/json" \
+  -H "X-Tenant-ID: $TENANT_ID" \
+  -d '{
+    "query": "芭比娃娃",
+    "size": 20,
+    "from": 0,
+    "language": "zh"
+  }'
+```
+
+response:
+{
+  "results": [
+    {
+      "spu_id": "12345",
+      "title": "芭比时尚娃娃",
+      "image_url": "https://example.com/image.jpg",
+      "specifications":[],
+      "skus":[{"sku_id":" ...
+...
+
+2. 重排服务:
+curl -X POST "http://localhost:6007/rerank" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "query": "玩具 芭比",
+    "docs": ["12PCS 6 Types of Dolls with Bottles", "纯棉T恤 短袖"],
+    "top_n": 386,
+    "normalize": true
+  }'
+
+
+3. 基于指定字段查询:es_debug_search.py
+
+
+主要任务:
+1. 评估工具的建立:
+注意判断结果好坏,要用统一的评估工具,不要对每个query设定关键词匹配的规则来判断是否符合要求:这种方式不可扩展、容易误判、实现复杂,也难以推广到其他搜索词。
+因此要做一个搜索结果评估工具、多个结果对比的工具,供后面的标注集合构建工具调用。工具内部实现可以是调用大模型来判断,说清楚什么叫高相关、基本相关、不相关:
+
+prompt:
+```
+你是一个电商搜索结果相关性评估助手。请根据用户查询(query)和每个商品的信息,输出该商品的相关性等级。
+
+## 相关性等级标准
+Exact 完全相关 — 完全匹配用户搜索需求。
+Partial 部分相关 — 主意图满足(同品类或相近用途,基本上符合搜索意图),但次要属性(如颜色、风格、尺码等)跟用户需求有偏差或无法确认。
+Irrelevant 不相关 — 品类或用途不符,主诉求未满足。
+
+1. {title1} {option1_value1} {option2_value1} {option3_value1}
+2. {title2} {option1_value2} {option2_value2} {option3_value2}
+...
+50. {title50} {option1_value50} {option2_value50} {option3_value50}
+
+## 输出格式
+严格输出 {input_nums} 行,每行仅Exact / Partial / Irrelevant三者之一。按顺序对应上述 50 个商品。不要输出任何其他信息。
+```
+
+
+2. 测试集(结果标注)建立:
+@queries/queries.txt
+
+对其中每一个query:
+1. 召回:
+1)参考搜索接口,召回1k结果。
+2)遍历全库,得到每个spu的title,请求重排模型,进行全排序,得到top1w结果。注意重排模型打分一定要做缓存(本地文件缓存即可。query+title->rerank_score)。
+2. 对以上结果,拆分batch请求llm,进行结果标注。
+3. 请你思考如何存储结果,并利于以后的对比、使用、展示。
+
+
+3. 评估工具页面:
+请你设计一个搜索评估交互页面。端口6010。
+页面主题:上方是搜索框,如果发起搜索,那么下方给出本次结果的总体指标以及top100结果(允许翻页)。
+
+总体指标:
+| 指标 | 含义 |
+|------|------|
+| **P@5, P@10, P@20, P@50** | 前 K 个结果中「仅 3 相关」的精确率 |
+| **P@5_2_3 ~ P@50_2_3** | 前 K 个结果中「2 和 3 都算相关」的精确率 |
+| **MAP_3** | 仅 3 相关时的 Average Precision(单 query) |
+| **MAP_2_3** | 2 和 3 都相关时的 Average Precision |
+
+结果列表:
+按行排列,每行左侧给出该结果的标注值(三个等级,可对结果做颜色标记)并展示图片;图片旁边分三行展示 title.en 与首个sku的option1/2/3_value(这三行与左侧图片并列)。
+
+评测页面最左侧:
+queries默认是queries/queries.txt,填入左侧列表框,点击其中任何一个发起搜索。
+
+4. 批量评估工具
+
+给一个批量执行脚本。
+
+另外新增一个批量评估页面:点击批量评估按钮,对所有搜索词依次发起搜索,最后汇总总体评估指标,生成报告,报告名称带上时间标记和一些关键信息,并记录当时主搜索程序的config.yaml。
+你需要精心设计如何通过同一个端口承载并切换这两种不同交互模式(单query评估与批量评估)。
+批量评估关注的是所有搜索词总体的评估指标。
+需要记录测试时间、当时的配置文件以及对应的结果。要保存历次评估记录,并能查到每一次评估结果对应的配置文件以及相关的指标。
+
+以上是我的总体设计,但有不周全的地方。你要站在更高的层次理解我的需求,你有足够的自由可以适当调整设计,基于你所了解的自动化搜索评估框架的最佳实践,做出更优秀的设计和更好的实现。
\ No newline at end of file
-- 
libgit2 0.21.2