scripts/evaluation/README.md 5.48 KB
267920e5   tangwang   eval docs

  **Reference Materials:**
  1. Search Interface:
  
  ```bash
  export BASE_URL="${BASE_URL:-http://localhost:6002}"
  export TENANT_ID="${TENANT_ID:-163}"   # Change to your tenant ID
  curl -sS "$BASE_URL/search/" \
    -H "Content-Type: application/json" \
    -H "X-Tenant-ID: $TENANT_ID" \
    -d '{
      "query": "Barbie doll",
      "size": 20,
      "from": 0,
      "language": "zh"
    }'
  ```
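For scripting against the interface, the same request can be built in Python with only the standard library. This is a sketch; the endpoint, headers, and payload simply mirror the curl example above:

```python
import json
import urllib.request

def build_search_request(base_url: str, tenant_id: int, query: str,
                         size: int = 20, offset: int = 0,
                         language: str = "zh") -> urllib.request.Request:
    """Build the POST request used by the search interface above."""
    payload = {"query": query, "size": size, "from": offset, "language": language}
    return urllib.request.Request(
        f"{base_url}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "X-Tenant-ID": str(tenant_id)},
    )

# To actually issue the call (requires a running service):
# resp = urllib.request.urlopen(build_search_request("http://localhost:6002", 163, "Barbie doll"))
```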
  
  Response (truncated):

  ```json
  {
    "results": [
      {
        "spu_id": "12345",
        "title": "Barbie Fashion Doll",
        "image_url": "https://example.com/image.jpg",
        "specifications": [],
        "skus": [{"sku_id": "...
  ...
  ```

  2. Reranking Service:
  ```bash
  curl -X POST "http://localhost:6007/rerank" \
    -H "Content-Type: application/json" \
    -d '{
      "query": "toy Barbie",
      "docs": ["12PCS 6 Types of Dolls with Bottles", "Cotton T-shirt Short Sleeve"],
      "top_n": 386,
      "normalize": true
    }'
  ```
  
  3. Query by Specific Fields: `es_debug_search.py`
  
  **Main Tasks:**
  
  1. **Establish Evaluation Tooling:**  
     Note: result quality must be judged by a single, unified evaluation tool. Do not define per-query keyword-matching rules to decide relevance; that approach does not scale, is prone to misjudgment, and is difficult to extend to other search terms.  
     Therefore, build a search result evaluation tool and a multi-result comparison tool, both callable by the later annotation-set construction tool. Internally, the implementation may call an LLM to judge relevance, with clear definitions of what counts as highly relevant, partially relevant, and irrelevant.
  
     Prompt:
     ```
     You are an e-commerce search result relevance evaluation assistant. Based on the user query and each product's information, output the relevance level for the product.
  
     ## Relevance Level Criteria
     Exact — Fully matches the user's search intent.
     Partial — Primary intent satisfied (same category or similar use, basically aligns with search intent), but secondary attributes (e.g., color, style, size) deviate from or cannot be confirmed against user needs.
     Irrelevant — Category or use case mismatched, primary intent not satisfied.
  
     1. {title1} {option1_value1} {option2_value1} {option3_value1}
     2. {title2} {option1_value2} {option2_value2} {option3_value2}
     ...
     50. {title50} {option1_value50} {option2_value50} {option3_value50}
  
     ## Output Format
     Strictly output {input_nums} lines, each line containing exactly one of Exact / Partial / Irrelevant, corresponding in order to the {input_nums} products above. Do not output any other information.
     ```
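A minimal sketch of how the annotation tool could wrap this prompt. `call_llm` is a hypothetical stand-in for whatever LLM client is used, the `## Query` section is an addition the full prompt would need, and `PROMPT_HEADER` abbreviates the full prompt text above:

```python
from typing import Callable, List

# Numeric mapping used by the metrics: 3 = Exact, 2 = Partial, 1 = Irrelevant.
LEVELS = {"Exact": 3, "Partial": 2, "Irrelevant": 1}

PROMPT_HEADER = (
    "You are an e-commerce search result relevance evaluation assistant. "
    "Based on the user query and each product's information, output the "
    "relevance level for the product.\n"
)

def judge_relevance(query: str, products: List[str],
                    call_llm: Callable[[str], str]) -> List[int]:
    """Label each product line for `query`; return numeric levels (3/2/1)."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(products, 1))
    prompt = (f"{PROMPT_HEADER}\n## Query\n{query}\n\n{numbered}\n\n"
              f"## Output Format\nStrictly output {len(products)} lines, "
              f"each containing exactly one of Exact / Partial / Irrelevant.")
    labels = [ln.strip() for ln in call_llm(prompt).splitlines() if ln.strip()]
    if len(labels) != len(products):
        raise ValueError(f"expected {len(products)} labels, got {len(labels)}")
    return [LEVELS[label] for label in labels]
```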
  
  2. **Test Set (Result Annotation) Construction:**  
     Source: `@queries/queries.txt`
  
     For each query:
     1. Retrieval:
        - Use the search interface to retrieve the top 1k results.
        - Traverse the entire product database, obtain each SPU's title, call the reranking model to rank them all, and take the top 10k. Note: reranking model scores must be cached (a local file cache is sufficient; key = query + title -> rerank_score).
     2. Split the combined results into batches and call the LLM to annotate them.
     3. Decide how to store the annotations so they are easy to compare, reuse, and present later.
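The caching requirement in step 1 can be sketched as follows. Persisting to a JSON file is an assumption for illustration; any local key-value store (sqlite, pickle) would serve equally well:

```python
import json
import os
from typing import Callable

class RerankCache:
    """Local file cache for reranker scores, keyed by query + title."""

    def __init__(self, path: str = "rerank_cache.json"):
        self.path = path
        self.cache = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.cache = json.load(f)

    def score(self, query: str, title: str,
              compute: Callable[[str, str], float]) -> float:
        """Return the cached score, calling `compute` only on a miss."""
        key = f"{query}\x1f{title}"   # unit separator avoids key collisions
        if key not in self.cache:
            self.cache[key] = compute(query, title)
        return self.cache[key]

    def save(self) -> None:
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.cache, f, ensure_ascii=False)
```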
  
  3. **Evaluation Tool Web Page:**  
     Design an interactive search evaluation page served on port 6010.  
     Page layout: a search box at the top. When a search is issued, the area below shows overall metrics for the result set and the top 100 results (with pagination).
  
     Overall Metrics:
     | Metric | Meaning |
     |--------|---------|
     | **P@5, P@10, P@20, P@50** | Precision at top K, counting only level 3 (Exact) as relevant |
     | **P@5_2_3 ~ P@50_2_3** | Precision at top K, counting both level 2 (Partial) and level 3 (Exact) as relevant |
     | **MAP_3** | Average Precision per query, counting only level 3 (Exact) as relevant; averaged across queries in batch mode |
     | **MAP_2_3** | Average Precision per query, counting both level 2 and level 3 as relevant |
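The table's metrics reduce to two functions over the per-result annotation levels (3 = Exact, 2 = Partial, 1 = Irrelevant). A sketch, with `threshold=3` giving the `*_3` variants and `threshold=2` the `*_2_3` variants; MAP is the mean of `average_precision` over all queries, and normalizing AP by the number of relevant results retrieved is one common convention:

```python
from typing import List

def precision_at_k(levels: List[int], k: int, threshold: int = 3) -> float:
    """Fraction of the top-k results whose level meets the threshold."""
    top = levels[:k]
    if not top:
        return 0.0
    return sum(lv >= threshold for lv in top) / k

def average_precision(levels: List[int], threshold: int = 3) -> float:
    """AP over one ranked result list, normalized by relevant hits found."""
    hits, total = 0, 0.0
    for i, lv in enumerate(levels, 1):
        if lv >= threshold:
            hits += 1
            total += hits / i   # precision at each relevant position
    return total / hits if hits else 0.0
```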
  
     Results List:  
     Displayed row by row. On the left of each row, show the annotation value (one of the three levels, color-coded). Next to it, show the product image, then the English title and the first SKU's option1/2/3 values on three lines, aligned horizontally with the image.
  
     Leftmost part of the evaluation page:  
     Queries default to `queries/queries.txt`. Populate them in a list box on the left. Click any query to trigger its search.
  
  4. **Batch Evaluation Tool:**  
     Provide a batch execution script.  
     Also create a batch evaluation page: clicking a "Batch Evaluation" button searches all queries in sequence, aggregates the overall metrics, and generates a report. The report name should include a timestamp and key identifying information, and the main search program's `config.yaml` at that time should be recorded with it.  
     Carefully design how the two modes (single-query evaluation vs. batch evaluation) share the same port while presenting different interactive content.  
     Batch evaluation focuses on metrics aggregated across all search terms. Save every historical evaluation record, so that each result can be traced back to its timestamp, configuration file, and associated metrics.
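One way to satisfy the record-keeping requirement above: give each run a timestamped directory holding the metrics report and a snapshot of `config.yaml`, so any historical result can be traced back to its configuration. The directory layout, file names, and metrics-dict shape here are assumptions, not fixed by this design:

```python
import json
import shutil
import time
from pathlib import Path

def save_run(metrics: dict, config_path: str,
             history_dir: str = "eval_history", tag: str = "batch") -> Path:
    """Persist one evaluation run: metrics report plus the config snapshot."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    run_dir = Path(history_dir) / f"{stamp}_{tag}"
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "report.json").write_text(
        json.dumps(metrics, ensure_ascii=False, indent=2), encoding="utf-8")
    shutil.copy(config_path, run_dir / "config.yaml")  # config at eval time
    return run_dir
```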
  
  The above is my overall design, but there may be gaps. You should understand my requirements at a higher level. You have sufficient freedom to adjust the design appropriately, drawing on best practices in automated search evaluation frameworks, to produce a superior design and implementation.