**Reference Materials:**
1. Search Interface:
```bash
export BASE_URL="${BASE_URL:-http://localhost:6002}"
export TENANT_ID="${TENANT_ID:-163}"  # Change to your tenant ID
```
```bash
curl -sS "$BASE_URL/search/" \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: $TENANT_ID" \
  -d '{
    "query": "Barbie doll",
    "size": 20,
    "from": 0,
    "language": "zh"
  }'
```

Sample response (truncated):
```
{
  "results": [
    {
      "spu_id": "12345",
      "title": "Barbie Fashion Doll",
      "image_url": "https://example.com/image.jpg",
      "specifications": [],
      "skus": [{"sku_id": "...
      ...
```
2. Reranking Service:
```bash
curl -X POST "http://localhost:6007/rerank" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "toy Barbie",
    "docs": ["12PCS 6 Types of Dolls with Bottles", "Cotton T-shirt Short Sleeve"],
    "top_n": 386,
    "normalize": true
  }'
```
3. Query by Specific Fields: `es_debug_search.py`
**Main Tasks:**
1. **Establish Evaluation Tooling:**
Note: result quality must be judged with a unified evaluation tool. Do not define per-query keyword-matching rules to decide relevance: this does not scale, is error-prone, and is hard to extend to other search terms.
Therefore, build a search result evaluation tool and a multi-result comparison tool, to be called by the annotation set construction tool below. Internally, these may call an LLM as the judge, with clear definitions of what counts as Exact (highly relevant), Partial (partially relevant), and Irrelevant.
Prompt:
```
You are an e-commerce search result relevance evaluation assistant. Based on the user query and each product's information, output the relevance level for each product.

## Relevance Level Criteria
Exact — Fully matches the user's search intent.
Partial — Primary intent satisfied (same category or similar use, basically aligned with the search intent), but secondary attributes (e.g., color, style, size) deviate from, or cannot be confirmed against, the user's needs.
Irrelevant — Category or use case mismatched; primary intent not satisfied.

## User Query
{query}

## Products
1. {title1} {option1_value1} {option2_value1} {option3_value1}
2. {title2} {option1_value2} {option2_value2} {option3_value2}
...
50. {title50} {option1_value50} {option2_value50} {option3_value50}

## Output Format
Strictly output {input_nums} lines, each line containing exactly one of Exact / Partial / Irrelevant, corresponding in order to the products listed above. Do not output any other information.
```
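Turning the judge's plain-text answer back into per-product levels might look like the sketch below. The `LEVELS` mapping and the function name are illustrative, not an existing API; the numeric levels (3 = Exact, 2 = Partial, 1 = Irrelevant) match the level numbering used in the metric definitions.

```python
# Sketch: parse the judge LLM's line-per-product output into numeric levels.
# All names here are illustrative assumptions, not an existing module.

LEVELS = {"Exact": 3, "Partial": 2, "Irrelevant": 1}

def parse_judgments(llm_output: str, expected_n: int) -> list[int]:
    """Parse one label per line; fail loudly on count or label mismatches,
    so a malformed LLM response triggers a retry instead of silent skew."""
    lines = [ln.strip() for ln in llm_output.strip().splitlines() if ln.strip()]
    if len(lines) != expected_n:
        raise ValueError(f"expected {expected_n} labels, got {len(lines)}")
    try:
        return [LEVELS[ln] for ln in lines]
    except KeyError as bad:
        raise ValueError(f"unexpected label: {bad}") from None
```

Failing loudly on a count mismatch matters because the labels are matched to products purely by position.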
2. **Test Set (Result Annotation) Construction:**
Source: `@queries/queries.txt`
For each query:
1. Retrieval:
- Use the search interface to retrieve 1k results.
- Traverse the entire product database, obtain the title of each SPU, call the reranking model to perform full ranking, and obtain the top 10k results. Note: Reranking model scores must be cached (local file cache is sufficient; key = query + title -> rerank_score).
2. For the above results, split into batches and call the LLM to annotate the results.
3. Consider how to store the results to facilitate future comparison, usage, and presentation.
3. **Evaluation Tool Web Page:**
Design an interactive search evaluation page served on port 6010.
Page layout: a search box at the top; when a search is issued, the area below shows overall metrics for the result set and the top 100 results (paginated).
Overall Metrics:
| Metric | Meaning |
|--------|---------|
| **P@5, P@10, P@20, P@50** | Precision at top K where only level 3 (Exact) counts as relevant |
| **P@5_2_3 ~ P@50_2_3** | Precision at top K where both level 2 (Partial) and level 3 (Exact) count as relevant |
| **MAP_3** | Mean Average Precision when only level 3 (Exact) is relevant (single query) |
| **MAP_2_3** | Mean Average Precision when both level 2 and level 3 are relevant |
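
These metrics reduce to a few lines of code. A sketch, assuming each ranked result carries a numeric label (3 = Exact, 2 = Partial, 1 = Irrelevant); function names are illustrative:

```python
def p_at_k(labels, k, min_level=3):
    """Precision@K: fraction of the top K results with label >= min_level.
    min_level=3 gives P@K (Exact only); min_level=2 gives P@K_2_3."""
    top = labels[:k]
    if not top:
        return 0.0
    return sum(1 for lvl in top if lvl >= min_level) / len(top)

def average_precision(labels, min_level=3):
    """AP for one ranked list; MAP is the mean of this over all queries.
    Normalised by relevant items retrieved, since the total number of
    relevant items in the corpus is unknown in this setting."""
    hits, score = 0, 0.0
    for rank, lvl in enumerate(labels, start=1):
        if lvl >= min_level:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0
```

MAP_3 and MAP_2_3 are then the means of `average_precision(labels, 3)` and `average_precision(labels, 2)` across all queries.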
Results List:
Displayed row by row. Each row shows the annotation level on the left (three levels, color-coded), then the product image, then the title (en) and the first SKU's option1/2/3 values shown on three lines, level with the image to their left.
Leftmost part of the evaluation page:
Queries default to `queries/queries.txt`. Populate them in a list box on the left. Click any query to trigger its search.
4. **Batch Evaluation Tool:**
Provide a batch execution script.
Additionally, create a batch evaluation page: clicking a "Batch Evaluation" button sequentially searches all queries, aggregates the overall metrics, and generates a report. The report name should include a timestamp and key identifying information, and the main search program's `config.yaml` at that time should be recorded alongside it.
Carefully design how the two modes (single-query evaluation vs batch evaluation) are switched on the same port, each with its own interactive content.
Batch evaluation focuses on the metrics aggregated across all search terms.
It must record the evaluation timestamp, the corresponding configuration file, and the results. All historical evaluation records should be kept, and each evaluation result should link back to its configuration file and associated metrics.
The above is my overall design, but it may have gaps. Understand my requirements at a higher level: you have sufficient freedom to adjust the design, drawing on best practices from automated search evaluation frameworks, to produce a superior design and implementation.