scripts/evaluation/README.md 5.48 KB
267920e5   tangwang   eval docs

  **Reference Materials:**
  1. Search Interface:
  
  ```bash
  export BASE_URL="${BASE_URL:-http://localhost:6002}"
  export TENANT_ID="${TENANT_ID:-163}"   # Change to your tenant ID
  curl -sS "$BASE_URL/search/" \
    -H "Content-Type: application/json" \
    -H "X-Tenant-ID: $TENANT_ID" \
    -d '{
      "query": "Barbie doll",
      "size": 20,
      "from": 0,
      "language": "zh"
    }'
  ```
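For scripting against the interface, the same request can be built in Python with only the standard library. This is a sketch; the endpoint, headers, and payload simply mirror the curl example above:

```python
import json
import urllib.request

def build_search_request(base_url: str, tenant_id: int, query: str,
                         size: int = 20, offset: int = 0,
                         language: str = "zh") -> urllib.request.Request:
    """Build the POST request used by the search interface above."""
    payload = {"query": query, "size": size, "from": offset, "language": language}
    return urllib.request.Request(
        f"{base_url}/search/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "X-Tenant-ID": str(tenant_id)},
    )

# To actually issue the call (requires a running service):
# resp = urllib.request.urlopen(build_search_request("http://localhost:6002", 163, "Barbie doll"))
```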
  
  Response (truncated):

  ```json
  {
    "results": [
      {
        "spu_id": "12345",
        "title": "Barbie Fashion Doll",
        "image_url": "https://example.com/image.jpg",
        "specifications": [],
        "skus": [{"sku_id": "...
  ...
  ```

  2. Reranking Service:
  ```bash
  curl -X POST "http://localhost:6007/rerank" \
    -H "Content-Type: application/json" \
    -d '{
      "query": "toy Barbie",
      "docs": ["12PCS 6 Types of Dolls with Bottles", "Cotton T-shirt Short Sleeve"],
      "top_n": 386,
      "normalize": true
    }'
  ```
  
  3. Query by Specific Fields: `es_debug_search.py`
  
  **Main Tasks:**
  
  1. **Establish Evaluation Tooling:**  
     Note: result quality must be judged by a single, unified evaluation tool. Do not define per-query keyword-matching rules to decide relevance; that approach does not scale, is prone to misjudgment, and is difficult to extend to other search terms.  
     Therefore, build a search result evaluation tool and a multi-result comparison tool, both callable by the later annotation-set construction tool. Internally, the implementation may call an LLM to judge relevance, with clear definitions of what counts as highly relevant, partially relevant, and irrelevant.
  
     Prompt:
     ```
     You are an e-commerce search result relevance evaluation assistant. Based on the user query and each product's information, output the relevance level for the product.
  
     ## Relevance Level Criteria
     Exact — Fully matches the user's search intent.
     Partial — Primary intent satisfied (same category or similar use, basically aligns with search intent), but secondary attributes (e.g., color, style, size) deviate from or cannot be confirmed against user needs.
     Irrelevant — Category or use case mismatched, primary intent not satisfied.
  
     1. {title1} {option1_value1} {option2_value1} {option3_value1}
     2. {title2} {option1_value2} {option2_value2} {option3_value2}
     ...
     50. {title50} {option1_value50} {option2_value50} {option3_value50}
  
     ## Output Format
     Strictly output {input_nums} lines, each line containing exactly one of Exact / Partial / Irrelevant, corresponding in order to the {input_nums} products above. Do not output any other information.
     ```
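A minimal sketch of how the annotation tool could wrap this prompt. `call_llm` is a hypothetical stand-in for whatever LLM client is used, the `## Query` section is an addition the full prompt would need, and `PROMPT_HEADER` abbreviates the full prompt text above:

```python
from typing import Callable, List

# Numeric mapping used by the metrics: 3 = Exact, 2 = Partial, 1 = Irrelevant.
LEVELS = {"Exact": 3, "Partial": 2, "Irrelevant": 1}

PROMPT_HEADER = (
    "You are an e-commerce search result relevance evaluation assistant. "
    "Based on the user query and each product's information, output the "
    "relevance level for the product.\n"
)

def judge_relevance(query: str, products: List[str],
                    call_llm: Callable[[str], str]) -> List[int]:
    """Label each product line for `query`; return numeric levels (3/2/1)."""
    numbered = "\n".join(f"{i}. {p}" for i, p in enumerate(products, 1))
    prompt = (f"{PROMPT_HEADER}\n## Query\n{query}\n\n{numbered}\n\n"
              f"## Output Format\nStrictly output {len(products)} lines, "
              f"each containing exactly one of Exact / Partial / Irrelevant.")
    labels = [ln.strip() for ln in call_llm(prompt).splitlines() if ln.strip()]
    if len(labels) != len(products):
        raise ValueError(f"expected {len(products)} labels, got {len(labels)}")
    return [LEVELS[label] for label in labels]
```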
  
  2. **Test Set (Result Annotation) Construction:**  
     Source: `@queries/queries.txt`
  
     For each query:
     1. Retrieval:
        - Use the search interface to retrieve the top 1k results.
        - Traverse the entire product database, obtain each SPU's title, call the reranking model to rank them all, and take the top 10k. Note: reranking model scores must be cached (a local file cache is sufficient; key = query + title -> rerank_score).
     2. Split the combined results into batches and call the LLM to annotate them.
     3. Decide how to store the annotations so they are easy to compare, reuse, and present later.
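The caching requirement in step 1 can be sketched as follows. Persisting to a JSON file is an assumption for illustration; any local key-value store (sqlite, pickle) would serve equally well:

```python
import json
import os
from typing import Callable

class RerankCache:
    """Local file cache for reranker scores, keyed by query + title."""

    def __init__(self, path: str = "rerank_cache.json"):
        self.path = path
        self.cache = {}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.cache = json.load(f)

    def score(self, query: str, title: str,
              compute: Callable[[str, str], float]) -> float:
        """Return the cached score, calling `compute` only on a miss."""
        key = f"{query}\x1f{title}"   # unit separator avoids key collisions
        if key not in self.cache:
            self.cache[key] = compute(query, title)
        return self.cache[key]

    def save(self) -> None:
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.cache, f, ensure_ascii=False)
```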
  
  3. **Evaluation Tool Web Page:**  
     Design an interactive search evaluation page served on port 6010.  
     Page layout: a search box at the top. When a search is issued, the area below shows overall metrics for the result set and the top 100 results (with pagination).
  
     Overall Metrics:
     | Metric | Meaning |
     |--------|---------|
     | **P@5, P@10, P@20, P@50** | Precision at top K, counting only level 3 (Exact) as relevant |
     | **P@5_2_3 ~ P@50_2_3** | Precision at top K, counting both level 2 (Partial) and level 3 (Exact) as relevant |
     | **MAP_3** | Average Precision per query, counting only level 3 (Exact) as relevant; averaged across queries in batch mode |
     | **MAP_2_3** | Average Precision per query, counting both level 2 and level 3 as relevant |
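The table's metrics reduce to two functions over the per-result annotation levels (3 = Exact, 2 = Partial, 1 = Irrelevant). A sketch, with `threshold=3` giving the `*_3` variants and `threshold=2` the `*_2_3` variants; MAP is the mean of `average_precision` over all queries, and normalizing AP by the number of relevant results retrieved is one common convention:

```python
from typing import List

def precision_at_k(levels: List[int], k: int, threshold: int = 3) -> float:
    """Fraction of the top-k results whose level meets the threshold."""
    top = levels[:k]
    if not top:
        return 0.0
    return sum(lv >= threshold for lv in top) / k

def average_precision(levels: List[int], threshold: int = 3) -> float:
    """AP over one ranked result list, normalized by relevant hits found."""
    hits, total = 0, 0.0
    for i, lv in enumerate(levels, 1):
        if lv >= threshold:
            hits += 1
            total += hits / i   # precision at each relevant position
    return total / hits if hits else 0.0
```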
  
     Results List:  
     Displayed row by row. On the left of each row, show the annotation value (one of the three levels, color-coded). Next to it, show the product image, then the English title and the first SKU's option1/2/3 values on three lines, aligned horizontally with the image.
  
     Leftmost part of the evaluation page:  
     Queries default to `queries/queries.txt`. Populate them in a list box on the left. Click any query to trigger its search.
  
  4. **Batch Evaluation Tool:**  
     Provide a batch execution script.  
     Also create a batch evaluation page: clicking a "Batch Evaluation" button searches all queries in sequence, aggregates the overall metrics, and generates a report. The report name should include a timestamp and key identifying information, and the main search program's `config.yaml` at that time should be recorded with it.  
     Carefully design how the two modes (single-query evaluation vs. batch evaluation) share the same port while presenting different interactive content.  
     Batch evaluation focuses on metrics aggregated across all search terms. Save every historical evaluation record, so that each result can be traced back to its timestamp, configuration file, and associated metrics.
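One way to satisfy the record-keeping requirement above: give each run a timestamped directory holding the metrics report and a snapshot of `config.yaml`, so any historical result can be traced back to its configuration. The directory layout, file names, and metrics-dict shape here are assumptions, not fixed by this design:

```python
import json
import shutil
import time
from pathlib import Path

def save_run(metrics: dict, config_path: str,
             history_dir: str = "eval_history", tag: str = "batch") -> Path:
    """Persist one evaluation run: metrics report plus the config snapshot."""
    stamp = time.strftime("%Y%m%d_%H%M%S")
    run_dir = Path(history_dir) / f"{stamp}_{tag}"
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "report.json").write_text(
        json.dumps(metrics, ensure_ascii=False, indent=2), encoding="utf-8")
    shutil.copy(config_path, run_dir / "config.yaml")  # config at eval time
    return run_dir
```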
  
  The above is my overall design, but there may be gaps. You should understand my requirements at a higher level. You have sufficient freedom to adjust the design appropriately, drawing on best practices in automated search evaluation frameworks, to produce a superior design and implementation.