Commit acf1349c2b29f77dcd2a26a4118ea89db088d8a8

Authored by tangwang
1 parent ca91352a

Script for generating fake bulk-import data (multi-variant)

Script: scripts/csv_to_excel_multi_variant.py

Main features:

Single-variant products (S type): 30%
- Product attribute is S
- option1/option2/option3 are left blank
- The row carries all product information (title, description, price, inventory, etc.)

Multi-variant products (M+P type): 70%
- M row (product body):
  - Product attribute is M
  - Carries the product-level information (title, description, SEO, category, etc.)
  - option1="color", option2="size", option3="material"
  - Price, inventory, SKU, and other variant-level fields are left blank
- P rows (variants):
  - Product attribute is P
  - Product title matches the M row
  - option1/2/3 hold concrete values (the Cartesian product of color, size, material)
  - Each SKU gets its own price, inventory, SKU code, etc.

Multi-variant generation rules (see the sketch after this list):
- Color: 2-10 colors sampled at random from the 30 predefined color names
- Size: 4-8 sizes sampled at random from 1-30
- Material: the last whitespace-separated token of the product title (special characters stripped)
- Cartesian product: one P row is generated per combination (e.g., 3 colors × 5 sizes × 1 material = 15 SKUs)
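
As a quick reference, here is a minimal sketch of the variant-generation rule described above. The color list is a simplified stand-in (the script defines 30 named colors such as Red and Blue); see scripts/csv_to_excel_multi_variant.py below for the real implementation.

```python
import itertools
import random

# Simplified stand-in for the script's 30 predefined color names.
COLORS = [f"Color{i}" for i in range(1, 31)]

def build_variants(title: str) -> list:
    """Return one (color, size, material) tuple per P row / SKU."""
    colors = random.sample(COLORS, random.randint(2, 10))        # 2-10 colors
    sizes = random.sample([str(i) for i in range(1, 31)],
                          random.randint(4, 8))                  # 4-8 sizes
    parts = title.strip().split()
    material = parts[-1] if parts else "default"                 # last token of the title
    return list(itertools.product(colors, sizes, [material]))

variants = build_variants("fire extinguisher set plastic")
print(len(variants))  # e.g., 5 colors x 6 sizes x 1 material = 30 SKUs
```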
CLAUDE.md
... ... @@ -4,18 +4,25 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
4 4  
5 5 ## Project Overview
6 6  
7   -This is a **Search Engine SaaS** project for e-commerce product search, designed for Shoplazza (店匠) independent sites. The system provides Elasticsearch-based product search capabilities with multi-language support, text/image embeddings, and configurable ranking.
  7 +This is a **Search Engine SaaS** platform for e-commerce product search, specifically designed for Shoplazza (店匠) independent sites. It's a multi-tenant configurable search system built on Elasticsearch with AI-powered search capabilities.
8 8  
9 9 **Tech Stack:**
10   -- Elasticsearch as the search engine backend
  10 +- Elasticsearch 8.x as the search engine backend
11 11 - MySQL (Shoplazza database) as the primary data source
12   -- Python for data processing and ingestion
  12 +- Python 3.10 with PyTorch/CUDA support
13 13 - BGE-M3 model for text embeddings (1024-dim vectors)
14 14 - CN-CLIP (ViT-H-14) for image embeddings
  15 +- FastAPI for REST API layer
15 16  
16   -## Database Configuration
  17 +## Development Environment
17 18  
18   -**Shoplazza Production Database:**
  19 +**Required Environment Setup:**
  20 +```bash
  21 +source /home/tw/miniconda3/etc/profile.d/conda.sh
  22 +conda activate searchengine
  23 +```
  24 +
  25 +**Database Configuration:**
19 26 ```
20 27 host: 120.79.247.228
21 28 port: 3316
... ... @@ -24,85 +31,217 @@ username: saas
24 31 password: P89cZHS5d7dFyc9R
25 32 ```
26 33  
27   -**Main Tables:**
28   -- `shoplazza_product_sku` - SKU level product data
29   -- `shoplazza_product_spu` - SPU level product data
  34 +## Common Development Commands
30 35  
31   -## Architecture
  36 +### Environment Setup
  37 +```bash
  38 +# Complete environment setup
  39 +./setup.sh
32 40  
33   -### Data Flow
34   -1. **Data Source (MySQL)** → Main tables (`shoplazza_product_sku`, `shoplazza_product_spu`) + tenant extension tables
35   -2. **Indexer** → Reads from MySQL, applies transformations (embeddings, etc.), writes to Elasticsearch
36   -3. **Query Parser** → Query rewriting, translation, text embedding conversion
37   -4. **Searcher** → Executes searches against Elasticsearch with configurable ranking
  41 +# Install Python dependencies
  42 +pip install -r requirements.txt
  43 +```
38 44  
39   -### Multi-Tenant Design
40   -Each tenant has their own extension table to store custom attributes, multi-language fields (titles, brand names, tags, categories), and business-specific metadata. The main SKU table is joined with tenant extension tables during indexing.
  45 +### Data Management
  46 +```bash
  47 +# Generate test data (Tenant1 Mock + Tenant2 CSV)
  48 +./scripts/mock_data.sh
41 49  
42   -### Configuration System
  50 +# Ingest data to Elasticsearch
  51 +./scripts/ingest.sh <tenant_id> [recreate] # e.g., ./scripts/ingest.sh 1 true
  52 +python main.py ingest data.csv --limit 1000 --batch-size 50
  53 +```
43 54  
44   -The system uses two types of configurations per tenant:
  55 +### Running Services
  56 +```bash
  57 +# Start all services (production)
  58 +./run.sh
45 59  
46   -1. **Application Structure Config** (`IndexerConfig`) - Defines:
47   - - Input field mappings from MySQL to Elasticsearch
48   - - Field types (TEXT, EMBEDDING, LITERAL, INT, DOUBLE, etc.)
49   - - Which fields require preprocessing (embeddings, transformations)
  60 +# Start development server with auto-reload
  61 +./scripts/start_backend.sh
  62 +python main.py serve --host 0.0.0.0 --port 6002 --reload
50 63  
51   -2. **Index Structure Config** - Defines:
52   - - Elasticsearch field mappings and analyzers
53   - - Supported analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
54   - - Query domain definitions (default, category_name, title, brand_name, etc.)
55   - - BM25 parameters and similarity configurations
  64 +# Start frontend debugging UI
  65 +./scripts/start_frontend.sh
  66 +```
56 67  
57   -### Query Processing
  68 +### Testing
  69 +```bash
  70 +# Run all tests
  71 +python -m pytest tests/
58 72  
59   -The `queryParser` performs:
60   -1. **Query Rewriting** - Dictionary-based rewriting (brand terms, category terms, synonyms, corrections)
61   -2. **Translation** - Language detection and translation to support multi-language search (e.g., zh↔en)
62   -3. **Text Embedding** - Converts query text to vectors when vector search is enabled
  73 +# Run specific test types
  74 +python -m pytest tests/unit/ # Unit tests
  75 +python -m pytest tests/integration/ # Integration tests
  76 +python -m pytest -m "api" # API tests only
63 77  
64   -### Search and Ranking
  78 +# Test search from command line
  79 +python main.py search "query" --tenant-id 1 --size 10
  80 +```
65 81  
66   -The `searcher` supports:
67   -- Boolean operators: AND, OR, RANK, ANDNOT with parentheses
68   -- Operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
69   -- Configurable ranking expressions for the `default` domain:
70   - - Example: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`
71   - - Combines BM25 text relevance, embedding similarity, product scores, and time decay
  82 +### Development Utilities
  83 +```bash
  84 +# Stop all services
  85 +./scripts/stop.sh
72 86  
73   -### Embedding Modules
  87 +# Test environment (for CI/development)
  88 +./scripts/start_test_environment.sh
  89 +./scripts/stop_test_environment.sh
74 90  
75   -**Text Embedding** - Uses BGE-M3 model (`Xorbits/bge-m3`):
76   -- Singleton pattern with thread-safe initialization
77   -- Generates 1024-dimensional vectors
78   -- Configured for GPU/CUDA acceleration
  91 +# Install server dependencies
  92 +./scripts/install_server_deps.sh
  93 +```
79 94  
80   -**Image Embedding** - Uses CN-CLIP model (ViT-H-14):
81   -- Downloads and validates images from URLs
82   -- Preprocesses images (resize, RGB conversion)
83   -- Generates 1024-dimensional vectors
84   -- Supports both local and remote images
  95 +## Architecture Overview
  96 +
  97 +### Core Components
  98 +```
  99 +/data/tw/SearchEngine/
  100 +├── api/ # FastAPI REST API service (port 6002)
  101 +├── config/ # Configuration management system
  102 +├── indexer/ # MySQL → Elasticsearch data pipeline
  103 +├── search/ # Search engine and ranking logic
  104 +├── query/ # Query parsing, translation, rewriting
  105 +├── embeddings/ # ML models (BGE-M3, CN-CLIP)
  106 +├── scripts/ # Automation and utility scripts
  107 +├── utils/ # Shared utilities (ES client, etc.)
  108 +├── frontend/ # Simple debugging UI
  109 +├── mappings/ # Elasticsearch index mappings
  110 +└── tests/ # Unit and integration tests
  111 +```
  112 +
  113 +### Data Flow Architecture
  114 +**Pipeline**: MySQL → Indexer → Elasticsearch → API → Frontend
  115 +
  116 +1. **Data Source Layer**:
  117 + - Shoplazza MySQL database with `shoplazza_product_sku` and `shoplazza_product_spu` tables
  118 + - Tenant-specific extension tables for custom attributes and multi-language fields
  119 +
  120 +2. **Indexing Layer** (`indexer/`):
  121 + - Reads from MySQL, applies transformations with embeddings
  122 + - Uses `DataTransformer` and `IndexingPipeline` for batch processing
  123 + - Supports both full and incremental indexing with embedding caching
  124 +
  125 +3. **Query Processing Layer** (`query/`):
  126 + - `QueryParser`: Handles query rewriting, translation, and text embedding conversion
  127 + - Multi-language support with automatic detection and translation
  128 + - Boolean logic parsing with operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK`
  129 +
  130 +4. **Search Engine Layer** (`search/`):
  131 + - `Searcher`: Executes hybrid searches combining BM25 and dense vectors
  132 + - Configurable ranking expressions with function_score support
  133 + - Multi-tenant isolation via `tenant_id` field
  134 +
  135 +5. **API Layer** (`api/`):
  136 + - FastAPI service on port 6002 with multi-tenant support
  137 + - Text search: `POST /search/`
  138 + - Image search: `POST /image-search/`
  139 + - Tenant identification via `X-Tenant-ID` header
85 140  
86   -## Test Data
87   -、
88   -**Tenant1 Test Dataset:**
89   -- Location: `data/tenant1/goods_with_pic.5years_congku.csv.shuf.1w`
90   -- Contains 10,000 shuffled product records with images
91   -- Processing script: `data/tenant1/task2_process_goods.py`
92   - - Extracts product data from MySQL
93   - - Maps images from filebank database
94   - - Creates inverted index (URL → SKU list)
  141 +### Multi-Tenant Configuration System
95 142  
96   -## Key Implementation Notes
  143 +The system uses centralized configuration through `config/config.yaml`:
97 144  
98   -1. **Data Sync:** Full data sync from MySQL to Elasticsearch is handled by a separate Java project (not in this repo). This repo may include a simple full-load implementation for testing purposes.
  145 +1. **Field Configuration** (`config/field_types.py`):
  146 + - Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc.
  147 + - Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
  148 + - Required fields and preprocessing rules
99 149  
100   -2. **Extension Tables:** When designing tenant configurations, determine which fields exist in the main SKU table vs. which need to be added to tenant-specific extension tables.
  150 +2. **Index Configuration** (`mappings/search_products.json`):
  151 + - Unified index structure shared by all tenants
  152 + - Elasticsearch field mappings and analyzer configurations
  153 + - BM25 similarity with modified parameters (`b=0.0, k1=0.0`)
101 154  
102   -3. **Embedding Caching:** For periodic full indexing, embedding results should be cached to avoid recomputation.
  155 +3. **Query Configuration** (`search/query_config.py`):
  156 + - Query domain definitions (default, category_name, title, brand_name, etc.)
  157 + - Ranking expressions and function_score configurations
  158 + - Translation and embedding settings
  159 +
  160 +### Embedding Models
  161 +
  162 +**Text Embedding** (`embeddings/bge_encoder.py`):
  163 +- Uses BGE-M3 model (`Xorbits/bge-m3`)
  164 +- Singleton pattern with thread-safe initialization
  165 +- Generates 1024-dimensional vectors with GPU/CUDA support
  166 +- Configurable caching to avoid recomputation
103 167  
104   -4. **ES Similarity Configuration:** All text fields use modified BM25 with `b=0.0, k1=0.0` as the default similarity.
  168 +**Image Embedding** (`embeddings/clip_encoder.py`):
  169 +- Uses CN-CLIP model (ViT-H-14)
  170 +- Downloads and preprocesses images from URLs
  171 +- Supports both local and remote image processing
  172 +- Generates 1024-dimensional vectors
  173 +
  174 +### Search and Ranking
105 175  
106   -5. **Multi-Language Support:** The system is designed for cross-border e-commerce with at minimum Chinese and English support, with extensibility for other languages (Arabic, Spanish, Russian, Japanese).
107   -- Remember this project's environment: source /home/tw/miniconda3/etc/profile.d/conda.sh && conda activate searchengine
  176 +**Hybrid Search Approach**:
  177 +- Combines traditional BM25 text relevance with dense vector similarity
  178 +- Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP)
  179 +- Configurable ranking expressions like: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)`
  180 +
  181 +**Boolean Search Support**:
  182 +- Full boolean logic with AND, OR, ANDNOT, RANK operators
  183 +- Parentheses for complex query structures
  184 +- Configurable operator precedence
  185 +
  186 +**Faceted Search**:
  187 +- Terms and range faceting support
  188 +- Multi-dimensional filtering capabilities
  189 +- Configurable facet fields and aggregations
  190 +
  191 +## Testing Infrastructure
  192 +
  193 +**Test Framework**: pytest with async support
  194 +
  195 +**Test Structure**:
  196 +- `tests/conftest.py`: Comprehensive test fixtures and configuration
  197 +- `tests/unit/`: Unit tests for individual components
  198 +- `tests/integration/`: Integration tests for system workflows
  199 +- Test markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.api`
  200 +
  201 +**Test Data**:
  202 +- Tenant1: Mock data with 10,000 product records
  203 +- Tenant2: CSV-based test dataset
  204 +- Automated test data generation via `scripts/mock_data.sh`
  205 +
  206 +**Key Test Fixtures** (from `conftest.py`):
  207 +- `sample_search_config`: Complete configuration for testing
  208 +- `mock_es_client`: Mocked Elasticsearch client
  209 +- `test_searcher`: Searcher instance with mock dependencies
  210 +- `temp_config_file`: Temporary YAML configuration for tests
  211 +
  212 +## API Endpoints
  213 +
  214 +**Main API** (FastAPI on port 6002):
  215 +- `POST /search/` - Text search with multi-language support
  216 +- `POST /image-search/` - Image search using CN-CLIP embeddings
  217 +- Health check and management endpoints
  218 +- Multi-tenant support via `X-Tenant-ID` header
  219 +
  220 +**API Features**:
  221 +- Hybrid search combining text and vector similarity
  222 +- Configurable ranking and filtering
  223 +- Faceted search with aggregations
  224 +- Multi-language query processing and translation
  225 +- Real-time search with configurable result sizes
  226 +
  227 +## Key Implementation Details
  228 +
  229 +1. **Environment Variables**: All sensitive configuration stored in `.env` (template: `.env.example`)
  230 +2. **Configuration Management**: Dynamic config loading through `config/config_loader.py`
  231 +3. **Error Handling**: Comprehensive error handling with proper HTTP status codes
  232 +4. **Performance**: Batch processing for indexing, embedding caching, and connection pooling
  233 +5. **Logging**: Structured logging with request tracing for debugging
  234 +6. **Security**: Tenant isolation at the index level with proper access controls
  235 +
  236 +## Database Tables
  237 +
  238 +**Main Tables**:
  239 +- `shoplazza_product_sku` - SKU level product data with pricing and inventory
  240 +- `shoplazza_product_spu` - SPU level product data with categories and attributes
  241 +- Tenant extension tables for custom fields and multi-language content
  242 +
  243 +**Data Processing**:
  244 +- Full data sync handled by separate Java project (not in this repo)
  245 +- This repository includes test implementations for development and debugging
  246 +- Extension tables joined with main tables during indexing process
108 247  
... ...
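
Editor's note on the query processing documented above: the boolean precedence `()` > `ANDNOT` > `AND` > `OR` > `RANK` maps naturally onto a precedence-climbing parser. The sketch below is illustrative only (it is not the repo's QueryParser) and assumes left-associative operators:

```python
# Higher number = binds tighter; parentheses are handled structurally.
PRECEDENCE = {"ANDNOT": 4, "AND": 3, "OR": 2, "RANK": 1}

def parse(tokens: list, min_prec: int = 1):
    """Precedence-climbing parse of a flat token list into a nested tuple AST."""
    if tokens[0] == "(":
        tokens.pop(0)                  # consume "("
        left = parse(tokens, 1)
        tokens.pop(0)                  # consume ")"
    else:
        left = tokens.pop(0)           # a bare query term
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        right = parse(tokens, PRECEDENCE[op] + 1)  # +1 makes operators left-associative
        left = (op, left, right)
    return left

print(parse("( a OR b ) ANDNOT c".split()))
# ('ANDNOT', ('OR', 'a', 'b'), 'c')
```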
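
Likewise, the example ranking expression `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)` combines per-document signals additively. The helpers below are hypothetical stand-ins showing one plausible reading, with timeliness modeled as exponential decay (the actual decay function is not specified above):

```python
import math
import time

def timeliness(end_time: float, half_life_days: float = 30.0) -> float:
    """Exponential time decay: a recent end_time scores near 1.0."""
    age_days = max(0.0, (time.time() - end_time) / 86400)
    return math.exp(-age_days * math.log(2) / half_life_days)

def final_score(bm25: float, emb_sim: float, general_score: float, end_time: float) -> float:
    # Mirrors: static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
    return bm25 + 0.2 * emb_sim + general_score * 2 + timeliness(end_time)

print(final_score(bm25=12.4, emb_sim=0.83, general_score=1.5,
                  end_time=time.time() - 7 * 86400))
```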
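
And for the API surface: a minimal client call against the documented `POST /search/` endpoint might look as follows. Only the path, port 6002, and the `X-Tenant-ID` header come from the notes above; the request body fields are illustrative assumptions:

```python
import requests  # third-party HTTP client (pip install requests)

# Text search against the FastAPI service; tenant selected via header.
# The JSON body schema is a guess for illustration, not the repo's actual model.
resp = requests.post(
    "http://localhost:6002/search/",
    headers={"X-Tenant-ID": "1"},
    json={"query": "fire extinguisher", "size": 10},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```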
README.md
... ... @@ -2,6 +2,10 @@
2 2  
3 3 A multi-tenant, configurable search platform for cross-border independent storefronts (Shoplazza 店匠 and similar). This README is the project's navigation entry point, helping you locate the more detailed documentation for each stage.
4 4  
  5 +## Project Environment
  6 +source /home/tw/miniconda3/etc/profile.d/conda.sh
  7 +conda activate searchengine
  8 +
5 9 ## Core Capabilities at a Glance
6 10  
7 11 - **Multi-language + auto-translation**: language detection and routing for Chinese, English, Russian, and more (BGE-M3, DeepL)
... ...
scripts/csv_to_excel_multi_variant.py 0 → 100755
... ... @@ -0,0 +1,616 @@
  1 +#!/usr/bin/env python3
  2 +"""
  3 +Convert CSV data to Excel import template with multi-variant support.
  4 +
  5 +Reads CSV file (goods_with_pic.5years_congku.csv.shuf.1w) and generates Excel file
  6 +based on the template format (商品导入模板.xlsx).
  7 +
  8 +Features:
  9 +- 30% products as Single variant (S type)
  10 +- 70% products as Multi variant (M+P type) with color, size, material options
  11 +"""
  12 +
  13 +import sys
  14 +import os
  15 +import csv
  16 +import random
  17 +import argparse
  18 +import re
  19 +from pathlib import Path
  20 +from datetime import datetime, timedelta
  21 +import itertools
  22 +from openpyxl import load_workbook
  23 +from openpyxl.styles import Alignment
  24 +
  25 +# Add parent directory to path
  26 +sys.path.insert(0, str(Path(__file__).parent.parent))
  27 +
  28 +# Color definitions
  29 +COLORS = [
  30 + "Red", "Blue", "Green", "Yellow", "Black", "White", "Orange", "Purple",
  31 + "Pink", "Brown", "Gray", "Navy", "Beige", "Cream", "Maroon", "Olive",
  32 + "Teal", "Cyan", "Magenta", "Lime", "Indigo", "Gold", "Silver", "Bronze",
  33 + "Coral", "Turquoise", "Violet", "Khaki", "Charcoal", "Ivory"
  34 +]
  35 +
  36 +
  37 +def clean_value(value):
  38 + """
  39 + Clean and normalize value.
  40 +
  41 + Args:
  42 + value: Value to clean
  43 +
  44 + Returns:
  45 + Cleaned string value
  46 + """
  47 + if value is None:
  48 + return ''
  49 + value = str(value).strip()
  50 + # Remove surrounding quotes
  51 + if value.startswith('"') and value.endswith('"'):
  52 + value = value[1:-1]
  53 + return value
  54 +
  55 +
  56 +def parse_csv_row(row: dict) -> dict:
  57 + """
  58 + Parse CSV row and extract fields.
  59 +
  60 + Args:
  61 + row: CSV row dictionary
  62 +
  63 + Returns:
  64 + Parsed data dictionary
  65 + """
  66 + return {
  67 + 'skuId': clean_value(row.get('skuId', '')),
  68 + 'name': clean_value(row.get('name', '')),
  69 + 'name_pinyin': clean_value(row.get('name_pinyin', '')),
  70 + 'create_time': clean_value(row.get('create_time', '')),
  71 + 'ruSkuName': clean_value(row.get('ruSkuName', '')),
  72 + 'enSpuName': clean_value(row.get('enSpuName', '')),
  73 + 'categoryName': clean_value(row.get('categoryName', '')),
  74 + 'supplierName': clean_value(row.get('supplierName', '')),
  75 + 'brandName': clean_value(row.get('brandName', '')),
  76 + 'file_id': clean_value(row.get('file_id', '')),
  77 + 'days_since_last_update': clean_value(row.get('days_since_last_update', '')),
  78 + 'id': clean_value(row.get('id', '')),
  79 + 'imageUrl': clean_value(row.get('imageUrl', ''))
  80 + }
  81 +
  82 +
  83 +def generate_handle(title: str) -> str:
  84 + """
  85 + Generate URL-friendly handle from title.
  86 +
  87 + Args:
  88 + title: Product title
  89 +
  90 + Returns:
  91 + URL-friendly handle (ASCII only)
  92 + """
  93 + # Convert to lowercase
  94 + handle = title.lower()
  95 +
  96 + # Remove non-ASCII characters, keep only letters, numbers, spaces, and hyphens
  97 + handle = re.sub(r'[^a-z0-9\s-]', '', handle)
  98 +
  99 + # Replace spaces and multiple hyphens with single hyphen
  100 + handle = re.sub(r'[-\s]+', '-', handle)
  101 + handle = handle.strip('-')
  102 +
  103 + # Limit length
  104 + if len(handle) > 255:
  105 + handle = handle[:255]
  106 +
  107 + return handle or 'product'
  108 +
  109 +
  110 +def extract_material_from_title(title: str) -> str:
  111 + """
  112 + Extract material from title by taking the last word after splitting by space.
  113 +
  114 +    The last whitespace-separated token of the title is used as the material.
  115 +    Example: "消防套 塑料【英文包装】" -> the last token is "塑料【英文包装】".
  116 +
  117 + Args:
  118 + title: Product title
  119 +
  120 + Returns:
  121 + Material string (single value)
  122 + """
  123 + if not title:
  124 + return 'default'
  125 +
  126 +    # Split by whitespace only; tokens are otherwise kept as-is
  127 + parts = title.strip().split()
  128 + if parts:
  129 +        # Take the last token
  130 + material = parts[-1]
  131 + # Remove brackets but keep content
  132 + material = re.sub(r'[【】\[\]()()]', '', material)
  133 + material = material.strip()
  134 + if material:
  135 + return material
  136 +
  137 + return 'default'
  138 +
  139 +
  140 +def generate_single_variant_row(csv_data: dict, base_sku_id: int = 1) -> dict:
  141 + """
  142 + Generate Excel row for Single variant (S type) product.
  143 +
  144 + Args:
  145 + csv_data: Parsed CSV row data
  146 + base_sku_id: Base SKU ID for generating SKU code
  147 +
  148 + Returns:
  149 + Dictionary mapping Excel column names to values
  150 + """
  151 + # Parse create_time
  152 + try:
  153 + created_at = datetime.strptime(csv_data['create_time'], '%Y-%m-%d %H:%M:%S')
  154 + create_time_str = created_at.strftime('%Y-%m-%d %H:%M:%S')
  155 +    except ValueError:
  156 + created_at = datetime.now() - timedelta(days=random.randint(1, 365))
  157 + create_time_str = created_at.strftime('%Y-%m-%d %H:%M:%S')
  158 +
  159 + # Generate title - use name or enSpuName
  160 + title = csv_data['name'] or csv_data['enSpuName'] or 'Product'
  161 +
  162 + # Generate handle - prefer enSpuName, then name_pinyin, then title
  163 + handle_source = csv_data['enSpuName'] or csv_data['name_pinyin'] or title
  164 + handle = generate_handle(handle_source)
  165 + if handle and not handle.startswith('products/'):
  166 + handle = f'products/{handle}'
  167 +
  168 + # Generate SEO fields
  169 + seo_title = f"{title} - {csv_data['categoryName']}" if csv_data['categoryName'] else title
  170 + seo_description = f"购买{csv_data['brandName']}{title}" if csv_data['brandName'] else title
  171 + seo_keywords_parts = [title]
  172 + if csv_data['categoryName']:
  173 + seo_keywords_parts.append(csv_data['categoryName'])
  174 + if csv_data['brandName']:
  175 + seo_keywords_parts.append(csv_data['brandName'])
  176 + seo_keywords = ','.join(seo_keywords_parts)
  177 +
  178 + # Generate tags from category and brand
  179 + tags_parts = []
  180 + if csv_data['categoryName']:
  181 + tags_parts.append(csv_data['categoryName'])
  182 + if csv_data['brandName']:
  183 + tags_parts.append(csv_data['brandName'])
  184 + tags = ','.join(tags_parts) if tags_parts else ''
  185 +
  186 + # Generate prices
  187 + price = round(random.uniform(50, 500), 2)
  188 + compare_at_price = round(price * random.uniform(1.2, 1.5), 2)
  189 + cost_price = round(price * 0.6, 2)
  190 +
  191 + # Generate random stock
  192 + inventory_quantity = random.randint(0, 100)
  193 +
  194 + # Generate random weight
  195 + weight = round(random.uniform(0.1, 5.0), 2)
  196 + weight_unit = 'kg'
  197 +
  198 + # Use skuId as SKU code
  199 + sku_code = csv_data['skuId'] or f'SKU-{base_sku_id}'
  200 +
  201 + # Generate barcode
  202 + try:
  203 + sku_id = int(csv_data['skuId']) if csv_data['skuId'] else base_sku_id
  204 + barcode = f"BAR{sku_id:08d}"
  205 +    except ValueError:
  206 + barcode = f"BAR{base_sku_id:08d}"
  207 +
  208 + # Build description
  209 + description = f"<p>{csv_data['name']}</p>" if csv_data['name'] else ''
  210 +
  211 + # Build brief (subtitle)
  212 + brief = csv_data['name'] or ''
  213 +
  214 + # Excel row data
  215 + excel_row = {
  216 + '商品ID': '', # Empty for new products
  217 + '创建时间': create_time_str,
  218 + '商品标题*': title,
  219 + '商品属性*': 'S', # Single variant product
  220 + '商品副标题': brief,
  221 + '商品描述': description,
  222 + 'SEO标题': seo_title,
  223 + 'SEO描述': seo_description,
  224 + 'SEO URL Handle': handle,
  225 + 'SEO URL 重定向': 'N',
  226 + 'SEO关键词': seo_keywords,
  227 + '商品上架': 'Y',
  228 + '需要物流': 'Y',
  229 + '商品收税': 'N',
  230 + '商品spu': '',
  231 + '启用虚拟销量': 'N',
  232 + '虚拟销量值': '',
  233 + '跟踪库存': 'Y',
  234 + '库存规则*': '1',
  235 + '专辑名称': csv_data['categoryName'] or '',
  236 + '标签': tags,
  237 + '供应商名称': csv_data['supplierName'] or '',
  238 + '供应商URL': '',
  239 + '款式1': '', # Empty for S type
  240 + '款式2': '', # Empty for S type
  241 + '款式3': '', # Empty for S type
  242 + '商品售价*': price,
  243 + '商品原价': compare_at_price,
  244 + '成本价': cost_price,
  245 + '商品SKU': sku_code,
  246 + '商品重量': weight,
  247 + '重量单位': weight_unit,
  248 + '商品条形码': barcode,
  249 + '商品库存': inventory_quantity,
  250 + '尺寸信息': '',
  251 + '原产地国别': '',
  252 + 'HS(协调制度)代码': '',
  253 + '商品图片*': csv_data['imageUrl'] or '',
  254 + '商品备注': '',
  255 + '款式备注': '',
  256 + '商品主图': csv_data['imageUrl'] or '',
  257 + }
  258 +
  259 + return excel_row
  260 +
  261 +
  262 +def generate_multi_variant_rows(csv_data: dict, base_sku_id: int = 1) -> list:
  263 + """
  264 + Generate Excel rows for Multi variant (M+P type) product.
  265 +
  266 + Returns a list of rows:
  267 +    - First row: M (main product) with option names
  268 +    - Following rows: P (variants) with option values
  269 +
  270 + Args:
  271 + csv_data: Parsed CSV row data
  272 + base_sku_id: Base SKU ID for generating SKU codes
  273 +
  274 + Returns:
  275 + List of dictionaries mapping Excel column names to values
  276 + """
  277 + rows = []
  278 +
  279 + # Parse create_time
  280 + try:
  281 + created_at = datetime.strptime(csv_data['create_time'], '%Y-%m-%d %H:%M:%S')
  282 + create_time_str = created_at.strftime('%Y-%m-%d %H:%M:%S')
  283 +    except ValueError:
  284 + created_at = datetime.now() - timedelta(days=random.randint(1, 365))
  285 + create_time_str = created_at.strftime('%Y-%m-%d %H:%M:%S')
  286 +
  287 + # Generate title
  288 + title = csv_data['name'] or csv_data['enSpuName'] or 'Product'
  289 +
  290 + # Generate handle
  291 + handle_source = csv_data['enSpuName'] or csv_data['name_pinyin'] or title
  292 + handle = generate_handle(handle_source)
  293 + if handle and not handle.startswith('products/'):
  294 + handle = f'products/{handle}'
  295 +
  296 + # Generate SEO fields
  297 + seo_title = f"{title} - {csv_data['categoryName']}" if csv_data['categoryName'] else title
  298 + seo_description = f"购买{csv_data['brandName']}{title}" if csv_data['brandName'] else title
  299 + seo_keywords_parts = [title]
  300 + if csv_data['categoryName']:
  301 + seo_keywords_parts.append(csv_data['categoryName'])
  302 + if csv_data['brandName']:
  303 + seo_keywords_parts.append(csv_data['brandName'])
  304 + seo_keywords = ','.join(seo_keywords_parts)
  305 +
  306 + # Generate tags
  307 + tags_parts = []
  308 + if csv_data['categoryName']:
  309 + tags_parts.append(csv_data['categoryName'])
  310 + if csv_data['brandName']:
  311 + tags_parts.append(csv_data['brandName'])
  312 + tags = ','.join(tags_parts) if tags_parts else ''
  313 +
  314 + # Extract material from title (last word after splitting by space)
  315 + material = extract_material_from_title(title)
  316 +
  317 + # Generate color options: randomly select 2-10 colors from COLORS list
  318 + num_colors = random.randint(2, 10)
  319 + selected_colors = random.sample(COLORS, min(num_colors, len(COLORS)))
  320 +
  321 + # Generate size options: 1-30, randomly select 4-8
  322 + num_sizes = random.randint(4, 8)
  323 + all_sizes = [str(i) for i in range(1, 31)]
  324 + selected_sizes = random.sample(all_sizes, num_sizes)
  325 +
  326 + # Material has only one value
  327 + materials = [material]
  328 +
  329 + # Generate all combinations (Cartesian product)
  330 + variants = list(itertools.product(selected_colors, selected_sizes, materials))
  331 +
  332 +    # Generate M row (main product)
  333 + description = f"<p>{csv_data['name']}</p>" if csv_data['name'] else ''
  334 + brief = csv_data['name'] or ''
  335 +
  336 + m_row = {
  337 + '商品ID': '',
  338 + '创建时间': create_time_str,
  339 + '商品标题*': title,
  340 + '商品属性*': 'M', # Main product
  341 + '商品副标题': brief,
  342 + '商品描述': description,
  343 + 'SEO标题': seo_title,
  344 + 'SEO描述': seo_description,
  345 + 'SEO URL Handle': handle,
  346 + 'SEO URL 重定向': 'N',
  347 + 'SEO关键词': seo_keywords,
  348 + '商品上架': 'Y',
  349 + '需要物流': 'Y',
  350 + '商品收税': 'N',
  351 + '商品spu': '',
  352 + '启用虚拟销量': 'N',
  353 + '虚拟销量值': '',
  354 + '跟踪库存': 'Y',
  355 + '库存规则*': '1',
  356 + '专辑名称': csv_data['categoryName'] or '',
  357 + '标签': tags,
  358 + '供应商名称': csv_data['supplierName'] or '',
  359 + '供应商URL': '',
  360 + '款式1': 'color', # Option name
  361 + '款式2': 'size', # Option name
  362 + '款式3': 'material', # Option name
  363 + '商品售价*': '', # Empty for M row
  364 + '商品原价': '',
  365 + '成本价': '',
  366 + '商品SKU': '', # Empty for M row
  367 + '商品重量': '',
  368 + '重量单位': '',
  369 + '商品条形码': '',
  370 + '商品库存': '', # Empty for M row
  371 + '尺寸信息': '',
  372 + '原产地国别': '',
  373 + 'HS(协调制度)代码': '',
  374 + '商品图片*': csv_data['imageUrl'] or '', # Main product image
  375 + '商品备注': '',
  376 + '款式备注': '',
  377 + '商品主图': csv_data['imageUrl'] or '',
  378 + }
  379 + rows.append(m_row)
  380 +
  381 +    # Generate P rows, one per variant combination
  382 + base_price = round(random.uniform(50, 500), 2)
  383 +
  384 + for variant_idx, (color, size, mat) in enumerate(variants):
  385 + # Generate price variation (within ±20% of base)
  386 + price = round(base_price * random.uniform(0.8, 1.2), 2)
  387 + compare_at_price = round(price * random.uniform(1.2, 1.5), 2)
  388 + cost_price = round(price * 0.6, 2)
  389 +
  390 + # Generate random stock
  391 + inventory_quantity = random.randint(0, 100)
  392 +
  393 + # Generate random weight
  394 + weight = round(random.uniform(0.1, 5.0), 2)
  395 + weight_unit = 'kg'
  396 +
  397 + # Generate SKU code
  398 + sku_code = f"{csv_data['skuId']}-{color}-{size}-{mat}" if csv_data['skuId'] else f'SKU-{base_sku_id}-{variant_idx+1}'
  399 +
  400 + # Generate barcode
  401 + barcode = f"BAR{base_sku_id:08d}{variant_idx+1:03d}"
  402 +
  403 + p_row = {
  404 + '商品ID': '',
  405 + '创建时间': create_time_str,
  406 + '商品标题*': title, # Same as M row
  407 + '商品属性*': 'P', # Variant
  408 + '商品副标题': '', # Empty for P row
  409 + '商品描述': '', # Empty for P row
  410 + 'SEO标题': '', # Empty for P row
  411 + 'SEO描述': '', # Empty for P row
  412 + 'SEO URL Handle': '', # Empty for P row
  413 + 'SEO URL 重定向': '',
  414 + 'SEO关键词': '',
  415 + '商品上架': 'Y',
  416 + '需要物流': 'Y',
  417 + '商品收税': 'N',
  418 + '商品spu': '',
  419 + '启用虚拟销量': 'N',
  420 + '虚拟销量值': '',
  421 + '跟踪库存': 'Y',
  422 + '库存规则*': '1',
  423 + '专辑名称': '', # Empty for P row
  424 + '标签': '', # Empty for P row
  425 + '供应商名称': '', # Empty for P row
  426 + '供应商URL': '',
  427 + '款式1': color, # Option value
  428 + '款式2': size, # Option value
  429 + '款式3': mat, # Option value
  430 + '商品售价*': price,
  431 + '商品原价': compare_at_price,
  432 + '成本价': cost_price,
  433 + '商品SKU': sku_code,
  434 + '商品重量': weight,
  435 + '重量单位': weight_unit,
  436 + '商品条形码': barcode,
  437 + '商品库存': inventory_quantity,
  438 + '尺寸信息': '',
  439 + '原产地国别': '',
  440 + 'HS(协调制度)代码': '',
  441 + '商品图片*': '', # Empty for P row (uses main product image)
  442 + '商品备注': '',
  443 + '款式备注': '',
  444 + '商品主图': '',
  445 + }
  446 + rows.append(p_row)
  447 +
  448 + return rows
  449 +
  450 +
  451 +def read_csv_file(csv_file: str) -> list:
  452 + """
  453 + Read CSV file and return list of parsed rows.
  454 +
  455 + Args:
  456 + csv_file: Path to CSV file
  457 +
  458 + Returns:
  459 + List of parsed CSV data dictionaries
  460 + """
  461 + csv_data_list = []
  462 +
  463 + with open(csv_file, 'r', encoding='utf-8') as f:
  464 + reader = csv.DictReader(f)
  465 + for row in reader:
  466 + parsed = parse_csv_row(row)
  467 + csv_data_list.append(parsed)
  468 +
  469 + return csv_data_list
  470 +
  471 +
  472 +def create_excel_from_template(template_file: str, output_file: str, excel_rows: list):
  473 + """
  474 + Create Excel file from template and fill with data rows.
  475 +
  476 + Args:
  477 + template_file: Path to Excel template file
  478 + output_file: Path to output Excel file
  479 + excel_rows: List of dictionaries mapping Excel column names to values
  480 + """
  481 + # Load template
  482 + wb = load_workbook(template_file)
  483 + ws = wb.active # Use the active sheet (Sheet4)
  484 +
  485 + # Find header row (row 2)
  486 + header_row_idx = 2
  487 +
  488 + # Get column mapping from header row
  489 + column_mapping = {}
  490 + for col_idx in range(1, ws.max_column + 1):
  491 + cell_value = ws.cell(row=header_row_idx, column=col_idx).value
  492 + if cell_value:
  493 + column_mapping[cell_value] = col_idx
  494 +
  495 + # Start writing data from row 4
  496 + data_start_row = 4
  497 +
  498 + # Clear existing data rows
  499 + last_template_row = ws.max_row
  500 + if last_template_row >= data_start_row:
  501 + for row in range(data_start_row, last_template_row + 1):
  502 + for col in range(1, ws.max_column + 1):
  503 + ws.cell(row=row, column=col).value = None
  504 +
  505 + # Write data rows
  506 + for row_idx, excel_row in enumerate(excel_rows):
  507 + excel_row_num = data_start_row + row_idx
  508 +
  509 + # Write each field to corresponding column
  510 + for field_name, col_idx in column_mapping.items():
  511 + if field_name in excel_row:
  512 + cell = ws.cell(row=excel_row_num, column=col_idx)
  513 + value = excel_row[field_name]
  514 + cell.value = value
  515 +
  516 + # Set alignment
  517 + if isinstance(value, str):
  518 + cell.alignment = Alignment(vertical='top', wrap_text=True)
  519 + elif isinstance(value, (int, float)):
  520 + cell.alignment = Alignment(vertical='top')
  521 +
  522 + # Save workbook
  523 + wb.save(output_file)
  524 + print(f"Excel file created: {output_file}")
  525 + print(f" - Total rows: {len(excel_rows)}")
  526 +
  527 +
  528 +def main():
  529 + parser = argparse.ArgumentParser(description='Convert CSV data to Excel import template with multi-variant support')
  530 + parser.add_argument('--csv-file',
  531 + default='data/customer1/goods_with_pic.5years_congku.csv.shuf.1w',
  532 + help='CSV file path')
  533 + parser.add_argument('--template',
  534 + default='docs/商品导入模板.xlsx',
  535 + help='Excel template file path')
  536 + parser.add_argument('--output',
  537 + default='商品导入数据.xlsx',
  538 + help='Output Excel file path')
  539 + parser.add_argument('--limit',
  540 + type=int,
  541 + default=None,
  542 + help='Limit number of products to process')
  543 + parser.add_argument('--single-ratio',
  544 + type=float,
  545 + default=0.3,
  546 + help='Ratio of single variant products (default: 0.3 = 30%%)')
  547 + parser.add_argument('--seed',
  548 + type=int,
  549 + default=None,
  550 + help='Random seed for reproducible results')
  551 +
  552 + args = parser.parse_args()
  553 +
  554 + # Set random seed if provided
  555 + if args.seed is not None:
  556 + random.seed(args.seed)
  557 +
  558 + # Check if files exist
  559 + if not os.path.exists(args.csv_file):
  560 + print(f"Error: CSV file not found: {args.csv_file}")
  561 + sys.exit(1)
  562 +
  563 + if not os.path.exists(args.template):
  564 + print(f"Error: Template file not found: {args.template}")
  565 + sys.exit(1)
  566 +
  567 + # Read CSV file
  568 + print(f"Reading CSV file: {args.csv_file}")
  569 + csv_data_list = read_csv_file(args.csv_file)
  570 + print(f"Read {len(csv_data_list)} rows from CSV")
  571 +
  572 + # Limit products if specified
  573 + if args.limit:
  574 + csv_data_list = csv_data_list[:args.limit]
  575 + print(f"Limited to {len(csv_data_list)} products")
  576 +
  577 + # Generate Excel rows
  578 + print(f"\nGenerating Excel rows...")
  579 + print(f" - Single variant ratio: {args.single_ratio*100:.0f}%")
  580 + print(f" - Multi variant ratio: {(1-args.single_ratio)*100:.0f}%")
  581 +
  582 + excel_rows = []
  583 + single_count = 0
  584 + multi_count = 0
  585 +
  586 + for idx, csv_data in enumerate(csv_data_list):
  587 + # Decide if this product should be single or multi variant
  588 + is_single = random.random() < args.single_ratio
  589 +
  590 + if is_single:
  591 + # Generate single variant (S type)
  592 + row = generate_single_variant_row(csv_data, base_sku_id=idx+1)
  593 + excel_rows.append(row)
  594 + single_count += 1
  595 + else:
  596 + # Generate multi variant (M+P type)
  597 + rows = generate_multi_variant_rows(csv_data, base_sku_id=idx+1)
  598 + excel_rows.extend(rows)
  599 + multi_count += 1
  600 +
  601 + print(f"\nGenerated:")
  602 + print(f" - Single variant products: {single_count}")
  603 + print(f" - Multi variant products: {multi_count}")
  604 + print(f" - Total Excel rows: {len(excel_rows)}")
  605 +
  606 + # Create Excel file
  607 + print(f"\nCreating Excel file from template: {args.template}")
  608 + print(f"Output file: {args.output}")
  609 + create_excel_from_template(args.template, args.output, excel_rows)
  610 +
  611 + print(f"\nDone! Generated {len(excel_rows)} rows in Excel file.")
  612 +
  613 +
  614 +if __name__ == '__main__':
  615 + main()
  616 +
... ...
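
Usage note: with the defaults defined in main(), a reproducible smoke test would be `python scripts/csv_to_excel_multi_variant.py --limit 100 --seed 42`. That reads the first 100 rows of the bundled CSV, splits them roughly 30% S / 70% M+P, and writes 商品导入数据.xlsx based on the 商品导入模板.xlsx template.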