diff --git a/CLAUDE.md b/CLAUDE.md
index f0d5a7d..145feed 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -4,18 +4,25 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 ## Project Overview
 
-This is a **Search Engine SaaS** project for e-commerce product search, designed for Shoplazza (店匠) independent sites. The system provides Elasticsearch-based product search capabilities with multi-language support, text/image embeddings, and configurable ranking.
+This is a **Search Engine SaaS** platform for e-commerce product search, specifically designed for Shoplazza (店匠) independent sites. It is a multi-tenant, configurable search system built on Elasticsearch with AI-powered search capabilities.
 
 **Tech Stack:**
-- Elasticsearch as the search engine backend
+- Elasticsearch 8.x as the search engine backend
 - MySQL (Shoplazza database) as the primary data source
-- Python for data processing and ingestion
+- Python 3.10 with PyTorch/CUDA support
 - BGE-M3 model for text embeddings (1024-dim vectors)
 - CN-CLIP (ViT-H-14) for image embeddings
+- FastAPI for REST API layer
 
-## Database Configuration
+## Development Environment
 
-**Shoplazza Production Database:**
+**Required Environment Setup:**
+```bash
+source /home/tw/miniconda3/etc/profile.d/conda.sh
+conda activate searchengine
+```
+
+**Database Configuration:**
 ```
 host: 120.79.247.228
 port: 3316
@@ -24,85 +31,217 @@ username: saas
 password: P89cZHS5d7dFyc9R
 ```
 
-**Main Tables:**
-- `shoplazza_product_sku` - SKU level product data
-- `shoplazza_product_spu` - SPU level product data
+## Common Development Commands
 
-## Architecture
+### Environment Setup
+```bash
+# Complete environment setup
+./setup.sh
 
-### Data Flow
-1. **Data Source (MySQL)** → Main tables (`shoplazza_product_sku`, `shoplazza_product_spu`) + tenant extension tables
-2. **Indexer** → Reads from MySQL, applies transformations (embeddings, etc.), writes to Elasticsearch
-3. **Query Parser** → Query rewriting, translation, text embedding conversion
-4. **Searcher** → Executes searches against Elasticsearch with configurable ranking
+# Install Python dependencies
+pip install -r requirements.txt
+```
 
-### Multi-Tenant Design
-Each tenant has their own extension table to store custom attributes, multi-language fields (titles, brand names, tags, categories), and business-specific metadata. The main SKU table is joined with tenant extension tables during indexing.
+### Data Management
+```bash
+# Generate test data (Tenant1 Mock + Tenant2 CSV)
+./scripts/mock_data.sh
 
-### Configuration System
+# Ingest data to Elasticsearch
+./scripts/ingest.sh <tenant_id> [recreate]  # e.g., ./scripts/ingest.sh 1 true
+python main.py ingest data.csv --limit 1000 --batch-size 50
+```
 
-The system uses two types of configurations per tenant:
+### Running Services
+```bash
+# Start all services (production)
+./run.sh
 
-1. **Application Structure Config** (`IndexerConfig`) - Defines:
-   - Input field mappings from MySQL to Elasticsearch
-   - Field types (TEXT, EMBEDDING, LITERAL, INT, DOUBLE, etc.)
-   - Which fields require preprocessing (embeddings, transformations)
+# Start development server with auto-reload
+./scripts/start_backend.sh
+python main.py serve --host 0.0.0.0 --port 6002 --reload
 
-2. **Index Structure Config** - Defines:
-   - Elasticsearch field mappings and analyzers
-   - Supported analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese
-   - Query domain definitions (default, category_name, title, brand_name, etc.)
- - BM25 parameters and similarity configurations +# Start frontend debugging UI +./scripts/start_frontend.sh +``` -### Query Processing +### Testing +```bash +# Run all tests +python -m pytest tests/ -The `queryParser` performs: -1. **Query Rewriting** - Dictionary-based rewriting (brand terms, category terms, synonyms, corrections) -2. **Translation** - Language detection and translation to support multi-language search (e.g., zh↔en) -3. **Text Embedding** - Converts query text to vectors when vector search is enabled +# Run specific test types +python -m pytest tests/unit/ # Unit tests +python -m pytest tests/integration/ # Integration tests +python -m pytest -m "api" # API tests only -### Search and Ranking +# Test search from command line +python main.py search "query" --tenant-id 1 --size 10 +``` -The `searcher` supports: -- Boolean operators: AND, OR, RANK, ANDNOT with parentheses -- Operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK` -- Configurable ranking expressions for the `default` domain: - - Example: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)` - - Combines BM25 text relevance, embedding similarity, product scores, and time decay +### Development Utilities +```bash +# Stop all services +./scripts/stop.sh -### Embedding Modules +# Test environment (for CI/development) +./scripts/start_test_environment.sh +./scripts/stop_test_environment.sh -**Text Embedding** - Uses BGE-M3 model (`Xorbits/bge-m3`): -- Singleton pattern with thread-safe initialization -- Generates 1024-dimensional vectors -- Configured for GPU/CUDA acceleration +# Install server dependencies +./scripts/install_server_deps.sh +``` -**Image Embedding** - Uses CN-CLIP model (ViT-H-14): -- Downloads and validates images from URLs -- Preprocesses images (resize, RGB conversion) -- Generates 1024-dimensional vectors -- Supports both local and remote images +## Architecture Overview + +### Core Components +``` +/data/tw/SearchEngine/ +├── api/ # FastAPI REST API service (port 6002) +├── config/ # Configuration management system +├── indexer/ # MySQL → Elasticsearch data pipeline +├── search/ # Search engine and ranking logic +├── query/ # Query parsing, translation, rewriting +├── embeddings/ # ML models (BGE-M3, CN-CLIP) +├── scripts/ # Automation and utility scripts +├── utils/ # Shared utilities (ES client, etc.) +├── frontend/ # Simple debugging UI +├── mappings/ # Elasticsearch index mappings +└── tests/ # Unit and integration tests +``` + +### Data Flow Architecture +**Pipeline**: MySQL → Indexer → Elasticsearch → API → Frontend + +1. **Data Source Layer**: + - Shoplazza MySQL database with `shoplazza_product_sku` and `shoplazza_product_spu` tables + - Tenant-specific extension tables for custom attributes and multi-language fields + +2. **Indexing Layer** (`indexer/`): + - Reads from MySQL, applies transformations with embeddings + - Uses `DataTransformer` and `IndexingPipeline` for batch processing + - Supports both full and incremental indexing with embedding caching + +3. **Query Processing Layer** (`query/`): + - `QueryParser`: Handles query rewriting, translation, and text embedding conversion + - Multi-language support with automatic detection and translation + - Boolean logic parsing with operator precedence: `()` > `ANDNOT` > `AND` > `OR` > `RANK` + +4. 
**Search Engine Layer** (`search/`): + - `Searcher`: Executes hybrid searches combining BM25 and dense vectors + - Configurable ranking expressions with function_score support + - Multi-tenant isolation via `tenant_id` field + +5. **API Layer** (`api/`): + - FastAPI service on port 6002 with multi-tenant support + - Text search: `POST /search/` + - Image search: `POST /image-search/` + - Tenant identification via `X-Tenant-ID` header -## Test Data -、 -**Tenant1 Test Dataset:** -- Location: `data/tenant1/goods_with_pic.5years_congku.csv.shuf.1w` -- Contains 10,000 shuffled product records with images -- Processing script: `data/tenant1/task2_process_goods.py` - - Extracts product data from MySQL - - Maps images from filebank database - - Creates inverted index (URL → SKU list) +### Multi-Tenant Configuration System -## Key Implementation Notes +The system uses centralized configuration through `config/config.yaml`: -1. **Data Sync:** Full data sync from MySQL to Elasticsearch is handled by a separate Java project (not in this repo). This repo may include a simple full-load implementation for testing purposes. +1. **Field Configuration** (`config/field_types.py`): + - Defines field types: TEXT, KEYWORD, EMBEDDING, INT, DOUBLE, etc. + - Specifies analyzers: Chinese (ansj), English, Arabic, Spanish, Russian, Japanese + - Required fields and preprocessing rules -2. **Extension Tables:** When designing tenant configurations, determine which fields exist in the main SKU table vs. which need to be added to tenant-specific extension tables. +2. **Index Configuration** (`mappings/search_products.json`): + - Unified index structure shared by all tenants + - Elasticsearch field mappings and analyzer configurations + - BM25 similarity with modified parameters (`b=0.0, k1=0.0`) -3. **Embedding Caching:** For periodic full indexing, embedding results should be cached to avoid recomputation. +3. **Query Configuration** (`search/query_config.py`): + - Query domain definitions (default, category_name, title, brand_name, etc.) + - Ranking expressions and function_score configurations + - Translation and embedding settings + +### Embedding Models + +**Text Embedding** (`embeddings/bge_encoder.py`): +- Uses BGE-M3 model (`Xorbits/bge-m3`) +- Singleton pattern with thread-safe initialization +- Generates 1024-dimensional vectors with GPU/CUDA support +- Configurable caching to avoid recomputation -4. **ES Similarity Configuration:** All text fields use modified BM25 with `b=0.0, k1=0.0` as the default similarity. +**Image Embedding** (`embeddings/clip_encoder.py`): +- Uses CN-CLIP model (ViT-H-14) +- Downloads and preprocesses images from URLs +- Supports both local and remote image processing +- Generates 1024-dimensional vectors + +### Search and Ranking -5. **Multi-Language Support:** The system is designed for cross-border e-commerce with at minimum Chinese and English support, with extensibility for other languages (Arabic, Spanish, Russian, Japanese). 
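+As an illustration of the configurable ranking expression quoted below, here is a minimal sketch of how such an expression could be evaluated per hit (the function names, weights, and decay form are assumptions for illustration, not the repo's actual implementation):
+
+```python
+import math
+import time
+
+def timeliness(end_time: float, half_life_days: float = 30.0) -> float:
+    """Assumed exponential time decay: 1.0 at end_time, halving every half_life_days."""
+    age_days = max(0.0, (time.time() - end_time) / 86400.0)
+    return math.exp(-math.log(2) * age_days / half_life_days)
+
+def final_score(bm25: float, embed_sim: float, general_score: float, end_time: float) -> float:
+    # Mirrors: static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)
+    return bm25 + 0.2 * embed_sim + general_score * 2 + timeliness(end_time)
+
+# Example hit: BM25 12.3, cosine similarity 0.82, product score 1.5, ended 10 days ago.
+print(final_score(12.3, 0.82, 1.5, time.time() - 10 * 86400))
+```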
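+The boolean operators and their documented precedence (`()` > `ANDNOT` > `AND` > `OR` > `RANK`) can be pictured with a tiny precedence-climbing evaluator over sets of doc ids — a sketch only; the real parser lives in `query/` and is richer:
+
+```python
+import re
+
+PREC = {"ANDNOT": 4, "AND": 3, "OR": 2, "RANK": 1}
+
+def parse(tokens, index, min_prec=1):
+    lhs = parse_atom(tokens, index)
+    while tokens and tokens[0] in PREC and PREC[tokens[0]] >= min_prec:
+        op = tokens.pop(0)
+        rhs = parse(tokens, index, PREC[op] + 1)
+        if op == "AND":
+            lhs = lhs & rhs
+        elif op == "OR":
+            lhs = lhs | rhs
+        elif op == "ANDNOT":
+            lhs = lhs - rhs
+        # RANK only reorders results; it does not change the matched set here
+    return lhs
+
+def parse_atom(tokens, index):
+    tok = tokens.pop(0)
+    if tok == "(":
+        val = parse(tokens, index)
+        tokens.pop(0)  # consume ")"
+        return val
+    return index.get(tok, set())
+
+index = {"shoes": {1, 2}, "boots": {2, 3}, "leather": {3, 4}}
+tokens = re.findall(r"\(|\)|[^\s()]+", "shoes OR boots AND leather")
+print(parse(tokens, index))  # AND binds tighter: shoes OR (boots AND leather) -> {1, 2, 3}
+```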
-- 记住这个项目的环境是source /home/tw/miniconda3/etc/profile.d/conda.sh && conda activate searchengine +**Hybrid Search Approach**: +- Combines traditional BM25 text relevance with dense vector similarity +- Supports text embeddings (BGE-M3) and image embeddings (CN-CLIP) +- Configurable ranking expressions like: `static_bm25() + 0.2*text_embedding_relevance() + general_score*2 + timeliness(end_time)` + +**Boolean Search Support**: +- Full boolean logic with AND, OR, ANDNOT, RANK operators +- Parentheses for complex query structures +- Configurable operator precedence + +**Faceted Search**: +- Terms and range faceting support +- Multi-dimensional filtering capabilities +- Configurable facet fields and aggregations + +## Testing Infrastructure + +**Test Framework**: pytest with async support + +**Test Structure**: +- `tests/conftest.py`: Comprehensive test fixtures and configuration +- `tests/unit/`: Unit tests for individual components +- `tests/integration/`: Integration tests for system workflows +- Test markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.api` + +**Test Data**: +- Tenant1: Mock data with 10,000 product records +- Tenant2: CSV-based test dataset +- Automated test data generation via `scripts/mock_data.sh` + +**Key Test Fixtures** (from `conftest.py`): +- `sample_search_config`: Complete configuration for testing +- `mock_es_client`: Mocked Elasticsearch client +- `test_searcher`: Searcher instance with mock dependencies +- `temp_config_file`: Temporary YAML configuration for tests + +## API Endpoints + +**Main API** (FastAPI on port 6002): +- `POST /search/` - Text search with multi-language support +- `POST /image-search/` - Image search using CN-CLIP embeddings +- Health check and management endpoints +- Multi-tenant support via `X-Tenant-ID` header + +**API Features**: +- Hybrid search combining text and vector similarity +- Configurable ranking and filtering +- Faceted search with aggregations +- Multi-language query processing and translation +- Real-time search with configurable result sizes + +## Key Implementation Details + +1. **Environment Variables**: All sensitive configuration stored in `.env` (template: `.env.example`) +2. **Configuration Management**: Dynamic config loading through `config/config_loader.py` +3. **Error Handling**: Comprehensive error handling with proper HTTP status codes +4. **Performance**: Batch processing for indexing, embedding caching, and connection pooling +5. **Logging**: Structured logging with request tracing for debugging +6. 
**Security**: Tenant isolation at the index level with proper access controls
+
+## Database Tables
+
+**Main Tables**:
+- `shoplazza_product_sku` - SKU level product data with pricing and inventory
+- `shoplazza_product_spu` - SPU level product data with categories and attributes
+- Tenant extension tables for custom fields and multi-language content
+
+**Data Processing**:
+- Full data sync handled by separate Java project (not in this repo)
+- This repository includes test implementations for development and debugging
+- Extension tables joined with main tables during indexing process
diff --git a/README.md b/README.md
index 1ba08b7..036025d 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,12 @@
 一个针对跨境独立站(店匠 Shoplazza 等)的多租户可配置搜索平台。README 作为项目导航入口,帮助你在不同阶段定位到更详细的文档。
 
+## 项目环境
+```bash
+source /home/tw/miniconda3/etc/profile.d/conda.sh
+conda activate searchengine
+```
+
 ## 核心能力速览
 
 - **多语言 + 自动翻译**:中文、英文、俄文等语言检测与路由(BGE-M3、DeepL)
diff --git a/scripts/csv_to_excel_multi_variant.py b/scripts/csv_to_excel_multi_variant.py
new file mode 100755
index 0000000..4df2e1d
--- /dev/null
+++ b/scripts/csv_to_excel_multi_variant.py
@@ -0,0 +1,616 @@
+#!/usr/bin/env python3
+"""
+Convert CSV data to Excel import template with multi-variant support.
+
+Reads CSV file (goods_with_pic.5years_congku.csv.shuf.1w) and generates Excel file
+based on the template format (商品导入模板.xlsx).
+
+Features:
+- 30% products as Single variant (S type)
+- 70% products as Multi variant (M+P type) with color, size, material options
+"""
+
+import sys
+import os
+import csv
+import random
+import argparse
+import re
+from pathlib import Path
+from datetime import datetime, timedelta
+import itertools
+from openpyxl import load_workbook
+from openpyxl.styles import Alignment
+
+# Add parent directory to path
+sys.path.insert(0, str(Path(__file__).parent.parent))
+
+# Color definitions
+COLORS = [
+    "Red", "Blue", "Green", "Yellow", "Black", "White", "Orange", "Purple",
+    "Pink", "Brown", "Gray", "Navy", "Beige", "Cream", "Maroon", "Olive",
+    "Teal", "Cyan", "Magenta", "Lime", "Indigo", "Gold", "Silver", "Bronze",
+    "Coral", "Turquoise", "Violet", "Khaki", "Charcoal", "Ivory"
+]
+
+
+def clean_value(value):
+    """
+    Clean and normalize value.
+
+    Args:
+        value: Value to clean
+
+    Returns:
+        Cleaned string value
+    """
+    if value is None:
+        return ''
+    value = str(value).strip()
+    # Remove surrounding quotes
+    if value.startswith('"') and value.endswith('"'):
+        value = value[1:-1]
+    return value
+
+
+def parse_csv_row(row: dict) -> dict:
+    """
+    Parse CSV row and extract fields.
+
+    Args:
+        row: CSV row dictionary
+
+    Returns:
+        Parsed data dictionary
+    """
+    return {
+        'skuId': clean_value(row.get('skuId', '')),
+        'name': clean_value(row.get('name', '')),
+        'name_pinyin': clean_value(row.get('name_pinyin', '')),
+        'create_time': clean_value(row.get('create_time', '')),
+        'ruSkuName': clean_value(row.get('ruSkuName', '')),
+        'enSpuName': clean_value(row.get('enSpuName', '')),
+        'categoryName': clean_value(row.get('categoryName', '')),
+        'supplierName': clean_value(row.get('supplierName', '')),
+        'brandName': clean_value(row.get('brandName', '')),
+        'file_id': clean_value(row.get('file_id', '')),
+        'days_since_last_update': clean_value(row.get('days_since_last_update', '')),
+        'id': clean_value(row.get('id', '')),
+        'imageUrl': clean_value(row.get('imageUrl', ''))
+    }
+
+
+def generate_handle(title: str) -> str:
+    """
+    Generate URL-friendly handle from title.
+
+    Args:
+        title: Product title
+
+    Returns:
+        URL-friendly handle (ASCII only)
+    """
+    # Convert to lowercase
+    handle = title.lower()
+
+    # Remove non-ASCII characters, keep only letters, numbers, spaces, and hyphens
+    handle = re.sub(r'[^a-z0-9\s-]', '', handle)
+
+    # Replace spaces and multiple hyphens with single hyphen
+    handle = re.sub(r'[-\s]+', '-', handle)
+    handle = handle.strip('-')
+
+    # Limit length
+    if len(handle) > 255:
+        handle = handle[:255]
+
+    return handle or 'product'
+
+
+def extract_material_from_title(title: str) -> str:
+    """
+    Extract material from title by taking the last word after splitting by space.
+
+    The last whitespace-separated token of the title is treated as the material.
+    Example: for "消防套 塑料【英文包装】" the last token is "塑料【英文包装】",
+    which becomes "塑料英文包装" after the brackets are stripped.
+
+    Args:
+        title: Product title
+
+    Returns:
+        Material string (single value)
+    """
+    if not title:
+        return 'default'
+
+    # Split on whitespace only, keeping tokens as-is
+    parts = title.strip().split()
+    if parts:
+        # Take the last token
+        material = parts[-1]
+        # Remove brackets but keep content
+        material = re.sub(r'[【】\[\]()()]', '', material)
+        material = material.strip()
+        if material:
+            return material
+
+    return 'default'
+
+
+def generate_single_variant_row(csv_data: dict, base_sku_id: int = 1) -> dict:
+    """
+    Generate Excel row for Single variant (S type) product.
+
+    Args:
+        csv_data: Parsed CSV row data
+        base_sku_id: Base SKU ID for generating SKU code
+
+    Returns:
+        Dictionary mapping Excel column names to values
+    """
+    # Parse create_time
+    try:
+        created_at = datetime.strptime(csv_data['create_time'], '%Y-%m-%d %H:%M:%S')
+        create_time_str = created_at.strftime('%Y-%m-%d %H:%M:%S')
+    except ValueError:
+        created_at = datetime.now() - timedelta(days=random.randint(1, 365))
+        create_time_str = created_at.strftime('%Y-%m-%d %H:%M:%S')
+
+    # Generate title - use name or enSpuName
+    title = csv_data['name'] or csv_data['enSpuName'] or 'Product'
+
+    # Generate handle - prefer enSpuName, then name_pinyin, then title
+    handle_source = csv_data['enSpuName'] or csv_data['name_pinyin'] or title
+    handle = generate_handle(handle_source)
+    if handle and not handle.startswith('products/'):
+        handle = f'products/{handle}'
+
+    # Generate SEO fields
+    seo_title = f"{title} - {csv_data['categoryName']}" if csv_data['categoryName'] else title
+    seo_description = f"购买{csv_data['brandName']}{title}" if csv_data['brandName'] else title
+    seo_keywords_parts = [title]
+    if csv_data['categoryName']:
+        seo_keywords_parts.append(csv_data['categoryName'])
+    if csv_data['brandName']:
+        seo_keywords_parts.append(csv_data['brandName'])
+    seo_keywords = ','.join(seo_keywords_parts)
+
+    # Generate tags from category and brand
+    tags_parts = []
+    if csv_data['categoryName']:
+        tags_parts.append(csv_data['categoryName'])
+    if csv_data['brandName']:
+        tags_parts.append(csv_data['brandName'])
+    tags = ','.join(tags_parts) if tags_parts else ''
+
+    # Generate prices
+    price = round(random.uniform(50, 500), 2)
+    compare_at_price = round(price * random.uniform(1.2, 1.5), 2)
+    cost_price = round(price * 0.6, 2)
+
+    # Generate random stock
+    inventory_quantity = random.randint(0, 100)
+
+    # Generate random weight
+    weight = round(random.uniform(0.1, 5.0), 2)
+    weight_unit = 'kg'
+
+    # Use skuId as SKU code
+    sku_code = csv_data['skuId'] or f'SKU-{base_sku_id}'
+
+    # Generate barcode
+    try:
+        sku_id = int(csv_data['skuId']) if csv_data['skuId'] else base_sku_id
+        barcode = f"BAR{sku_id:08d}"
+    except ValueError:
+        barcode = f"BAR{base_sku_id:08d}"
+
+    # Build description (simple HTML wrapper; tag choice assumed)
+    description = f"<div><p>{csv_data['name']}</p></div>
" if csv_data['name'] else '' + + # Build brief (subtitle) + brief = csv_data['name'] or '' + + # Excel row data + excel_row = { + '商品ID': '', # Empty for new products + '创建时间': create_time_str, + '商品标题*': title, + '商品属性*': 'S', # Single variant product + '商品副标题': brief, + '商品描述': description, + 'SEO标题': seo_title, + 'SEO描述': seo_description, + 'SEO URL Handle': handle, + 'SEO URL 重定向': 'N', + 'SEO关键词': seo_keywords, + '商品上架': 'Y', + '需要物流': 'Y', + '商品收税': 'N', + '商品spu': '', + '启用虚拟销量': 'N', + '虚拟销量值': '', + '跟踪库存': 'Y', + '库存规则*': '1', + '专辑名称': csv_data['categoryName'] or '', + '标签': tags, + '供应商名称': csv_data['supplierName'] or '', + '供应商URL': '', + '款式1': '', # Empty for S type + '款式2': '', # Empty for S type + '款式3': '', # Empty for S type + '商品售价*': price, + '商品原价': compare_at_price, + '成本价': cost_price, + '商品SKU': sku_code, + '商品重量': weight, + '重量单位': weight_unit, + '商品条形码': barcode, + '商品库存': inventory_quantity, + '尺寸信息': '', + '原产地国别': '', + 'HS(协调制度)代码': '', + '商品图片*': csv_data['imageUrl'] or '', + '商品备注': '', + '款式备注': '', + '商品主图': csv_data['imageUrl'] or '', + } + + return excel_row + + +def generate_multi_variant_rows(csv_data: dict, base_sku_id: int = 1) -> list: + """ + Generate Excel rows for Multi variant (M+P type) product. + + Returns a list of rows: + - First row: M (主商品) with option names + - Following rows: P (子款式) with option values + + Args: + csv_data: Parsed CSV row data + base_sku_id: Base SKU ID for generating SKU codes + + Returns: + List of dictionaries mapping Excel column names to values + """ + rows = [] + + # Parse create_time + try: + created_at = datetime.strptime(csv_data['create_time'], '%Y-%m-%d %H:%M:%S') + create_time_str = created_at.strftime('%Y-%m-%d %H:%M:%S') + except: + created_at = datetime.now() - timedelta(days=random.randint(1, 365)) + create_time_str = created_at.strftime('%Y-%m-%d %H:%M:%S') + + # Generate title + title = csv_data['name'] or csv_data['enSpuName'] or 'Product' + + # Generate handle + handle_source = csv_data['enSpuName'] or csv_data['name_pinyin'] or title + handle = generate_handle(handle_source) + if handle and not handle.startswith('products/'): + handle = f'products/{handle}' + + # Generate SEO fields + seo_title = f"{title} - {csv_data['categoryName']}" if csv_data['categoryName'] else title + seo_description = f"购买{csv_data['brandName']}{title}" if csv_data['brandName'] else title + seo_keywords_parts = [title] + if csv_data['categoryName']: + seo_keywords_parts.append(csv_data['categoryName']) + if csv_data['brandName']: + seo_keywords_parts.append(csv_data['brandName']) + seo_keywords = ','.join(seo_keywords_parts) + + # Generate tags + tags_parts = [] + if csv_data['categoryName']: + tags_parts.append(csv_data['categoryName']) + if csv_data['brandName']: + tags_parts.append(csv_data['brandName']) + tags = ','.join(tags_parts) if tags_parts else '' + + # Extract material from title (last word after splitting by space) + material = extract_material_from_title(title) + + # Generate color options: randomly select 2-10 colors from COLORS list + num_colors = random.randint(2, 10) + selected_colors = random.sample(COLORS, min(num_colors, len(COLORS))) + + # Generate size options: 1-30, randomly select 4-8 + num_sizes = random.randint(4, 8) + all_sizes = [str(i) for i in range(1, 31)] + selected_sizes = random.sample(all_sizes, num_sizes) + + # Material has only one value + materials = [material] + + # Generate all combinations (Cartesian product) + variants = list(itertools.product(selected_colors, selected_sizes, 
materials))
+
+    # Generate M row (主商品)
+    # Build description (simple HTML wrapper; tag choice assumed)
+    description = f"<div><p>{csv_data['name']}</p></div>
" if csv_data['name'] else '' + brief = csv_data['name'] or '' + + m_row = { + '商品ID': '', + '创建时间': create_time_str, + '商品标题*': title, + '商品属性*': 'M', # Main product + '商品副标题': brief, + '商品描述': description, + 'SEO标题': seo_title, + 'SEO描述': seo_description, + 'SEO URL Handle': handle, + 'SEO URL 重定向': 'N', + 'SEO关键词': seo_keywords, + '商品上架': 'Y', + '需要物流': 'Y', + '商品收税': 'N', + '商品spu': '', + '启用虚拟销量': 'N', + '虚拟销量值': '', + '跟踪库存': 'Y', + '库存规则*': '1', + '专辑名称': csv_data['categoryName'] or '', + '标签': tags, + '供应商名称': csv_data['supplierName'] or '', + '供应商URL': '', + '款式1': 'color', # Option name + '款式2': 'size', # Option name + '款式3': 'material', # Option name + '商品售价*': '', # Empty for M row + '商品原价': '', + '成本价': '', + '商品SKU': '', # Empty for M row + '商品重量': '', + '重量单位': '', + '商品条形码': '', + '商品库存': '', # Empty for M row + '尺寸信息': '', + '原产地国别': '', + 'HS(协调制度)代码': '', + '商品图片*': csv_data['imageUrl'] or '', # Main product image + '商品备注': '', + '款式备注': '', + '商品主图': csv_data['imageUrl'] or '', + } + rows.append(m_row) + + # Generate P rows (子款式) for each variant combination + base_price = round(random.uniform(50, 500), 2) + + for variant_idx, (color, size, mat) in enumerate(variants): + # Generate price variation (within ±20% of base) + price = round(base_price * random.uniform(0.8, 1.2), 2) + compare_at_price = round(price * random.uniform(1.2, 1.5), 2) + cost_price = round(price * 0.6, 2) + + # Generate random stock + inventory_quantity = random.randint(0, 100) + + # Generate random weight + weight = round(random.uniform(0.1, 5.0), 2) + weight_unit = 'kg' + + # Generate SKU code + sku_code = f"{csv_data['skuId']}-{color}-{size}-{mat}" if csv_data['skuId'] else f'SKU-{base_sku_id}-{variant_idx+1}' + + # Generate barcode + barcode = f"BAR{base_sku_id:08d}{variant_idx+1:03d}" + + p_row = { + '商品ID': '', + '创建时间': create_time_str, + '商品标题*': title, # Same as M row + '商品属性*': 'P', # Variant + '商品副标题': '', # Empty for P row + '商品描述': '', # Empty for P row + 'SEO标题': '', # Empty for P row + 'SEO描述': '', # Empty for P row + 'SEO URL Handle': '', # Empty for P row + 'SEO URL 重定向': '', + 'SEO关键词': '', + '商品上架': 'Y', + '需要物流': 'Y', + '商品收税': 'N', + '商品spu': '', + '启用虚拟销量': 'N', + '虚拟销量值': '', + '跟踪库存': 'Y', + '库存规则*': '1', + '专辑名称': '', # Empty for P row + '标签': '', # Empty for P row + '供应商名称': '', # Empty for P row + '供应商URL': '', + '款式1': color, # Option value + '款式2': size, # Option value + '款式3': mat, # Option value + '商品售价*': price, + '商品原价': compare_at_price, + '成本价': cost_price, + '商品SKU': sku_code, + '商品重量': weight, + '重量单位': weight_unit, + '商品条形码': barcode, + '商品库存': inventory_quantity, + '尺寸信息': '', + '原产地国别': '', + 'HS(协调制度)代码': '', + '商品图片*': '', # Empty for P row (uses main product image) + '商品备注': '', + '款式备注': '', + '商品主图': '', + } + rows.append(p_row) + + return rows + + +def read_csv_file(csv_file: str) -> list: + """ + Read CSV file and return list of parsed rows. + + Args: + csv_file: Path to CSV file + + Returns: + List of parsed CSV data dictionaries + """ + csv_data_list = [] + + with open(csv_file, 'r', encoding='utf-8') as f: + reader = csv.DictReader(f) + for row in reader: + parsed = parse_csv_row(row) + csv_data_list.append(parsed) + + return csv_data_list + + +def create_excel_from_template(template_file: str, output_file: str, excel_rows: list): + """ + Create Excel file from template and fill with data rows. 
+ + Args: + template_file: Path to Excel template file + output_file: Path to output Excel file + excel_rows: List of dictionaries mapping Excel column names to values + """ + # Load template + wb = load_workbook(template_file) + ws = wb.active # Use the active sheet (Sheet4) + + # Find header row (row 2) + header_row_idx = 2 + + # Get column mapping from header row + column_mapping = {} + for col_idx in range(1, ws.max_column + 1): + cell_value = ws.cell(row=header_row_idx, column=col_idx).value + if cell_value: + column_mapping[cell_value] = col_idx + + # Start writing data from row 4 + data_start_row = 4 + + # Clear existing data rows + last_template_row = ws.max_row + if last_template_row >= data_start_row: + for row in range(data_start_row, last_template_row + 1): + for col in range(1, ws.max_column + 1): + ws.cell(row=row, column=col).value = None + + # Write data rows + for row_idx, excel_row in enumerate(excel_rows): + excel_row_num = data_start_row + row_idx + + # Write each field to corresponding column + for field_name, col_idx in column_mapping.items(): + if field_name in excel_row: + cell = ws.cell(row=excel_row_num, column=col_idx) + value = excel_row[field_name] + cell.value = value + + # Set alignment + if isinstance(value, str): + cell.alignment = Alignment(vertical='top', wrap_text=True) + elif isinstance(value, (int, float)): + cell.alignment = Alignment(vertical='top') + + # Save workbook + wb.save(output_file) + print(f"Excel file created: {output_file}") + print(f" - Total rows: {len(excel_rows)}") + + +def main(): + parser = argparse.ArgumentParser(description='Convert CSV data to Excel import template with multi-variant support') + parser.add_argument('--csv-file', + default='data/customer1/goods_with_pic.5years_congku.csv.shuf.1w', + help='CSV file path') + parser.add_argument('--template', + default='docs/商品导入模板.xlsx', + help='Excel template file path') + parser.add_argument('--output', + default='商品导入数据.xlsx', + help='Output Excel file path') + parser.add_argument('--limit', + type=int, + default=None, + help='Limit number of products to process') + parser.add_argument('--single-ratio', + type=float, + default=0.3, + help='Ratio of single variant products (default: 0.3 = 30%%)') + parser.add_argument('--seed', + type=int, + default=None, + help='Random seed for reproducible results') + + args = parser.parse_args() + + # Set random seed if provided + if args.seed is not None: + random.seed(args.seed) + + # Check if files exist + if not os.path.exists(args.csv_file): + print(f"Error: CSV file not found: {args.csv_file}") + sys.exit(1) + + if not os.path.exists(args.template): + print(f"Error: Template file not found: {args.template}") + sys.exit(1) + + # Read CSV file + print(f"Reading CSV file: {args.csv_file}") + csv_data_list = read_csv_file(args.csv_file) + print(f"Read {len(csv_data_list)} rows from CSV") + + # Limit products if specified + if args.limit: + csv_data_list = csv_data_list[:args.limit] + print(f"Limited to {len(csv_data_list)} products") + + # Generate Excel rows + print(f"\nGenerating Excel rows...") + print(f" - Single variant ratio: {args.single_ratio*100:.0f}%") + print(f" - Multi variant ratio: {(1-args.single_ratio)*100:.0f}%") + + excel_rows = [] + single_count = 0 + multi_count = 0 + + for idx, csv_data in enumerate(csv_data_list): + # Decide if this product should be single or multi variant + is_single = random.random() < args.single_ratio + + if is_single: + # Generate single variant (S type) + row = 
generate_single_variant_row(csv_data, base_sku_id=idx+1) + excel_rows.append(row) + single_count += 1 + else: + # Generate multi variant (M+P type) + rows = generate_multi_variant_rows(csv_data, base_sku_id=idx+1) + excel_rows.extend(rows) + multi_count += 1 + + print(f"\nGenerated:") + print(f" - Single variant products: {single_count}") + print(f" - Multi variant products: {multi_count}") + print(f" - Total Excel rows: {len(excel_rows)}") + + # Create Excel file + print(f"\nCreating Excel file from template: {args.template}") + print(f"Output file: {args.output}") + create_excel_from_template(args.template, args.output, excel_rows) + + print(f"\nDone! Generated {len(excel_rows)} rows in Excel file.") + + +if __name__ == '__main__': + main() + -- libgit2 0.21.2
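
A quick way to sanity-check a generated workbook is to count the S/M/P rows — a hedged sketch that assumes the default output filename and the template layout described above (header names in row 2, data from row 4):

```python
from collections import Counter
from openpyxl import load_workbook

wb = load_workbook("商品导入数据.xlsx")
ws = wb.active

# Locate the product-type column ("商品属性*") in the header row (row 2).
header = {cell.value: cell.column for cell in ws[2] if cell.value}
type_col = header["商品属性*"]

# Data rows start at row 4; count S (single), M (main), and P (variant) rows.
counts = Counter(
    row[type_col - 1].value
    for row in ws.iter_rows(min_row=4)
    if row[type_col - 1].value in ("S", "M", "P")
)
print(counts)
```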