# Configuration and Pipeline Separation Refactoring
## Overview
Implement a clean separation between **Search Configuration** (tenant-facing, ES/search focused) and the **Data Pipeline** (internal ETL, script-controlled). Configuration files will contain only search engine settings, while data sources and transformation logic will be controlled entirely by script parameters.
## Phase 1: Configuration File Cleanup
### 1.1 Clean BASE Configuration
**File**: [`config/schema/base/config.yaml`](config/schema/base/config.yaml)
**Remove** (data pipeline concerns):
- `mysql_config` section
- `main_table` field
- `sku_table` field  
- `extension_table` field
- `source_table` in field definitions
- `source_column` in field definitions
**Keep** (search configuration):
- `tenant_name`
- `es_index_name`
- `es_settings`
- `fields` (simplified, no source mapping)
- `indexes` (search domains)
- `query_config`
- `function_score`
- `rerank`
- `spu_config`
- `tenant_config` (as template)
- `default_facets`
**Simplify field definitions**:
```yaml
fields:
  - name: "title"
    type: "TEXT"
    analyzer: "chinese_ecommerce"
    boost: 3.0
    index: true
    store: true
    # NO source_table, NO source_column
```
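One way to check that a config file is actually clean is a small lint over the parsed YAML. The following is a minimal sketch, assuming the config has already been loaded into a plain dict; the helper name `find_pipeline_keys` is hypothetical, and the key names are taken from the removal list above.

```python
from typing import Any, Dict, List

# Keys taken from the "Remove" list above.
PIPELINE_TOP_LEVEL_KEYS = ("mysql_config", "main_table", "sku_table", "extension_table")
PIPELINE_FIELD_KEYS = ("source_table", "source_column")


def find_pipeline_keys(config: Dict[str, Any]) -> List[str]:
    """Return dotted paths of data pipeline keys still present in a parsed config."""
    leftovers = [key for key in PIPELINE_TOP_LEVEL_KEYS if key in config]
    for field in config.get("fields", []):
        for key in PIPELINE_FIELD_KEYS:
            if key in field:
                leftovers.append(f"fields.{field.get('name', '?')}.{key}")
    return leftovers
```

An empty result means the config carries no data source information, which is the first success criterion of this plan.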
### 1.2 Update Legacy Configuration
**File**: [`config/schema/tenant1_legacy/config.yaml`](config/schema/tenant1_legacy/config.yaml)
Apply the same cleanup as for the BASE config, and mark the file as legacy in comments.
## Phase 2: Transformer Architecture Refactoring
### 2.1 Create Base Transformer Class
**File**: [`indexer/base_transformer.py`](indexer/base_transformer.py) (NEW)
Create abstract base class with shared logic:
- `__init__` with config, encoders, cache
- `_convert_value()` - type conversion (shared)
- `_generate_text_embeddings()` - text embedding (shared)
- `_generate_image_embeddings()` - image embedding (shared)
- `_inject_tenant_id()` - tenant_id injection (shared)
- `@abstractmethod transform()` - to be implemented by subclasses
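The shape of the base class could look roughly like the sketch below. It is illustrative only: the embedding helpers are omitted, the type names `"INT"` and `"FLOAT"` are assumptions (only `"TEXT"` appears in this plan), and reading `tenant_id` from the config object assumes the runtime attribute proposed in Phase 4.2.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseDataTransformer(ABC):
    """Shared logic for transformers; subclasses implement transform()."""

    def __init__(self, config: Any, text_encoder: Any = None, image_encoder: Any = None) -> None:
        self.config = config
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self._embedding_cache: Dict[str, Any] = {}

    def _convert_value(self, value: Any, field_type: str) -> Any:
        # "INT" and "FLOAT" are assumed type names; everything else becomes a string.
        if value is None:
            return None
        if field_type == "INT":
            return int(value)
        if field_type == "FLOAT":
            return float(value)
        return str(value)

    def _inject_tenant_id(self, doc: Dict[str, Any]) -> Dict[str, Any]:
        # Assumes tenant_id was set on the config at runtime (see Phase 4.2).
        tenant_id = getattr(self.config, "tenant_id", None)
        if tenant_id is None:
            raise ValueError("tenant_id must be set before transforming")
        return {**doc, "tenant_id": tenant_id}

    @abstractmethod
    def transform(self, *args: Any, **kwargs: Any) -> List[Dict[str, Any]]:
        """Turn source rows into ES documents."""
```

Because `transform()` is abstract, instantiating `BaseDataTransformer` directly raises `TypeError`, which forces every transformer to define its own entry point.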
### 2.2 Refactor DataTransformer
**File**: [`indexer/data_transformer.py`](indexer/data_transformer.py)
Changes:
- Inherit from `BaseDataTransformer`
- Remove dependency on `source_table`, `source_column` from config
- Accept field mapping as parameter (from script)
- Implement `transform(df, field_mapping)` method
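The core of a mapping-driven `transform` is a rename from source columns to index fields. The sketch below uses plain dict rows instead of a DataFrame to stay dependency-free, and it assumes the mapping direction is target field to source column, which this plan does not state explicitly. The function name `apply_field_mapping` is hypothetical.

```python
from typing import Any, Dict, Iterable, List


def apply_field_mapping(
    rows: Iterable[Dict[str, Any]],
    field_mapping: Dict[str, str],
) -> List[Dict[str, Any]]:
    """Build documents keyed by target field name from source rows.

    Assumes field_mapping maps target (index) field -> source column.
    Source columns missing from a row are skipped rather than set to None.
    """
    docs = []
    for row in rows:
        doc = {
            target: row[source]
            for target, source in field_mapping.items()
            if source in row
        }
        docs.append(doc)
    return docs
```

Columns that are not named in the mapping never reach the document, so the search config no longer needs to know which source columns exist.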
### 2.3 Refactor SPUDataTransformer
**File**: [`indexer/spu_data_transformer.py`](indexer/spu_data_transformer.py)
Changes:
- Inherit from `BaseDataTransformer`
- Remove dependency on config's table names
- Accept field mapping as parameter
- Implement `transform(spu_df, sku_df, spu_field_mapping, sku_field_mapping)` method
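For the SPU-level transformer, the two mappings are applied separately and the SKU rows are grouped under their parent SPU. This is a hedged sketch on plain dict rows: the join column `spu_id`, the parent key `id`, and the nested `skus` key are all assumptions for illustration, as is the function name `merge_spu_sku`.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def _map_row(row: Row, mapping: Dict[str, str]) -> Row:
    # mapping is assumed to be target field -> source column
    return {target: row[source] for target, source in mapping.items() if source in row}


def merge_spu_sku(
    spu_rows: List[Row],
    sku_rows: List[Row],
    spu_field_mapping: Dict[str, str],
    sku_field_mapping: Dict[str, str],
    join_column: str = "spu_id",  # assumed foreign key on the SKU table
) -> List[Row]:
    """Produce one document per SPU with its mapped SKUs nested under "skus"."""
    skus_by_spu: Dict[Any, List[Row]] = defaultdict(list)
    for sku in sku_rows:
        skus_by_spu[sku.get(join_column)].append(_map_row(sku, sku_field_mapping))

    docs = []
    for spu in spu_rows:
        doc = _map_row(spu, spu_field_mapping)
        doc["skus"] = skus_by_spu.get(spu.get("id"), [])
        docs.append(doc)
    return docs
```

Grouping the SKUs once up front keeps the merge linear in the number of rows instead of scanning the SKU list for every SPU.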
### 2.4 Create Transformer Factory
**File**: [`indexer/transformer_factory.py`](indexer/transformer_factory.py) (NEW)
Factory to create appropriate transformer based on parameters:
```python
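# Sibling modules from this plan. The TenantConfig import path is an assumption
# (Phase 4.2 leaves its location open).
from config import TenantConfig

from .base_transformer import BaseDataTransformer
from .data_transformer import DataTransformer
from .spu_data_transformer import SPUDataTransformer


class TransformerFactory:
    """Create the transformer selected by the --transformer parameter."""

    @staticmethod
    def create(
        transformer_type: str,  # 'sku' or 'spu'
        config: TenantConfig,
        text_encoder=None,
        image_encoder=None
    ) -> BaseDataTransformer:
        if transformer_type == 'spu':
            return SPUDataTransformer(config, text_encoder, image_encoder)
        elif transformer_type == 'sku':
            return DataTransformer(config, text_encoder, image_encoder)
        else:
            raise ValueError(f"Unknown transformer type: {transformer_type}")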
```
### 2.5 Update Package Exports
**File**: [`indexer/__init__.py`](indexer/__init__.py)
Export new structure:
```python
from .base_transformer import BaseDataTransformer
from .data_transformer import DataTransformer
from .spu_data_transformer import SPUDataTransformer
from .transformer_factory import TransformerFactory
__all__ = [
    'BaseDataTransformer',
    'DataTransformer',
    'SPUDataTransformer',
    'TransformerFactory',  # Recommended for new code
    'BulkIndexer',
    'IndexingPipeline',
]
```
## Phase 3: Script Refactoring
### 3.1 Create Unified Ingestion Script
**File**: [`scripts/ingest_universal.py`](scripts/ingest_universal.py) (NEW)
Universal ingestion script with full parameter control:
**Parameters**:
```bash
# Search configuration (pure)
--config base                      # Which search config to use
# Runtime parameters
--tenant-id shop_12345            # REQUIRED tenant identifier
--es-host http://localhost:9200
--es-username elastic
--es-password xxx
# Data source parameters (pipeline concern)
--data-source mysql               # mysql, csv, api, etc.
--mysql-host 120.79.247.228
--mysql-port 3316
--mysql-database saas
--mysql-username saas
--mysql-password xxx
# Transformer parameters (pipeline concern)
--transformer spu                 # spu or sku
--spu-table shoplazza_product_spu
--sku-table shoplazza_product_sku
--shop-id 1                       # Filter by shop_id
# Field mapping (optional, uses defaults if not provided)
--field-mapping mapping.json
# Processing parameters
--batch-size 100
--limit 1000
--skip-embeddings
--recreate-index
```
**Logic**:
1. Load search config (clean, no data source info)
2. Set tenant_id from parameter
3. Connect to data source based on `--data-source` parameter
4. Load data from tables specified by parameters
5. Create transformer based on `--transformer` parameter
6. Apply field mapping (default or custom)
7. Transform and index
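The parameter list above maps naturally onto `argparse`. The sketch below covers a subset of the flags; the flag names come from this plan, while the defaults and the `build_parser` helper name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Universal ingestion script (sketch)")
    # Search configuration
    parser.add_argument("--config", default="base")
    # Runtime parameters
    parser.add_argument("--tenant-id", required=True)
    parser.add_argument("--es-host", default="http://localhost:9200")
    # Data source parameters (pipeline concern)
    parser.add_argument("--data-source", default="mysql")
    parser.add_argument("--spu-table")
    parser.add_argument("--sku-table")
    # Transformer parameters (pipeline concern)
    parser.add_argument("--transformer", choices=["spu", "sku"], default="spu")
    parser.add_argument("--field-mapping")
    # Processing parameters
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--limit", type=int)
    parser.add_argument("--skip-embeddings", action="store_true")
    parser.add_argument("--recreate-index", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Ingesting for tenant {args.tenant_id} with the {args.transformer} transformer")
```

Marking `--tenant-id` as required makes `argparse` reject a run without it before any connection is opened.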
### 3.2 Update BASE Ingestion Script
**File**: [`scripts/ingest_base.py`](scripts/ingest_base.py)
Update to use script parameters instead of config values:
- Remove dependency on `config.mysql_config`
- Remove dependency on `config.main_table`, `config.sku_table`
- Get all data source info from command-line arguments
- Use TransformerFactory
### 3.3 Create Field Mapping Helper
**File**: [`scripts/field_mapping_generator.py`](scripts/field_mapping_generator.py) (NEW)
Helper script to generate default field mappings:
```bash
# Generate default mapping for Shoplazza SPU schema
python scripts/field_mapping_generator.py \
  --source shoplazza \
  --level spu \
  --output mappings/shoplazza_spu.json
```
Output example:
```json
{
  "spu_fields": {
    "id": "id",
    "title": "title",
    "description": "description",
    ...
  },
  "sku_fields": {
    "id": "id",
    "price": "price",
    "sku": "sku",
    ...
  }
}
```
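On the ingestion side, a file in this shape could be loaded with a fallback to built-in defaults when `--field-mapping` is not given. This is a minimal sketch: `load_field_mapping` and `DEFAULT_MAPPING` are hypothetical names, and the defaults only mirror the entries visible in the example output.

```python
import json
from pathlib import Path
from typing import Dict, Optional

FieldMapping = Dict[str, Dict[str, str]]

# Mirrors only the visible entries of the example output; real defaults would be longer.
DEFAULT_MAPPING: FieldMapping = {
    "spu_fields": {"id": "id", "title": "title", "description": "description"},
    "sku_fields": {"id": "id", "price": "price", "sku": "sku"},
}


def load_field_mapping(path: Optional[str] = None) -> FieldMapping:
    """Return the mapping stored at `path`, or a copy of the defaults when no path is given."""
    if path is None:
        return {section: dict(fields) for section, fields in DEFAULT_MAPPING.items()}
    mapping = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(mapping, dict):
        raise ValueError("Field mapping file must contain a JSON object")
    return mapping
```

Returning a copy of the defaults keeps callers from mutating the shared module-level mapping.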
## Phase 4: Configuration Loader Updates
### 4.1 Simplify ConfigLoader
**File**: [`config/config_loader.py`](config/config_loader.py)
Changes:
- Remove parsing of `mysql_config`
- Remove parsing of `main_table`, `sku_table`, `extension_table`
- Remove validation of source_table/source_column in fields
- Simplify field parsing (no source mapping)
- Keep validation of ES/search related config
### 4.2 Update TenantConfig Model
**File**: [`config/__init__.py`](config/__init__.py) or wherever TenantConfig is defined
Remove attributes:
- `mysql_config`
- `main_table`
- `sku_table`
- `extension_table`
Add attributes:
- `tenant_id` (runtime, default None)
Simplify FieldConfig:
- Remove `source_table`
- Remove `source_column`
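After these changes the models might look like the sketch below. It assumes the models are dataclasses and shows only a subset of the search attributes; the default values are illustrative, not taken from the existing code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FieldConfig:
    name: str
    type: str
    analyzer: Optional[str] = None
    boost: float = 1.0
    index: bool = True
    store: bool = False
    # No source_table / source_column: mapping is supplied by the ingestion script.


@dataclass
class TenantConfig:
    es_index_name: str
    fields: List[FieldConfig] = field(default_factory=list)
    es_settings: Dict[str, Any] = field(default_factory=dict)
    # Runtime-only value, set by the ingestion script (for example from --tenant-id).
    tenant_id: Optional[str] = None
```

Keeping `tenant_id` out of the YAML and defaulting it to `None` lets one config file serve many tenants, with the script assigning the value after loading.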
## Phase 5: Documentation Updates
### 5.1 Create Pipeline Guide
**File**: [`docs/DATA_PIPELINE_GUIDE.md`](docs/DATA_PIPELINE_GUIDE.md) (NEW)
Document:
- Separation of concerns (config vs pipeline)
- How to use `ingest_universal.py`
- Default field mappings for common sources
- Custom field mapping examples
- Transformer selection guide
### 5.2 Update BASE Config Guide
**File**: [`docs/BASE_CONFIG_GUIDE.md`](docs/BASE_CONFIG_GUIDE.md)
Update to reflect:
- Config only contains search settings
- No data source configuration
- How tenant_id is injected at runtime
- Examples of using same config with different data sources
### 5.3 Update API Documentation
**File**: [`API_DOCUMENTATION.md`](API_DOCUMENTATION.md)
No changes needed (API layer doesn't know about data pipeline).
### 5.4 Update Design Documentation  
**File**: [`设计文档.md`](设计文档.md)
Add section on configuration architecture:
- Clear separation between search config and pipeline
- Benefits of this approach
- How to extend for new data sources
## Phase 6: Create Default Field Mappings
### 6.1 Shoplazza SPU Mapping
**File**: [`mappings/shoplazza_spu.json`](mappings/shoplazza_spu.json) (NEW)
Default field mapping for Shoplazza SPU/SKU tables to BASE config fields.
### 6.2 Shoplazza SKU Mapping (Legacy)
**File**: [`mappings/shoplazza_sku_legacy.json`](mappings/shoplazza_sku_legacy.json) (NEW)
Default field mapping for legacy SKU-level indexing.
### 6.3 CSV Template Mapping
**File**: [`mappings/csv_template.json`](mappings/csv_template.json) (NEW)
Example mapping for CSV data sources.
## Phase 7: Testing & Validation
### 7.1 Test Script with Different Sources
Test `ingest_universal.py` with:
1. MySQL Shoplazza tables (SPU level)
2. MySQL Shoplazza tables (SKU level, legacy)
3. CSV files (if time permits)
### 7.2 Verify Configuration Portability
Test same BASE config with:
- Different data sources
- Different field mappings
- Different transformers
### 7.3 Update Test Scripts
**File**: [`scripts/test_base.sh`](scripts/test_base.sh)
Update to use new script parameters.
## Phase 8: Migration & Cleanup
### 8.1 Create Migration Guide
**File**: [`docs/CONFIG_MIGRATION_GUIDE.md`](docs/CONFIG_MIGRATION_GUIDE.md) (NEW)
Guide for migrating from old config format to new:
- What changed
- How to update existing configs
- How to update ingestion scripts
- Breaking changes
### 8.2 Update Example Configs
Update all example configurations to new format.
### 8.3 Mark Old Scripts as Deprecated
Add deprecation warnings to scripts that still use old config format.
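A deprecation notice can be emitted with the standard `warnings` module. The sketch below is one possible form; the helper name `warn_legacy_config` and the message wording are assumptions.

```python
import warnings


def warn_legacy_config(script_name: str) -> None:
    """Emit a DeprecationWarning pointing users at ingest_universal.py."""
    warnings.warn(
        f"{script_name} reads data source settings from the config file, which is "
        "deprecated; use scripts/ingest_universal.py with command-line parameters instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this helper
    )
```

Python's default filters only show `DeprecationWarning` when it is attributed to code in `__main__`, so the helper should be called from the top level of the old script for the message to be visible.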
## Key Design Principles
### 1. Separation of Concerns
**Search Configuration** (tenant-facing):
- What fields exist in ES
- How fields are analyzed/indexed
- Search strategies and ranking
- Facets and aggregations
- Query processing rules
**Data Pipeline** (internal):
- Where data comes from
- How to connect to data sources
- Which tables/files to read
- How to transform data
- Field mapping logic
### 2. Configuration Portability
Same search config can be used with:
- Different data sources (MySQL, CSV, API)
- Different schemas (with appropriate mapping)
- Different transformation strategies
### 3. Flexibility
Pipeline decisions (transformer, data source, field mapping) are made at runtime through script parameters, not stored in the config.
## Migration Path
### For Existing Users
1. Update config files (remove data source settings)
2. Update ingestion commands (add new parameters)
3. Optionally create field mapping files for convenience
### For New Users
1. Copy BASE config (already clean)
2. Run `ingest_universal.py` with appropriate parameters
3. Provide custom field mapping if needed
## Success Criteria
- [ ] BASE config contains ZERO data source information
- [ ] Same config works with MySQL and CSV sources
- [ ] Pipeline fully controlled by script parameters
- [ ] Transformers work with external field mapping
- [ ] Documentation clearly separates concerns
- [ ] Tests validate portability
- [ ] Migration guide provided
## Estimated Effort
- Configuration cleanup: 2 hours
- Transformer refactoring: 4-5 hours
- Script refactoring: 3-4 hours
- Config loader updates: 2 hours
- Documentation: 2-3 hours
- Testing & validation: 2-3 hours
- **Total: 15-19 hours**
## Benefits
✅ **Clean separation of concerns**
✅ **Configuration reusability across data sources**
✅ **Tenant doesn't need to understand ETL**
✅ **Easier to add new data sources**
✅ **More flexible pipeline control**
✅ **Reduced configuration complexity**
### To-dos
- [ ] Clean BASE and legacy configs: remove mysql_config, table names, source_table/source_column from fields
- [ ] Create BaseDataTransformer abstract class with shared logic (type conversion, embeddings, tenant_id)
- [ ] Refactor DataTransformer and SPUDataTransformer to inherit from base, accept field mapping as parameter
- [ ] Create TransformerFactory for creating transformers based on type parameter
- [ ] Create ingest_universal.py with full parameter control for data source, transformer, field mapping
- [ ] Update scripts/ingest_base.py to use parameters instead of config for data source
- [ ] Create field_mapping_generator.py and default mapping files (shoplazza_spu.json, etc.)
- [ ] Simplify ConfigLoader to only parse search config, remove data source parsing
- [ ] Create DATA_PIPELINE_GUIDE.md documenting pipeline approach and config separation
- [ ] Update BASE_CONFIG_GUIDE.md to reflect config-only-search-settings approach
- [ ] Create CONFIG_MIGRATION_GUIDE.md for migrating from old to new config format
- [ ] Test same config with different data sources and validate portability