Configuration and Pipeline Separation Refactoring
Overview
Implement clean separation between Search Configuration (tenant-facing, ES/search focused) and Data Pipeline (internal ETL, script-controlled). Configuration files will only contain search engine settings, while data source and transformation logic will be controlled entirely by script parameters.
Phase 1: Configuration File Cleanup
1.1 Clean BASE Configuration
File: <code>config/schema/base/config.yaml</code>
Remove (data pipeline concerns):
- mysql_config section
- main_table field
- sku_table field
- extension_table field
- source_table in field definitions
- source_column in field definitions
Keep (search configuration):
- tenant_name
- es_index_name
- es_settings
- fields (simplified, no source mapping)
- indexes (search domains)
- query_config
- function_score
- rerank
- spu_config
- tenant_config (as template)
- default_facets
Simplify field definitions:
fields:
  - name: "title"
    type: "TEXT"
    analyzer: "chinese_ecommerce"
    boost: 3.0
    index: true
    store: true
    # NO source_table, NO source_column
1.2 Update Legacy Configuration
File: <code>config/schema/tenant1_legacy/config.yaml</code>
Apply same cleanup as BASE config, marking it as legacy in comments.
Phase 2: Transformer Architecture Refactoring
2.1 Create Base Transformer Class
File: <code>indexer/base_transformer.py</code> (NEW)
Create abstract base class with shared logic:
- __init__ with config, encoders, cache
- _convert_value() - type conversion (shared)
- _generate_text_embeddings() - text embedding (shared)
- _generate_image_embeddings() - image embedding (shared)
- _inject_tenant_id() - tenant_id injection (shared)
- @abstractmethod transform() - to be implemented by subclasses
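A minimal sketch of what such a base class could look like. The method names come from the list above; the constructor signature, the cache attribute, and the type names other than "TEXT" are assumptions made for illustration, and the embedding helpers are left out to keep the example dependency-free.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseDataTransformer(ABC):
    """Shared logic for all transformers; subclasses implement transform()."""

    def __init__(self, config: Any, text_encoder: Optional[Any] = None,
                 image_encoder: Optional[Any] = None) -> None:
        self.config = config
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        # Hypothetical cache layout: embedding input -> embedding vector.
        self._embedding_cache: Dict[str, Any] = {}

    def _convert_value(self, value: Any, field_type: str) -> Any:
        """Convert a raw source value to the type declared in the search config."""
        if value is None:
            return None
        if field_type == "INT":      # assumed type name
            return int(value)
        if field_type == "FLOAT":    # assumed type name
            return float(value)
        return str(value)            # "TEXT" and anything unrecognized

    def _inject_tenant_id(self, doc: Dict[str, Any], tenant_id: str) -> Dict[str, Any]:
        """Return a copy of the document with tenant_id attached."""
        return {**doc, "tenant_id": tenant_id}

    @abstractmethod
    def transform(self, *args: Any, **kwargs: Any) -> List[Dict[str, Any]]:
        """Turn source rows into ES documents; signature varies by subclass."""
```

Because transform() is abstract, instantiating BaseDataTransformer directly raises TypeError, which forces every concrete transformer to supply its own mapping logic.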
2.2 Refactor DataTransformer
File: <code>indexer/data_transformer.py</code>
Changes:
- Inherit from BaseDataTransformer
- Remove dependency on source_table, source_column from config
- Accept field mapping as parameter (from script)
- Implement transform(df, field_mapping) method
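The core of a mapping-driven transform can be sketched as follows. To stay dependency-free, this example takes rows as plain dicts rather than a DataFrame, and it assumes the mapping goes from ES field name to source column name; both choices, and the function name apply_field_mapping, are illustrative assumptions rather than the planned implementation.

```python
from typing import Any, Dict, Iterable, List


def apply_field_mapping(rows: Iterable[Dict[str, Any]],
                        field_mapping: Dict[str, str]) -> List[Dict[str, Any]]:
    """Build one document per row, keeping only mapped, non-null fields.

    field_mapping is assumed to map ES field name -> source column name.
    """
    docs = []
    for row in rows:
        doc = {
            es_field: row[column]
            for es_field, column in field_mapping.items()
            if column in row and row[column] is not None
        }
        docs.append(doc)
    return docs
```

Since the mapping arrives as an argument, the same function serves any source schema without reading table or column names from the search config.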
2.3 Refactor SPUDataTransformer
File: <code>indexer/spu_data_transformer.py</code>
Changes:
- Inherit from BaseDataTransformer
- Remove dependency on config's table names
- Accept field mapping as parameter
- Implement transform(spu_df, sku_df, spu_field_mapping, sku_field_mapping) method
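An SPU-level transform has to attach each product's SKUs to its parent document. The sketch below shows one way to do that with plain dicts; the join columns ("id" on the SPU side, "spu_id" on the SKU side), the nested "skus" field, and the function name are assumptions for illustration, not names confirmed by the schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def build_spu_documents(spu_rows: Iterable[Dict[str, Any]],
                        sku_rows: Iterable[Dict[str, Any]],
                        spu_field_mapping: Dict[str, str],
                        sku_field_mapping: Dict[str, str],
                        spu_key: str = "id",
                        sku_parent_key: str = "spu_id") -> List[Dict[str, Any]]:
    """Return one document per SPU with its mapped SKUs nested under 'skus'."""
    # Group mapped SKU rows by the SPU they belong to.
    skus_by_spu: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for sku in sku_rows:
        nested = {target: sku.get(source) for target, source in sku_field_mapping.items()}
        skus_by_spu[sku[sku_parent_key]].append(nested)

    docs = []
    for spu in spu_rows:
        doc = {target: spu.get(source) for target, source in spu_field_mapping.items()}
        doc["skus"] = skus_by_spu.get(spu[spu_key], [])  # empty list if no SKUs
        docs.append(doc)
    return docs
```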
2.4 Create Transformer Factory
File: <code>indexer/transformer_factory.py</code> (NEW)
Factory to create appropriate transformer based on parameters:
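from config import TenantConfig  # adjust if TenantConfig is defined elsewhere
from .base_transformer import BaseDataTransformer
from .data_transformer import DataTransformer
from .spu_data_transformer import SPUDataTransformer


class TransformerFactory:
    @staticmethod
    def create(
        transformer_type: str,  # 'sku' or 'spu'
        config: TenantConfig,
        text_encoder=None,
        image_encoder=None
    ) -> BaseDataTransformer:
        if transformer_type == 'spu':
            return SPUDataTransformer(config, text_encoder, image_encoder)
        elif transformer_type == 'sku':
            return DataTransformer(config, text_encoder, image_encoder)
        else:
            raise ValueError(f"Unknown transformer type: {transformer_type}")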
2.5 Update Package Exports
File: <code>indexer/__init__.py</code>
Export new structure:
from .base_transformer import BaseDataTransformer
from .data_transformer import DataTransformer
from .spu_data_transformer import SPUDataTransformer
from .transformer_factory import TransformerFactory
# Existing imports for BulkIndexer and IndexingPipeline stay as they are.

__all__ = [
    'BaseDataTransformer',
    'DataTransformer',
    'SPUDataTransformer',
    'TransformerFactory',  # Recommended for new code
    'BulkIndexer',
    'IndexingPipeline',
]
Phase 3: Script Refactoring
3.1 Create Unified Ingestion Script
File: <code>scripts/ingest_universal.py</code> (NEW)
Universal ingestion script with full parameter control:
Parameters:
# Search configuration (pure)
--config base # Which search config to use
# Runtime parameters
--tenant-id shop_12345 # REQUIRED tenant identifier
--es-host http://localhost:9200
--es-username elastic
--es-password xxx
# Data source parameters (pipeline concern)
--data-source mysql # mysql, csv, api, etc.
--mysql-host 120.79.247.228
--mysql-port 3316
--mysql-database saas
--mysql-username saas
--mysql-password xxx
# Transformer parameters (pipeline concern)
--transformer spu # spu or sku
--spu-table shoplazza_product_spu
--sku-table shoplazza_product_sku
--shop-id 1 # Filter by shop_id
# Field mapping (optional, uses defaults if not provided)
--field-mapping mapping.json
# Processing parameters
--batch-size 100
--limit 1000
--skip-embeddings
--recreate-index
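The parameters above could be declared with argparse roughly as follows. The flag names are taken from the list; the defaults, the restriction of --data-source to three choices, and the function name build_parser are assumptions for this sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Universal ingestion script (sketch)")

    # Search configuration and runtime parameters
    parser.add_argument("--config", default="base", help="Which search config to use")
    parser.add_argument("--tenant-id", required=True, help="Tenant identifier")
    parser.add_argument("--es-host", default="http://localhost:9200")
    parser.add_argument("--es-username")
    parser.add_argument("--es-password")

    # Data source parameters (pipeline concern)
    parser.add_argument("--data-source", choices=["mysql", "csv", "api"], default="mysql")
    parser.add_argument("--mysql-host")
    parser.add_argument("--mysql-port", type=int, default=3306)  # assumed default
    parser.add_argument("--mysql-database")
    parser.add_argument("--mysql-username")
    parser.add_argument("--mysql-password")

    # Transformer parameters (pipeline concern)
    parser.add_argument("--transformer", choices=["spu", "sku"], default="spu")
    parser.add_argument("--spu-table")
    parser.add_argument("--sku-table")
    parser.add_argument("--shop-id", type=int, help="Filter by shop_id")
    parser.add_argument("--field-mapping", help="Path to a JSON mapping file")

    # Processing parameters
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--limit", type=int)
    parser.add_argument("--skip-embeddings", action="store_true")
    parser.add_argument("--recreate-index", action="store_true")
    return parser
```

Marking --tenant-id as required makes argparse exit with a usage error when it is missing, so a run can never index documents without a tenant.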
Logic:
- Load search config (clean, no data source info)
- Set tenant_id from parameter
- Connect to data source based on --data-source parameter
- Load data from tables specified by parameters
- Create transformer based on --transformer parameter
- Apply field mapping (default or custom)
- Transform and index
3.2 Update BASE Ingestion Script
File: <code>scripts/ingest_base.py</code>
Update to use script parameters instead of config values:
- Remove dependency on config.mysql_config
- Remove dependency on config.main_table, config.sku_table
- Get all data source info from command-line arguments
- Use TransformerFactory
3.3 Create Field Mapping Helper
File: <code>scripts/field_mapping_generator.py</code> (NEW)
Helper script to generate default field mappings:
# Generate default mapping for Shoplazza SPU schema
python scripts/field_mapping_generator.py \
--source shoplazza \
--level spu \
--output mappings/shoplazza_spu.json
Output example:
{
  "spu_fields": {
    "id": "id",
    "title": "title",
    "description": "description",
    ...
  },
  "sku_fields": {
    "id": "id",
    "price": "price",
    "sku": "sku",
    ...
  }
}
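On the consuming side, the ingestion script needs to load such a file or fall back to defaults when --field-mapping is omitted. A minimal sketch, assuming the two-section layout shown above; the function name load_field_mapping and the DEFAULT_MAPPING constant are hypothetical, and the defaults list only the fields visible in the example.

```python
import json
from pathlib import Path
from typing import Dict, Optional

Mapping = Dict[str, Dict[str, str]]

# Only the fields shown in the example output; a real default would list more.
DEFAULT_MAPPING: Mapping = {
    "spu_fields": {"id": "id", "title": "title", "description": "description"},
    "sku_fields": {"id": "id", "price": "price", "sku": "sku"},
}


def load_field_mapping(path: Optional[str] = None) -> Mapping:
    """Load a mapping file, or return a copy of the defaults if no path is given."""
    if path is None:
        # Copy each section so callers cannot mutate the module-level defaults.
        return {section: dict(fields) for section, fields in DEFAULT_MAPPING.items()}
    mapping = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = {"spu_fields", "sku_fields"} - set(mapping)
    if missing:
        raise ValueError(f"Field mapping is missing sections: {sorted(missing)}")
    return mapping
```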
Phase 4: Configuration Loader Updates
4.1 Simplify ConfigLoader
File: <code>config/config_loader.py</code>
Changes:
- Remove parsing of mysql_config
- Remove parsing of main_table, sku_table, extension_table
- Remove validation of source_table/source_column in fields
- Simplify field parsing (no source mapping)
- Keep validation of ES/search related config
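During the transition, old config files may still carry the removed keys. One option, sketched below, is for the loader to drop them and emit a warning instead of failing; this behavior, the function name strip_pipeline_settings, and the constants are assumptions, not part of the plan as written.

```python
import warnings
from typing import Any, Dict, List

PIPELINE_KEYS = ("mysql_config", "main_table", "sku_table", "extension_table")
PIPELINE_FIELD_KEYS = ("source_table", "source_column")


def strip_pipeline_settings(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a parsed config dict without data pipeline settings."""
    found: List[str] = [key for key in PIPELINE_KEYS if key in raw]
    cleaned = {key: value for key, value in raw.items() if key not in PIPELINE_KEYS}

    if "fields" in raw:
        cleaned_fields = []
        for field_def in raw["fields"]:
            found.extend(
                f"fields[{field_def.get('name')}].{key}"
                for key in PIPELINE_FIELD_KEYS if key in field_def
            )
            cleaned_fields.append(
                {k: v for k, v in field_def.items() if k not in PIPELINE_FIELD_KEYS}
            )
        cleaned["fields"] = cleaned_fields

    if found:
        warnings.warn(
            "Ignoring data pipeline settings in search config: " + ", ".join(found),
            DeprecationWarning,
            stacklevel=2,
        )
    return cleaned
```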
4.2 Update TenantConfig Model
File: <code>config/__init__.py</code> or wherever TenantConfig is defined
Remove attributes:
- mysql_config
- main_table
- sku_table
- extension_table
Add attributes:
- tenant_id (runtime, default None)
Simplify FieldConfig:
- Remove source_table
- Remove source_column
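After these changes the models might look like the dataclass sketch below. Only a few attributes are shown; the field defaults and the with_tenant_id helper are assumptions added for illustration, and the real TenantConfig carries more search settings (es_settings, indexes, query_config, and so on).

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional


@dataclass
class FieldConfig:
    # No source_table / source_column: where the value comes from is a pipeline concern.
    name: str
    type: str
    analyzer: Optional[str] = None
    boost: float = 1.0   # assumed default
    index: bool = True   # assumed default
    store: bool = False  # assumed default


@dataclass
class TenantConfig:
    tenant_name: str
    es_index_name: str
    fields: List[FieldConfig] = field(default_factory=list)
    tenant_id: Optional[str] = None  # set at runtime, never read from the config file


def with_tenant_id(config: TenantConfig, tenant_id: str) -> TenantConfig:
    """Return a copy of the config bound to one tenant, leaving the original untouched."""
    return replace(config, tenant_id=tenant_id)
```

Returning a copy rather than mutating the loaded config keeps one parsed search config reusable across several tenants in the same process.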
Phase 5: Documentation Updates
5.1 Create Pipeline Guide
File: <code>docs/DATA_PIPELINE_GUIDE.md</code> (NEW)
Document:
- Separation of concerns (config vs pipeline)
- How to use ingest_universal.py
- Default field mappings for common sources
- Custom field mapping examples
- Transformer selection guide
5.2 Update BASE Config Guide
File: <code>docs/BASE_CONFIG_GUIDE.md</code>
Update to reflect:
- Config only contains search settings
- No data source configuration
- How tenant_id is injected at runtime
- Examples of using same config with different data sources
5.3 Update API Documentation
File: <code>API_DOCUMENTATION.md</code>
No changes needed (API layer doesn't know about data pipeline).
5.4 Update Design Documentation
File: <code>设计文档.md</code>
Add section on configuration architecture:
- Clear separation between search config and pipeline
- Benefits of this approach
- How to extend for new data sources
Phase 6: Create Default Field Mappings
6.1 Shoplazza SPU Mapping
File: <code>mappings/shoplazza_spu.json</code> (NEW)
Default field mapping for Shoplazza SPU/SKU tables to BASE config fields.
6.2 Shoplazza SKU Mapping (Legacy)
File: <code>mappings/shoplazza_sku_legacy.json</code> (NEW)
Default field mapping for legacy SKU-level indexing.
6.3 CSV Template Mapping
File: <code>mappings/csv_template.json</code> (NEW)
Example mapping for CSV data sources.
Phase 7: Testing & Validation
7.1 Test Script with Different Sources
Test ingest_universal.py with:
- MySQL Shoplazza tables (SPU level)
- MySQL Shoplazza tables (SKU level, legacy)
- CSV files (if time permits)
7.2 Verify Configuration Portability
Test same BASE config with:
- Different data sources
- Different field mappings
- Different transformers
7.3 Update Test Scripts
File: <code>scripts/test_base.sh</code>
Update to use new script parameters.
Phase 8: Migration & Cleanup
8.1 Create Migration Guide
File: <code>docs/CONFIG_MIGRATION_GUIDE.md</code> (NEW)
Guide for migrating from old config format to new:
- What changed
- How to update existing configs
- How to update ingestion scripts
- Breaking changes
8.2 Update Example Configs
Update all example configurations to new format.
8.3 Mark Old Scripts as Deprecated
Add deprecation warnings to scripts that still use old config format.
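A deprecation notice can be issued with the standard warnings module. This is a minimal sketch; the helper name warn_deprecated_script and the message wording are assumptions.

```python
import warnings


def warn_deprecated_script(script_name: str,
                           replacement: str = "scripts/ingest_universal.py") -> None:
    """Emit a DeprecationWarning pointing users at the replacement script."""
    warnings.warn(
        f"{script_name} relies on the old config format and is deprecated; "
        f"use {replacement} instead.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the calling script, not this helper
    )
```

Calling the helper near the top of each old script keeps the message consistent and makes it easy to find every deprecated entry point later.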
Key Design Principles
1. Separation of Concerns
Search Configuration (tenant-facing):
- What fields exist in ES
- How fields are analyzed/indexed
- Search strategies and ranking
- Facets and aggregations
- Query processing rules
Data Pipeline (internal):
- Where data comes from
- How to connect to data sources
- Which tables/files to read
- How to transform data
- Field mapping logic
2. Configuration Portability
Same search config can be used with:
- Different data sources (MySQL, CSV, API)
- Different schemas (with appropriate mapping)
- Different transformation strategies
3. Flexibility
Pipeline decisions (transformer, data source, field mapping) are made at runtime through script parameters, not in config.
Migration Path
For Existing Users
- Update config files (remove data source settings)
- Update ingestion commands (add new parameters)
- Optionally create field mapping files for convenience
For New Users
- Copy BASE config (already clean)
- Run ingest_universal.py with appropriate parameters
- Provide custom field mapping if needed
Success Criteria
- [ ] BASE config contains ZERO data source information
- [ ] Same config works with MySQL and CSV sources
- [ ] Pipeline fully controlled by script parameters
- [ ] Transformers work with external field mapping
- [ ] Documentation clearly separates concerns
- [ ] Tests validate portability
- [ ] Migration guide provided
Estimated Effort
- Configuration cleanup: 2 hours
- Transformer refactoring: 4-5 hours
- Script refactoring: 3-4 hours
- Config loader updates: 2 hours
- Documentation: 2-3 hours
- Testing & validation: 2-3 hours
- Total: 15-19 hours
Benefits
✅ Clean separation of concerns
✅ Configuration reusability across data sources
✅ Tenant doesn't need to understand ETL
✅ Easier to add new data sources
✅ More flexible pipeline control
✅ Reduced configuration complexity
To-dos
- [ ] Clean BASE and legacy configs: remove mysql_config, table names, source_table/source_column from fields
- [ ] Create BaseDataTransformer abstract class with shared logic (type conversion, embeddings, tenant_id)
- [ ] Refactor DataTransformer and SPUDataTransformer to inherit from base, accept field mapping as parameter
- [ ] Create TransformerFactory for creating transformers based on type parameter
- [ ] Create ingest_universal.py with full parameter control for data source, transformer, field mapping
- [ ] Update scripts/ingest_base.py to use parameters instead of config for data source
- [ ] Create field_mapping_generator.py and default mapping files (shoplazza_spu.json, etc.)
- [ ] Simplify ConfigLoader to only parse search config, remove data source parsing
- [ ] Create DATA_PIPELINE_GUIDE.md documenting pipeline approach and config separation
- [ ] Update BASE_CONFIG_GUIDE.md to reflect config-only-search-settings approach
- [ ] Create CONFIG_MIGRATION_GUIDE.md for migrating from old to new config format
- [ ] Test same config with different data sources and validate portability