# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a comprehensive **Recommendation System** built for a B2B e-commerce platform. The system generates offline recommendation indices including item-to-item similarity (i2i) and interest aggregation indices, supporting online recommendation services with high-performance algorithms.

**Tech Stack**: Python 3.x, Pandas, NumPy, NetworkX, Gensim, C++ (Swing algorithm), Redis, Elasticsearch, MySQL

## 🏗️ System Architecture

### High-Level Components

```
recommendation/
├── offline_tasks/          # Main offline processing engine
├── graphembedding/         # Graph-based embedding algorithms
├── refers/                 # Reference materials and data
├── config.py               # Global configuration
└── requirements.txt        # Python dependencies
```

### Core Modules

#### 1. Offline Tasks (`/offline_tasks/`)

**Primary Purpose**: Generate recommendation indices through various ML algorithms

**Key Features**:
- **4 i2i similarity algorithms**: Swing (C++ & Python), Session W2V, DeepWalk, Content-based
- **11-dimension interest aggregation**: Platform, client, category, supplier dimensions
- **Automated pipeline**: One-command execution with memory monitoring
- **High-performance C++ integration**: 10-100x faster Swing implementation

**Directory Structure**:
```
offline_tasks/
├── scripts/                          # All algorithm implementations
│   ├── fetch_item_attributes.py      # Preprocessing: item metadata
│   ├── generate_session.py           # Preprocessing: user sessions
│   ├── i2i_swing.py                  # Swing algorithm (Python)
│   ├── i2i_session_w2v.py            # Session Word2Vec
│   ├── i2i_deepwalk.py               # DeepWalk with tag-based walks
│   ├── i2i_content_similar.py        # Content-based similarity
│   ├── interest_aggregation.py       # Multi-dimensional aggregation
│   └── load_index_to_redis.py        # Redis import
├── collaboration/                    # C++ Swing algorithm (high-performance)
│   ├── src/                          # C++ source files
│   ├── run.sh                        # Build and execute script
│   └── output/                       # C++ algorithm outputs
├── config/
│   └── offline_config.py             # Configuration file
├── doc/                              # Comprehensive documentation
├── output/                           # Generated indices
├── logs/                             # Execution logs
├── run.sh                            # Main execution script (⭐ RECOMMENDED)
└── README.md                         # Module documentation
```

#### 2. Graph Embedding (`/graphembedding/`)

**Purpose**: Advanced graph-based embedding algorithms for content-aware recommendations

**Components**:
- **DeepWalk**: Enhanced with tag-based random walks for diversity
- **Session W2V**: Session-based word embeddings
- **Improvements**: Tag-based walks, Softmax sampling, multi-process support

#### 3. Configuration (`/config.py`)

**Global Settings**:
- **Elasticsearch**: Host, credentials, index configuration
- **Redis**: Cache configuration, timeouts, expiration policies
- **Database**: External database connection parameters

## 🚀 Development Workflow

### Quick Start

```bash
# 1. Install dependencies
cd /data/tw/recommendation/offline_tasks
bash install.sh

# 2. Test connections
python3 test_connection.py

# 3. Run full pipeline (recommended)
bash run.sh

# 4. Run individual algorithms
python3 scripts/i2i_swing.py --lookback_days 730 --debug
python3 scripts/interest_aggregation.py --lookback_days 730 --top_n 1000
```

### Common Development Commands

**Setup and Installation:**
```bash
# Install Python dependencies
pip install -r requirements.txt

# Build C++ Swing algorithm
cd offline_tasks/collaboration && make

# Activate conda environment (required)
conda activate tw
```

**Testing:**
```bash
# Test database and Redis connections
python3 offline_tasks/test_connection.py

# Test Elasticsearch connection
python3 offline_tasks/scripts/test_es_connection.py
```

**Build and Compilation:**
```bash
# Build C++ algorithms
cd offline_tasks/collaboration
make clean && make

# Clean build artifacts
make clean
```

**Running Individual Components:**
```bash
# Generate session data
python3 offline_tasks/scripts/generate_session.py --lookback_days 730 --debug

# Run C++ Swing algorithm
cd offline_tasks/collaboration && bash run.sh

# Load indices to Redis
python3 offline_tasks/scripts/load_index_to_redis.py --redis-host localhost --redis-port 6379
```

### Algorithm Execution Order

The system follows this optimized execution pipeline:

1. **Preprocessing Tasks** (Run once per session)
   - `fetch_item_attributes.py` → Item metadata mapping
   - `generate_session.py` → User behavior sessions

2. **Core Algorithms**
   - C++ Swing (`collaboration/run.sh`) → High-performance similarity
   - Python Swing (`i2i_swing.py`) → Time-aware similarity
   - Session W2V (`i2i_session_w2v.py`) → Sequence-based similarity
   - DeepWalk (`i2i_deepwalk.py`) → Graph-based embeddings
   - Content Similarity (`i2i_content_similar.py`) → Attribute-based

3.
   **Post-processing**
   - Interest Aggregation (`interest_aggregation.py`) → Multi-dimensional indices
   - Redis Import (`load_index_to_redis.py`) → Online serving

### Key Configuration Files

#### Main Configuration (`offline_config.py`)
```python
# Critical settings
DEFAULT_LOOKBACK_DAYS = 730      # Historical data window
DEFAULT_I2I_TOP_N = 50           # Similar items per product
DEFAULT_INTEREST_TOP_N = 1000    # Aggregated items per dimension

# Algorithm parameters
I2I_CONFIG = {
    'swing': {'alpha': 0.5, 'threshold1': 0.5, 'threshold2': 0.5},
    'session_w2v': {'vector_size': 128, 'window_size': 5},
    'deepwalk': {'num_walks': 10, 'walk_length': 40}
}

# Behavior weights for different user actions
behavior_weights = {
    'purchase': 10.0,
    'contactFactory': 5.0,
    'addToCart': 3.0,
    'addToPool': 2.0
}
```

#### Database Configuration (`config.py`)
```python
# External database
DB_CONFIG = {
    'host': 'selectdb-cn-wuf3vsokg05-public.selectdbfe.rds.aliyuncs.com',
    'port': '9030',
    'database': 'datacenter',
    'username': 'readonly',
    'password': 'essa1234'
}

# Redis for online serving
REDIS_CONFIG = {
    'host': 'localhost',
    'port': 6379,
    'cache_expire_days': 180
}
```

## 🔧 Key Algorithms & Features

### 1. Swing Algorithm (Dual Implementation)

**C++ Version** (Production):
- **Performance**: 10-100x faster than Python
- **Use Case**: Large-scale production processing
- **Output**: Raw similarity scores
- **Location**: `collaboration/`

**Python Version** (Development/Enhanced):
- **Features**: Time decay, daily session support
- **Use Case**: Development, debugging, parameter tuning
- **Output**: Normalized scores with readable names
- **Location**: `scripts/i2i_swing.py`

### 2. DeepWalk with Tag Enhancement

**Innovative Features**:
- **Tag-based walks**: 20% probability of content-guided walks
- **Softmax sampling**: Temperature-controlled diversity
- **Multi-process**: Parallel walk generation
- **Purpose**: Solves recommendation homogeneity issues

### 3. Interest Aggregation

**Multi-dimensional Support**:
- **7 single dimensions**: platform, client_platform, supplier, category_level1-4
- **4 combined dimensions**: platform_client, platform_category2/3, client_category2
- **List types**: hot (popular), cart (cart additions), new (recent), plus global (overall)

## 📊 Data Pipeline

### Input Data Sources
- **User Behavior**: Purchase, contact, cart, pool interactions
- **Item Metadata**: Categories, suppliers, attributes
- **Session Data**: Time-weighted user behavior sequences

### Output Formats
```
# i2i Similarity (item-to-item)
item_id \t similar_id1:score1,similar_id2:score2,...

# Interest Aggregation
dimension:value \t item_id1,item_id2,item_id3,...

# Redis Keys
item:similar:swing_cpp:12345
interest:hot:platform:pc
```

### Storage Architecture
- **Redis**: Fast online serving (400MB memory footprint)
- **Elasticsearch**: Vector similarity search
- **Local Files**: Raw algorithm outputs for debugging

## 🛠️ Development Guidelines

### Adding New Algorithms

1. **Create script in `scripts/`**:
   ```python
   # Import the shared helpers from db_service, config.offline_config, and debug_utils.
   # Follow the existing pattern: fetch_data → process → save_output
   ```

2. **Update configuration** in `offline_config.py`:
   ```python
   # value1 and value2 are placeholders; replace them with real settings
   NEW_ALGORITHM_CONFIG = {
       'param1': value1,
       'param2': value2
   }
   ```

3. **Add to execution pipeline** in `run.sh` or `run_all.py`

### Debugging Practices
- **Use debug mode**: `--debug` flag for readable outputs
- **Check logs**: `logs/run_all_YYYYMMDD.log`
- **Validate data**: `debug_utils.py` provides data validation
- **Monitor memory**: System includes memory monitoring

### Performance Optimization
- **Database optimization**: Preprocessing reduces queries by 80-90%
- **C++ integration**: Critical for production performance
- **Parallel processing**: Multi-threaded algorithms
- **Memory management**: Configurable thresholds and monitoring

### Code Quality

This codebase does not have formal linting or testing frameworks configured.
When making changes: - **Python**: Follow PEP 8 style guidelines - **C++**: Use the existing coding style in collaboration/src/ - **No formal unit tests**: Test functionality manually using the debug modes - **Manual testing**: Use `--debug` flags for readable outputs during development ## ๐Ÿ”„ Maintenance & Operations ### Daily Execution ```bash # Recommended production command 0 2 * * * cd /data/tw/recommendation/offline_tasks && bash run.sh ``` ### Monitoring - **Logs**: `logs/` directory with date-based rotation - **Memory**: Built-in memory monitoring with kill thresholds - **Output Validation**: Automated data quality checks - **Error Handling**: Comprehensive logging and recovery ### Backup Strategy - **Output files**: Daily snapshots in `output/` - **Configuration**: Version-controlled configs - **Logs**: 180-day retention with cleanup ## ๐ŸŽฏ Key Architecture Decisions ### 1. Hybrid Algorithm Approach - **Problem**: Python Swing too slow for production (can take hours) - **Solution**: C++ core for performance + Python for flexibility and debugging - **Benefit**: C++ version is 10-100x faster, Python version provides enhanced features and readability ### 2. Preprocessing Optimization - **Problem**: Repeated database queries across algorithms - **Solution**: Centralized metadata and session generation via `fetch_item_attributes.py` and `generate_session.py` - **Benefit**: 80-90% reduction in database load ### 3. Multi-dimensional Interest Aggregation - **Problem**: Need for flexible recommendation personalization - **Solution**: 11 dimensions with 3 list types each - **Benefit**: Supports diverse business scenarios ### 4. Tag-enhanced DeepWalk - **Problem**: Recommendation homogeneity - **Solution**: Content-aware random walks - **Benefit**: Improved diversity and serendipity ### 5. 
Environment Management - **Problem**: Dependency isolation and reproducibility - **Solution**: Conda environment named `tw` - **Benefit**: Consistent Python environment across development and production ## ๐Ÿ“š Documentation Resources ### Core Documentation - **[offline_tasks/doc/่ฏฆ็ป†่ฎพ่ฎกๆ–‡ๆกฃ.md](offline_tasks/doc/่ฏฆ็ป†่ฎพ่ฎกๆ–‡ๆกฃ.md)** - Complete system architecture - **[offline_tasks/doc/็ฆป็บฟ็ดขๅผ•ๆ•ฐๆฎ่ง„่Œƒ.md](offline_tasks/doc/็ฆป็บฟ็ดขๅผ•ๆ•ฐๆฎ่ง„่Œƒ.md)** - Data format specifications - **[offline_tasks/doc/Redisๆ•ฐๆฎ่ง„่Œƒ.md](offline_tasks/doc/Redisๆ•ฐๆฎ่ง„่Œƒ.md)** - Redis integration guide - **[offline_tasks/README.md](offline_tasks/README.md)** - Quick start guide ### Algorithm Documentation - **[graphembedding/deepwalk/README.md](graphembedding/deepwalk/README.md)** - DeepWalk with tag enhancements - **[collaboration/README.md](collaboration/README.md)** - C++ Swing algorithm - **[collaboration/Swingๅฟซ้€Ÿๅผ€ๅง‹.md](collaboration/Swingๅฟซ้€Ÿๅผ€ๅง‹.md)** - Swing implementation guide ## ๐Ÿšจ Important Notes for Development 1. **Environment**: Uses Conda environment `tw` - activate before running 2. **Database**: Read-only access to external database 3. **Redis**: Local instance for development, configurable for production 4. **Memory**: Algorithms are memory-intensive - monitor usage 5. **Output**: All files include date stamps for versioning 6. **Testing**: Always test with small datasets before production runs ## ๐Ÿ”— Related Components - **Online Services**: Redis-based recommendation serving - **Elasticsearch**: Vector similarity search capabilities - **Frontend APIs**: Recommendation interfaces for different platforms - **Monitoring**: Performance metrics and error tracking --- **Last Updated**: 2024-12-10 **Maintained by**: Recommendation System Team **Status**: Production-ready with active development
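
### Example: Reading an i2i Output Line

The i2i output format listed under Output Formats (`item_id \t similar_id1:score1,similar_id2:score2,...`) and the `item:similar:swing_cpp:12345` Redis key can be illustrated with a small parser. This is a minimal sketch, not the repository's own loader: the function names `parse_i2i_line`, `top_n`, and `redis_key` are hypothetical, and it assumes the separator is a literal tab and that the key pattern seen for `swing_cpp` generalizes to other algorithm names.

```python
from typing import List, Tuple

Neighbor = Tuple[str, float]


def parse_i2i_line(line: str) -> Tuple[str, List[Neighbor]]:
    """Parse one 'item_id<TAB>id1:score1,id2:score2,...' line (assumed layout)."""
    item_id, _, payload = line.rstrip("\r\n").partition("\t")
    neighbors: List[Neighbor] = []
    for pair in payload.split(","):
        pair = pair.strip()
        if not pair:
            continue  # empty payload or trailing comma
        # The score is taken as the text after the last ':' in the pair.
        similar_id, sep, score = pair.rpartition(":")
        if not sep:
            continue  # skip malformed pairs that carry no score
        neighbors.append((similar_id, float(score)))
    return item_id.strip(), neighbors


def top_n(neighbors: List[Neighbor], n: int) -> List[Neighbor]:
    """Return the n highest-scoring neighbors, best first."""
    return sorted(neighbors, key=lambda pair: pair[1], reverse=True)[:n]


def redis_key(algorithm: str, item_id: str) -> str:
    """Build a key shaped like the documented 'item:similar:swing_cpp:12345'."""
    return f"item:similar:{algorithm}:{item_id}"


item_id, neighbors = parse_i2i_line("12345\t678:0.91,910:0.42,111:0.77\n")
print(redis_key("swing_cpp", item_id), top_n(neighbors, 2))
# prints: item:similar:swing_cpp:12345 [('678', 0.91), ('111', 0.77)]
```

A helper like this can be handy for spot-checking files in `output/` on a small dataset before loading them into Redis.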