
CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a comprehensive recommendation system built for a B2B e-commerce platform. It generates offline recommendation indices, including item-to-item (i2i) similarity and interest aggregation indices, that back the online recommendation services with high-performance algorithms.

Tech Stack: Python 3.x, Pandas, NumPy, NetworkX, Gensim, C++ (Swing algorithm), Redis, Elasticsearch, MySQL

๐Ÿ—๏ธ System Architecture

High-Level Components

recommendation/
├── offline_tasks/         # Main offline processing engine
├── graphembedding/        # Graph-based embedding algorithms
├── refers/                # Reference materials and data
├── config.py              # Global configuration
└── requirements.txt       # Python dependencies

Core Modules

1. Offline Tasks (/offline_tasks/)

Primary Purpose: Generate recommendation indices through various ML algorithms

Key Features:

  • 4 i2i similarity algorithms: Swing (C++ & Python), Session W2V, DeepWalk, Content-based
  • 11-dimension interest aggregation: Platform, client, category, supplier dimensions
  • Automated pipeline: One-command execution with memory monitoring
  • High-performance C++ integration: 10-100x faster Swing implementation

Directory Structure:

offline_tasks/
├── scripts/               # All algorithm implementations
│   ├── fetch_item_attributes.py    # Preprocessing: item metadata
│   ├── generate_session.py         # Preprocessing: user sessions
│   ├── i2i_swing.py                # Swing algorithm (Python)
│   ├── i2i_session_w2v.py          # Session Word2Vec
│   ├── i2i_deepwalk.py             # DeepWalk with tag-based walks
│   ├── i2i_content_similar.py      # Content-based similarity
│   ├── interest_aggregation.py     # Multi-dimensional aggregation
│   └── load_index_to_redis.py      # Redis import
├── collaboration/         # C++ Swing algorithm (high-performance)
│   ├── src/                        # C++ source files
│   ├── run.sh                      # Build and execute script
│   └── output/                     # C++ algorithm outputs
├── config/
│   └── offline_config.py           # Configuration file
├── doc/                          # Comprehensive documentation
├── output/                       # Generated indices
├── logs/                         # Execution logs
├── run.sh                        # Main execution script (⭐ RECOMMENDED)
└── README.md                     # Module documentation

2. Graph Embedding (/graphembedding/)

Purpose: Advanced graph-based embedding algorithms for content-aware recommendations

Components:

  • DeepWalk: Enhanced with tag-based random walks for diversity
  • Session W2V: Session-based word embeddings
  • Improvements: Tag-based walks, Softmax sampling, multi-process support

3. Configuration (/config.py)

Global Settings:

  • Elasticsearch: Host, credentials, index configuration
  • Redis: Cache configuration, timeouts, expiration policies
  • Database: External database connection parameters

🚀 Development Workflow

Quick Start

# 1. Install dependencies
cd /data/tw/recommendation/offline_tasks
bash install.sh

# 2. Test connections
python3 test_connection.py

# 3. Run full pipeline (recommended)
bash run.sh

# 4. Run individual algorithms
python3 scripts/i2i_swing.py --lookback_days 730 --debug
python3 scripts/interest_aggregation.py --lookback_days 730 --top_n 1000

Common Development Commands

Setup and Installation:

# Install Python dependencies
pip install -r requirements.txt

# Build C++ Swing algorithm
cd offline_tasks/collaboration && make

# Activate conda environment (required)
conda activate tw

Testing:

# Test database and Redis connections
python3 offline_tasks/test_connection.py

# Test Elasticsearch connection
python3 offline_tasks/scripts/test_es_connection.py

Build and Compilation:

# Build C++ algorithms
cd offline_tasks/collaboration
make clean && make

# Clean build artifacts
make clean

Running Individual Components:

# Generate session data
python3 offline_tasks/scripts/generate_session.py --lookback_days 730 --debug

# Run C++ Swing algorithm
cd offline_tasks/collaboration && bash run.sh

# Load indices to Redis
python3 offline_tasks/scripts/load_index_to_redis.py --redis-host localhost --redis-port 6379
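
As a sketch of what load_index_to_redis.py likely does (the function and helper names here are hypothetical), index lines can be turned into Redis SET entries carrying the configured 180-day expiry before a real client writes them:

```python
# Hypothetical sketch: convert index lines into (key, value, ttl) entries
# following the documented key pattern item:similar:<source>:<item_id>
# and the 180-day cache_expire_days setting from config.py.
CACHE_EXPIRE_DAYS = 180

def to_redis_entries(lines, source='swing_cpp'):
    entries = []
    for line in lines:
        item_id, payload = line.rstrip('\n').split('\t')
        key = f"item:similar:{source}:{item_id}"
        entries.append((key, payload, CACHE_EXPIRE_DAYS * 86400))
    return entries

entries = to_redis_entries(["12345\t67890:0.91,13579:0.85"])
# A real loader would then do: redis.Redis(host, port).set(key, value, ex=ttl)
```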

Algorithm Execution Order

The system follows this optimized execution pipeline:

  1. Preprocessing Tasks (Run once per session)

    • fetch_item_attributes.py → Item metadata mapping
    • generate_session.py → User behavior sessions
  2. Core Algorithms

    • C++ Swing (collaboration/run.sh) → High-performance similarity
    • Python Swing (i2i_swing.py) → Time-aware similarity
    • Session W2V (i2i_session_w2v.py) → Sequence-based similarity
    • DeepWalk (i2i_deepwalk.py) → Graph-based embeddings
    • Content Similarity (i2i_content_similar.py) → Attribute-based
  3. Post-processing

    • Interest Aggregation (interest_aggregation.py) → Multi-dimensional indices
    • Redis Import (load_index_to_redis.py) → Online serving

Key Configuration Files

Main Configuration (offline_config.py)

# Critical settings
DEFAULT_LOOKBACK_DAYS = 730    # Historical data window
DEFAULT_I2I_TOP_N = 50         # Similar items per product
DEFAULT_INTEREST_TOP_N = 1000   # Aggregated items per dimension

# Algorithm parameters
I2I_CONFIG = {
    'swing': {'alpha': 0.5, 'threshold1': 0.5, 'threshold2': 0.5},
    'session_w2v': {'vector_size': 128, 'window_size': 5},
    'deepwalk': {'num_walks': 10, 'walk_length': 40}
}

# Behavior weights for different user actions
behavior_weights = {
    'purchase': 10.0,
    'contactFactory': 5.0,
    'addToCart': 3.0,
    'addToPool': 2.0
}
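
As an illustration of how these weights might be applied (the event format and function name are hypothetical, not the repo's actual API), a per-(user, item) interaction score can be accumulated by summing behavior weights:

```python
# Hypothetical sketch: sum behavior weights per (user, item) pair.
BEHAVIOR_WEIGHTS = {
    'purchase': 10.0,
    'contactFactory': 5.0,
    'addToCart': 3.0,
    'addToPool': 2.0,
}

def score_interactions(events, weights=BEHAVIOR_WEIGHTS):
    """Aggregate a score per (user_id, item_id) by summing behavior weights."""
    scores = {}
    for user_id, item_id, behavior in events:
        key = (user_id, item_id)
        scores[key] = scores.get(key, 0.0) + weights.get(behavior, 1.0)
    return scores

events = [
    ('u1', 'i1', 'purchase'),
    ('u1', 'i1', 'addToCart'),
    ('u2', 'i1', 'addToPool'),
]
print(score_interactions(events))
# {('u1', 'i1'): 13.0, ('u2', 'i1'): 2.0}
```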

Database Configuration (config.py)

# External database
DB_CONFIG = {
    'host': 'selectdb-cn-wuf3vsokg05-public.selectdbfe.rds.aliyuncs.com',
    'port': '9030',
    'database': 'datacenter',
    'username': 'readonly',
    'password': 'essa1234'
}

# Redis for online serving
REDIS_CONFIG = {
    'host': 'localhost',
    'port': 6379,
    'cache_expire_days': 180
}

🔧 Key Algorithms & Features

1. Swing Algorithm (Dual Implementation)

C++ Version (Production):

  • Performance: 10-100x faster than Python
  • Use Case: Large-scale production processing
  • Output: Raw similarity scores
  • Location: collaboration/

Python Version (Development/Enhanced):

  • Features: Time decay, daily session support
  • Use Case: Development, debugging, parameter tuning
  • Output: Normalized scores with readable names
  • Location: scripts/i2i_swing.py
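
A minimal Python sketch of the core Swing update, omitting the repo's time decay and threshold filtering (alpha = 0.5 as in offline_config.py); each pair of users who both interacted with items i and j contributes 1 / (alpha + overlap of their item sets):

```python
from collections import defaultdict
from itertools import combinations

def swing_similarity(user_items, alpha=0.5):
    """Core Swing: every user pair (u, v) co-interacting with items i and j
    adds 1 / (alpha + |I_u & I_v|) to sim(i, j). Time decay and the repo's
    threshold1/threshold2 pruning are intentionally left out of this sketch."""
    item_users = defaultdict(set)
    for u, items in user_items.items():
        for i in items:
            item_users[i].add(u)
    sims = defaultdict(float)
    for i, j in combinations(sorted(item_users), 2):
        for u, v in combinations(sorted(item_users[i] & item_users[j]), 2):
            overlap = len(user_items[u] & user_items[v])
            sims[(i, j)] += 1.0 / (alpha + overlap)
    return dict(sims)

sims = swing_similarity({'u1': {'a', 'b'}, 'u2': {'a', 'b', 'c'}, 'u3': {'a', 'c'}})
# sims[('a', 'b')] == 0.4: u1 and u2 share items a and b, overlap 2 -> 1 / (0.5 + 2)
```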

2. DeepWalk with Tag Enhancement

Innovative Features:

  • Tag-based walks: 20% probability of content-guided walks
  • Softmax sampling: Temperature-controlled diversity
  • Multi-process: Parallel walk generation
  • Purpose: Solves recommendation homogeneity issues
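
A hedged sketch of the tag-based walk idea (the graph layout, tag format, and function name are illustrative): each step takes a content-guided jump to another item sharing a tag with probability tag_prob, and otherwise follows a co-occurrence edge:

```python
import random

def tag_walk(graph, item_tags, start, walk_length=40, tag_prob=0.2, rng=None):
    """One random walk. With probability tag_prob, jump to another item
    sharing a tag with the current node (content-guided step); otherwise
    follow an ordinary graph edge. Softmax-weighted sampling is omitted."""
    rng = rng or random.Random()
    tag_index = {}
    for item, tags in item_tags.items():
        for t in tags:
            tag_index.setdefault(t, []).append(item)
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        candidates = []
        if rng.random() < tag_prob and item_tags.get(cur):
            tag = rng.choice(sorted(item_tags[cur]))
            candidates = [x for x in tag_index.get(tag, []) if x != cur]
        if not candidates:
            candidates = list(graph.get(cur, []))
        if not candidates:
            break  # dead end: no edges and no same-tag alternatives
        walk.append(rng.choice(candidates))
    return walk

graph = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
tags = {'a': ['shoes'], 'b': ['shoes'], 'c': ['bags']}
walk = tag_walk(graph, tags, 'a', walk_length=10, rng=random.Random(42))
```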

3. Interest Aggregation

Multi-dimensional Support:

  • 7 single dimensions: platform, client_platform, supplier, category_level1-4
  • 4 combined dimensions: platform_client, platform_category2/3, client_category2
  • 4 list types: hot (popular), cart (cart additions), new (recent), global (overall)
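
As an illustrative sketch (the helper name and event format are hypothetical), a single dimension can be aggregated into the documented interest:&lt;type&gt;:&lt;dimension&gt;:&lt;value&gt; key pattern by counting item frequency per dimension value and keeping the top N:

```python
from collections import Counter, defaultdict

def aggregate_interest(item_events, item_dimension, dimension='platform',
                       list_type='hot', top_n=1000):
    """Count item occurrences per dimension value and keep the top_n items,
    keyed in the documented interest:<type>:<dimension>:<value> pattern."""
    counts = defaultdict(Counter)
    for item_id in item_events:
        value = item_dimension.get(item_id)
        if value is not None:
            counts[value][item_id] += 1
    return {
        f"interest:{list_type}:{dimension}:{value}":
            [item for item, _ in counter.most_common(top_n)]
        for value, counter in counts.items()
    }

index = aggregate_interest(
    ['i1', 'i2', 'i1', 'i3'],
    {'i1': 'pc', 'i2': 'pc', 'i3': 'app'},
    top_n=2,
)
# index['interest:hot:platform:pc'] == ['i1', 'i2']
```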

📊 Data Pipeline

Input Data Sources

  • User Behavior: Purchase, contact, cart, pool interactions
  • Item Metadata: Categories, suppliers, attributes
  • Session Data: Time-weighted user behavior sequences

Output Formats

# i2i Similarity (item-to-item)
item_id \t similar_id1:score1,similar_id2:score2,...

# Interest Aggregation  
dimension:value \t item_id1,item_id2,item_id3,...

# Redis Keys
item:similar:swing_cpp:12345
interest:hot:platform:pc
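
A small parser for the i2i line format above (the function name is illustrative):

```python
def parse_i2i_line(line):
    """Parse 'item_id<TAB>id1:score1,id2:score2,...' into structured form."""
    item_id, rest = line.rstrip('\n').split('\t')
    similars = []
    for token in rest.split(','):
        sid, score = token.rsplit(':', 1)
        similars.append((sid, float(score)))
    return item_id, similars

item_id, similars = parse_i2i_line('12345\t67890:0.91,13579:0.85\n')
# -> ('12345', [('67890', 0.91), ('13579', 0.85)])
```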

Storage Architecture

  • Redis: Fast online serving (400MB memory footprint)
  • Elasticsearch: Vector similarity search
  • Local Files: Raw algorithm outputs for debugging

๐Ÿ› Development Guidelines

Adding New Algorithms

  1. Create script in scripts/:

    Import from db_service, config.offline_config, and debug_utils
    Follow the existing pattern: fetch_data → process → save_output
    
  2. Update configuration in offline_config.py:

    NEW_ALGORITHM_CONFIG = {
       'param1': value1,
       'param2': value2
    }
    
  3. Add to execution pipeline in run.sh or run_all.py
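
A hypothetical skeleton for such a script, with the repo modules (db_service, debug_utils) stubbed out so the sketch is self-contained and runnable:

```python
# Hypothetical new-algorithm skeleton following fetch_data -> process -> save_output.
# fetch_data is a stub for the db_service query the real script would run.
import argparse
import os
import tempfile

NEW_ALGORITHM_CONFIG = {'lookback_days': 730, 'top_n': 50}  # mirrors offline_config.py style

def fetch_data(lookback_days):
    """Stub for the db_service query the real script would run."""
    return [('i1', 'i2', 0.9), ('i1', 'i3', 0.7)]

def process(rows, top_n):
    """Group (item, similar, score) rows and keep the top_n per item."""
    index = {}
    for item, similar, score in rows:
        index.setdefault(item, []).append((similar, score))
    return {k: sorted(v, key=lambda p: -p[1])[:top_n] for k, v in index.items()}

def save_output(index, path):
    """Write the documented i2i line format: item \t id:score,id:score."""
    with open(path, 'w') as f:
        for item, sims in index.items():
            f.write(item + '\t' + ','.join(f"{s}:{sc}" for s, sc in sims) + '\n')

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--lookback_days', type=int,
                        default=NEW_ALGORITHM_CONFIG['lookback_days'])
    parser.add_argument('--debug', action='store_true')
    args, _ = parser.parse_known_args()
    index = process(fetch_data(args.lookback_days), NEW_ALGORITHM_CONFIG['top_n'])
    out_path = os.path.join(tempfile.mkdtemp(), 'new_algorithm.txt')
    save_output(index, out_path)
    return out_path

if __name__ == '__main__':
    main()
```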

Debugging Practices

  • Use debug mode: --debug flag for readable outputs
  • Check logs: logs/run_all_YYYYMMDD.log
  • Validate data: debug_utils.py provides data validation
  • Monitor memory: System includes memory monitoring

Performance Optimization

  • Database optimization: Preprocessing reduces queries by 80-90%
  • C++ integration: Critical for production performance
  • Parallel processing: Multi-threaded algorithms
  • Memory management: Configurable thresholds and monitoring

Code Quality

This codebase does not have formal linting or testing frameworks configured. When making changes:

  • Python: Follow PEP 8 style guidelines
  • C++: Use the existing coding style in collaboration/src/
  • No formal unit tests: Test functionality manually using the debug modes
  • Manual testing: Use --debug flags for readable outputs during development

🔄 Maintenance & Operations

Daily Execution

# Recommended production command
0 2 * * * cd /data/tw/recommendation/offline_tasks && bash run.sh

Monitoring

  • Logs: logs/ directory with date-based rotation
  • Memory: Built-in memory monitoring with kill thresholds
  • Output Validation: Automated data quality checks
  • Error Handling: Comprehensive logging and recovery

Backup Strategy

  • Output files: Daily snapshots in output/
  • Configuration: Version-controlled configs
  • Logs: 180-day retention with cleanup

🎯 Key Architecture Decisions

1. Hybrid Algorithm Approach

  • Problem: Python Swing too slow for production (can take hours)
  • Solution: C++ core for performance + Python for flexibility and debugging
  • Benefit: C++ version is 10-100x faster, Python version provides enhanced features and readability

2. Preprocessing Optimization

  • Problem: Repeated database queries across algorithms
  • Solution: Centralized metadata and session generation via fetch_item_attributes.py and generate_session.py
  • Benefit: 80-90% reduction in database load

3. Multi-dimensional Interest Aggregation

  • Problem: Need for flexible recommendation personalization
  • Solution: 11 dimensions with 4 list types each
  • Benefit: Supports diverse business scenarios

4. Tag-enhanced DeepWalk

  • Problem: Recommendation homogeneity
  • Solution: Content-aware random walks
  • Benefit: Improved diversity and serendipity

5. Environment Management

  • Problem: Dependency isolation and reproducibility
  • Solution: Conda environment named tw
  • Benefit: Consistent Python environment across development and production

📚 Documentation Resources

Core Documentation

Algorithm Documentation

🚨 Important Notes for Development

  1. Environment: Uses Conda environment tw - activate before running
  2. Database: Read-only access to external database
  3. Redis: Local instance for development, configurable for production
  4. Memory: Algorithms are memory-intensive - monitor usage
  5. Output: All files include date stamps for versioning
  6. Testing: Always test with small datasets before production runs

Integration Points

  • Online Services: Redis-based recommendation serving
  • Elasticsearch: Vector similarity search capabilities
  • Frontend APIs: Recommendation interfaces for different platforms
  • Monitoring: Performance metrics and error tracking

Last Updated: 2024-12-10
Maintained by: Recommendation System Team
Status: Production-ready with active development