CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This repository contains a recommendation system for a B2B e-commerce platform. It generates offline recommendation indices, including item-to-item (i2i) similarity and interest aggregation indices, that back the online recommendation services.
Tech Stack: Python 3.x, Pandas, NumPy, NetworkX, Gensim, C++ (Swing algorithm), Redis, Elasticsearch, MySQL
System Architecture
High-Level Components
recommendation/
├── offline_tasks/      # Main offline processing engine
├── graphembedding/     # Graph-based embedding algorithms
├── refers/             # Reference materials and data
├── config.py           # Global configuration
└── requirements.txt    # Python dependencies
Core Modules
1. Offline Tasks (/offline_tasks/)
Primary Purpose: Generate recommendation indices through various ML algorithms
Key Features:
- 4 i2i similarity algorithms: Swing (C++ & Python), Session W2V, DeepWalk, Content-based
- 11-dimension interest aggregation: Platform, client, category, supplier dimensions
- Automated pipeline: One-command execution with memory monitoring
- High-performance C++ integration: 10-100x faster Swing implementation
Directory Structure:
offline_tasks/
├── scripts/                      # All algorithm implementations
│   ├── fetch_item_attributes.py  # Preprocessing: item metadata
│   ├── generate_session.py       # Preprocessing: user sessions
│   ├── i2i_swing.py              # Swing algorithm (Python)
│   ├── i2i_session_w2v.py        # Session Word2Vec
│   ├── i2i_deepwalk.py           # DeepWalk with tag-based walks
│   ├── i2i_content_similar.py    # Content-based similarity
│   ├── interest_aggregation.py   # Multi-dimensional aggregation
│   └── load_index_to_redis.py    # Redis import
├── collaboration/                # C++ Swing algorithm (high-performance)
│   ├── src/                      # C++ source files
│   ├── run.sh                    # Build and execute script
│   └── output/                   # C++ algorithm outputs
├── config/
│   └── offline_config.py         # Configuration file
├── doc/                          # Comprehensive documentation
├── output/                       # Generated indices
├── logs/                         # Execution logs
├── run.sh                        # Main execution script (recommended)
└── README.md                     # Module documentation
2. Graph Embedding (/graphembedding/)
Purpose: Advanced graph-based embedding algorithms for content-aware recommendations
Components:
- DeepWalk: Enhanced with tag-based random walks for diversity
- Session W2V: Session-based word embeddings
- Improvements: Tag-based walks, Softmax sampling, multi-process support
3. Configuration (/config.py)
Global Settings:
- Elasticsearch: Host, credentials, index configuration
- Redis: Cache configuration, timeouts, expiration policies
- Database: External database connection parameters
Development Workflow
Quick Start
# 1. Install dependencies
cd /data/tw/recommendation/offline_tasks
bash install.sh
# 2. Test connections
python3 test_connection.py
# 3. Run full pipeline (recommended)
bash run.sh
# 4. Run individual algorithms
python3 scripts/i2i_swing.py --lookback_days 730 --debug
python3 scripts/interest_aggregation.py --lookback_days 730 --top_n 1000
Common Development Commands
Setup and Installation:
# Install Python dependencies
pip install -r requirements.txt
# Build C++ Swing algorithm
cd offline_tasks/collaboration && make
# Activate conda environment (required)
conda activate tw
Testing:
# Test database and Redis connections
python3 offline_tasks/test_connection.py
# Test Elasticsearch connection
python3 offline_tasks/scripts/test_es_connection.py
Build and Compilation:
# Build C++ algorithms
cd offline_tasks/collaboration
make clean && make
# Clean build artifacts
make clean
Running Individual Components:
# Generate session data
python3 offline_tasks/scripts/generate_session.py --lookback_days 730 --debug
# Run C++ Swing algorithm
cd offline_tasks/collaboration && bash run.sh
# Load indices to Redis
python3 offline_tasks/scripts/load_index_to_redis.py --redis-host localhost --redis-port 6379
Algorithm Execution Order
The system follows this optimized execution pipeline:
Preprocessing Tasks (Run once per session)
- fetch_item_attributes.py → Item metadata mapping
- generate_session.py → User behavior sessions
Core Algorithms
- C++ Swing (collaboration/run.sh) → High-performance similarity
- Python Swing (i2i_swing.py) → Time-aware similarity
- Session W2V (i2i_session_w2v.py) → Sequence-based similarity
- DeepWalk (i2i_deepwalk.py) → Graph-based embeddings
- Content Similarity (i2i_content_similar.py) → Attribute-based
Post-processing
- Interest Aggregation (interest_aggregation.py) → Multi-dimensional indices
- Redis Import (load_index_to_redis.py) → Online serving
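The execution order above can be sketched as a small driver. This is a minimal sketch and not the contents of run.sh: the PIPELINE list, build_commands, and run_pipeline are hypothetical names introduced for illustration, and only the script paths come from this document. It also assumes each step can be launched from offline_tasks/ without extra arguments, which may not hold for collaboration/run.sh.

```python
import subprocess

# Stages and scripts in the order listed above; paths are relative to offline_tasks/.
PIPELINE = [
    ("preprocess", ["scripts/fetch_item_attributes.py",
                    "scripts/generate_session.py"]),
    ("core", ["collaboration/run.sh",
              "scripts/i2i_swing.py",
              "scripts/i2i_session_w2v.py",
              "scripts/i2i_deepwalk.py",
              "scripts/i2i_content_similar.py"]),
    ("postprocess", ["scripts/interest_aggregation.py",
                     "scripts/load_index_to_redis.py"]),
]


def build_commands(pipeline=PIPELINE):
    """Return (stage, argv) pairs in execution order."""
    commands = []
    for stage, targets in pipeline:
        for target in targets:
            runner = "bash" if target.endswith(".sh") else "python3"
            commands.append((stage, [runner, target]))
    return commands


def run_pipeline(dry_run=True):
    """Print each step; when dry_run is False, execute it and stop on the first failure."""
    for stage, argv in build_commands():
        print(f"[{stage}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, check=True)  # raises CalledProcessError on non-zero exit
```

Calling run_pipeline() with the default dry_run=True only prints the planned commands, which is a cheap way to confirm the ordering before a long run.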
Key Configuration Files
Main Configuration (offline_config.py)
# Critical settings
DEFAULT_LOOKBACK_DAYS = 730 # Historical data window
DEFAULT_I2I_TOP_N = 50 # Similar items per product
DEFAULT_INTEREST_TOP_N = 1000 # Aggregated items per dimension
# Algorithm parameters
I2I_CONFIG = {
'swing': {'alpha': 0.5, 'threshold1': 0.5, 'threshold2': 0.5},
'session_w2v': {'vector_size': 128, 'window_size': 5},
'deepwalk': {'num_walks': 10, 'walk_length': 40}
}
# Behavior weights for different user actions
behavior_weights = {
'purchase': 10.0,
'contactFactory': 5.0,
'addToCart': 3.0,
'addToPool': 2.0
}
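As an illustration of how the behavior weights above might be used, the sketch below sums weights per user and item. The score_interactions helper and the (user_id, item_id, behavior) event layout are assumptions made for this example; the repository's actual schema and scoring may differ.

```python
from collections import defaultdict

# Weights copied from the configuration above.
behavior_weights = {
    'purchase': 10.0,
    'contactFactory': 5.0,
    'addToCart': 3.0,
    'addToPool': 2.0,
}


def score_interactions(events, weights=behavior_weights, default=0.0):
    """Sum behavior weights per (user_id, item_id) pair.

    `events` is an iterable of (user_id, item_id, behavior) tuples
    (a hypothetical layout, not the repository's schema).
    """
    scores = defaultdict(float)
    for user_id, item_id, behavior in events:
        scores[(user_id, item_id)] += weights.get(behavior, default)
    return dict(scores)


events = [
    ("u1", "i1", "addToCart"),
    ("u1", "i1", "purchase"),
    ("u2", "i1", "addToPool"),
    ("u2", "i2", "view"),  # behavior without a weight contributes the default
]
print(score_interactions(events))
```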
Database Configuration (config.py)
# External database
DB_CONFIG = {
'host': 'selectdb-cn-wuf3vsokg05-public.selectdbfe.rds.aliyuncs.com',
'port': '9030',
'database': 'datacenter',
'username': 'readonly',
'password': 'essa1234'
}
# Redis for online serving
REDIS_CONFIG = {
'host': 'localhost',
'port': 6379,
'cache_expire_days': 180
}
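The cache_expire_days setting implies a per-key TTL. The sketch below shows one way to derive it and pass it to a client whose set(name, value, ex=...) call has the same shape as redis-py's Redis.set. InMemoryClient and store_with_expiry are hypothetical stand-ins so the example runs without a Redis server, and the stored value is illustrative; load_index_to_redis.py may serialize values differently.

```python
from datetime import timedelta

# Copied from the configuration above.
REDIS_CONFIG = {'host': 'localhost', 'port': 6379, 'cache_expire_days': 180}


class InMemoryClient:
    """Stand-in with the same set(name, value, ex=...) call shape as redis-py."""

    def __init__(self):
        self.data = {}

    def set(self, name, value, ex=None):
        self.data[name] = (value, ex)
        return True


def store_with_expiry(client, entries, config=REDIS_CONFIG):
    """Write key/value pairs with a TTL derived from cache_expire_days."""
    ttl_seconds = int(timedelta(days=config['cache_expire_days']).total_seconds())
    for key, value in entries.items():
        client.set(key, value, ex=ttl_seconds)
    return ttl_seconds


client = InMemoryClient()
ttl = store_with_expiry(client, {"item:similar:swing_cpp:12345": "678:0.91,910:0.85"})
print(ttl)  # 180 days expressed in seconds
```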
Key Algorithms & Features
1. Swing Algorithm (Dual Implementation)
C++ Version (Production):
- Performance: 10-100x faster than Python
- Use Case: Large-scale production processing
- Output: Raw similarity scores
- Location: collaboration/
Python Version (Development/Enhanced):
- Features: Time decay, daily session support
- Use Case: Development, debugging, parameter tuning
- Output: Normalized scores with readable names
- Location: scripts/i2i_swing.py
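For orientation, the basic Swing score can be sketched in a few lines. This follows the commonly published formulation, in which each pair of users who share items i and j contributes 1 / (alpha + number of items the two users share). It is not taken from either implementation in this repository: time decay, the threshold1/threshold2 parameters, and normalization are left out, and swing_similarity is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def swing_similarity(user_items, alpha=0.5):
    """Basic Swing scores from a {user: set_of_items} mapping.

    For every user pair (u, v), each item pair inside their shared items
    gains 1 / (alpha + |I_u & I_v|). Scores are stored symmetrically.
    """
    sim = defaultdict(float)
    for u, v in combinations(sorted(user_items), 2):
        common = user_items[u] & user_items[v]
        if len(common) < 2:
            continue  # a single shared item forms no item pair
        weight = 1.0 / (alpha + len(common))
        for i, j in combinations(sorted(common), 2):
            sim[(i, j)] += weight
            sim[(j, i)] += weight
    return dict(sim)


user_items = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "c"},
}
scores = swing_similarity(user_items, alpha=0.5)
print(scores[("a", "b")], scores[("b", "c")])
```

The nested loop over user pairs is quadratic in the number of users, which is consistent with the document's point that a pure Python version is slow at production scale.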
2. DeepWalk with Tag Enhancement
Innovative Features:
- Tag-based walks: 20% probability of content-guided walks
- Softmax sampling: Temperature-controlled diversity
- Multi-process: Parallel walk generation
- Purpose: Solves recommendation homogeneity issues
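The features above can be illustrated with a single-process walk generator. This is a sketch under stated assumptions, not the code in graphembedding/: the graph is assumed to be a dict of {node: {neighbor: weight}}, item_tags and tag_items are hypothetical lookup tables, and the way the tag step picks its target (a uniform choice among items sharing a randomly chosen tag) is a guess. Only the 20% tag-walk probability and the walk length of 40 come from this document.

```python
import math
import random


def softmax_choice(candidates, weights, temperature, rng):
    """Sample one candidate with probability proportional to exp(w / T)."""
    peak = max(weights)  # subtract the max for numerical stability
    exps = [math.exp((w - peak) / temperature) for w in weights]
    return rng.choices(candidates, weights=exps, k=1)[0]


def tag_aware_walk(start, graph, item_tags, tag_items, walk_length=40,
                   p_tag_walk=0.2, temperature=1.0, rng=None):
    """Generate one random walk that occasionally jumps via a shared tag."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        tags = item_tags.get(current, [])
        if tags and rng.random() < p_tag_walk:
            # Content-guided step: jump to another item carrying the same tag.
            pool = [i for i in tag_items[rng.choice(tags)] if i != current]
            if pool:
                walk.append(rng.choice(pool))
                continue
        neighbors = graph.get(current, {})
        if not neighbors:
            break  # dead end: no graph neighbors to continue the walk
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(softmax_choice(nodes, weights, temperature, rng))
    return walk


graph = {"a": {"b": 1.0, "c": 2.0}, "b": {"a": 1.0}, "c": {"a": 2.0}, "d": {}}
item_tags = {"a": ["t1"], "d": ["t1"]}
tag_items = {"t1": ["a", "d"]}
print(tag_aware_walk("a", graph, item_tags, tag_items, walk_length=10,
                     rng=random.Random(0)))
```

A lower temperature concentrates the walk on heavy edges, while a higher one flattens the distribution and visits weaker neighbors more often.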
3. Interest Aggregation
Multi-dimensional Support:
- 7 single dimensions: platform, client_platform, supplier, category_level1-4
- 4 combined dimensions: platform_client, platform_category2/3, client_category2
- 3 list types per dimension: hot (popular), cart (cart additions), new (recent), plus a global (overall) list
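A rough sketch of the aggregation idea: group scored items by a dimension value and keep the top N per group. The record layout, the aggregate_top_items helper, and the key naming for combined dimensions are assumptions for illustration; interest_aggregation.py may build its keys and scores differently.

```python
from collections import defaultdict


def aggregate_top_items(records, dimensions, top_n=1000):
    """Rank items per `dimension:value` key by summed score.

    `records` are dicts with an item_id, a score, and one field per dimension
    (hypothetical layout). A combined dimension is given as a tuple of fields.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for rec in records:
        for dim in dimensions:
            fields = dim if isinstance(dim, tuple) else (dim,)
            values = [rec.get(f) for f in fields]
            if any(v is None for v in values):
                continue  # record lacks one of the fields for this dimension
            name = "_".join(fields)
            value = "_".join(str(v) for v in values)
            totals[f"{name}:{value}"][rec["item_id"]] += rec["score"]
    return {
        key: [item for item, _ in
              sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]]
        for key, scores in totals.items()
    }


records = [
    {"item_id": "i1", "score": 10.0, "platform": "pc", "category_level2": "c2a"},
    {"item_id": "i2", "score": 3.0, "platform": "pc", "category_level2": "c2a"},
    {"item_id": "i2", "score": 9.0, "platform": "pc", "category_level2": "c2a"},
    {"item_id": "i3", "score": 5.0, "platform": "app"},
]
index = aggregate_top_items(records, ["platform", ("platform", "category_level2")])
print(index)
```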
Data Pipeline
Input Data Sources
- User Behavior: Purchase, contact, cart, pool interactions
- Item Metadata: Categories, suppliers, attributes
- Session Data: Time-weighted user behavior sequences
Output Formats
# i2i Similarity (item-to-item)
item_id \t similar_id1:score1,similar_id2:score2,...
# Interest Aggregation
dimension:value \t item_id1,item_id2,item_id3,...
# Redis Keys
item:similar:swing_cpp:12345
interest:hot:platform:pc
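A small parser makes the i2i line format concrete. This is a sketch that assumes the fields are separated by a single tab and that scores parse as floats; parse_i2i_line and redis_key are hypothetical helpers, and debug-mode outputs with readable names may not follow this exact layout.

```python
def parse_i2i_line(line):
    """Parse 'item_id<TAB>id1:score1,id2:score2,...' into (item_id, [(id, score), ...])."""
    item_id, _, payload = line.rstrip("\n").partition("\t")
    similar = []
    for pair in payload.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty payloads and trailing commas
        sim_id, _, score = pair.rpartition(":")
        similar.append((sim_id, float(score)))
    return item_id.strip(), similar


def redis_key(algorithm, item_id):
    """Build a key shaped like the example 'item:similar:swing_cpp:12345'."""
    return f"item:similar:{algorithm}:{item_id}"


item_id, similar = parse_i2i_line("12345\t678:0.91,910:0.85\n")
print(redis_key("swing_cpp", item_id), similar)
```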
Storage Architecture
- Redis: Fast online serving (400MB memory footprint)
- Elasticsearch: Vector similarity search
- Local Files: Raw algorithm outputs for debugging
Development Guidelines
Adding New Algorithms
- Create script in scripts/: import from db_service, config.offline_config, debug_utils. Follow existing pattern: fetch_data → process → save_output
- Update configuration in offline_config.py: NEW_ALGORITHM_CONFIG = { 'param1': value1, 'param2': value2 }
- Add to execution pipeline in run.sh or run_all.py
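A skeleton for a new script following the fetch_data → process → save_output pattern might look like the sketch below. Everything in it is a placeholder: the real scripts import from db_service, config.offline_config, and debug_utils, whose interfaces are not shown in this document, so this version stubs the data access and uses made-up config values, a made-up --output flag, and a made-up default output path.

```python
import argparse
import os

# Placeholder values; a real script would read these from offline_config.py.
NEW_ALGORITHM_CONFIG = {'param1': 1, 'param2': 2}


def fetch_data(lookback_days):
    """Stub: a real script would query the database via db_service."""
    return [("u1", "i1"), ("u1", "i2"), ("u2", "i1")]


def process(rows, config):
    """Stub algorithm: count interactions per item."""
    counts = {}
    for _, item_id in rows:
        counts[item_id] = counts.get(item_id, 0) + 1
    return counts


def save_output(result, path):
    """Write one 'item_id<TAB>value' line per item, creating the directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        for item_id, value in sorted(result.items()):
            fh.write(f"{item_id}\t{value}\n")


def main(argv=None):
    parser = argparse.ArgumentParser(description="New algorithm (skeleton)")
    parser.add_argument("--lookback_days", type=int, default=730)
    parser.add_argument("--output", default="output/new_algorithm.txt")
    parser.add_argument("--debug", action="store_true")
    args = parser.parse_args(argv)

    rows = fetch_data(args.lookback_days)
    result = process(rows, NEW_ALGORITHM_CONFIG)
    if args.debug:
        print(result)  # readable dump for manual inspection
    save_output(result, args.output)
    return result
```

In a real script, main() would be called under the usual if __name__ == "__main__" guard; for a quick check it can be invoked directly, for example main(["--lookback_days", "30", "--debug"]).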
Debugging Practices
- Use debug mode: --debug flag for readable outputs
- Check logs: logs/run_all_YYYYMMDD.log
- Validate data: debug_utils.py provides data validation
- Monitor memory: System includes memory monitoring
Performance Optimization
- Database optimization: Preprocessing reduces queries by 80-90%
- C++ integration: Critical for production performance
- Parallel processing: Multi-threaded algorithms
- Memory management: Configurable thresholds and monitoring
Code Quality
This codebase does not have formal linting or testing frameworks configured. When making changes:
- Python: Follow PEP 8 style guidelines
- C++: Use the existing coding style in collaboration/src/
- No formal unit tests: Test functionality manually using the debug modes
- Manual testing: Use --debug flags for readable outputs during development
Maintenance & Operations
Daily Execution
# Recommended production command
0 2 * * * cd /data/tw/recommendation/offline_tasks && bash run.sh
Monitoring
- Logs: logs/ directory with date-based rotation
- Memory: Built-in memory monitoring with kill thresholds
- Output Validation: Automated data quality checks
- Error Handling: Comprehensive logging and recovery
Backup Strategy
- Output files: Daily snapshots in output/
- Configuration: Version-controlled configs
- Logs: 180-day retention with cleanup
Key Architecture Decisions
1. Hybrid Algorithm Approach
- Problem: Python Swing too slow for production (can take hours)
- Solution: C++ core for performance + Python for flexibility and debugging
- Benefit: C++ version is 10-100x faster, Python version provides enhanced features and readability
2. Preprocessing Optimization
- Problem: Repeated database queries across algorithms
- Solution: Centralized metadata and session generation via fetch_item_attributes.py and generate_session.py
- Benefit: 80-90% reduction in database load
3. Multi-dimensional Interest Aggregation
- Problem: Need for flexible recommendation personalization
- Solution: 11 dimensions with 3 list types each
- Benefit: Supports diverse business scenarios
4. Tag-enhanced DeepWalk
- Problem: Recommendation homogeneity
- Solution: Content-aware random walks
- Benefit: Improved diversity and serendipity
5. Environment Management
- Problem: Dependency isolation and reproducibility
- Solution: Conda environment named tw
- Benefit: Consistent Python environment across development and production
Documentation Resources
Core Documentation
- offline_tasks/doc/่ฏฆ็ป่ฎพ่ฎกๆๆกฃ.md - Complete system architecture
- offline_tasks/doc/็ฆป็บฟ็ดขๅผๆฐๆฎ่ง่.md - Data format specifications
- offline_tasks/doc/Redisๆฐๆฎ่ง่.md - Redis integration guide
- offline_tasks/README.md - Quick start guide
Algorithm Documentation
- graphembedding/deepwalk/README.md - DeepWalk with tag enhancements
- collaboration/README.md - C++ Swing algorithm
- collaboration/Swingๅฟซ้ๅผๅง.md - Swing implementation guide
Important Notes for Development
- Environment: Uses Conda environment tw; activate before running
- Database: Read-only access to external database
- Redis: Local instance for development, configurable for production
- Memory: Algorithms are memory-intensive - monitor usage
- Output: All files include date stamps for versioning
- Testing: Always test with small datasets before production runs
Related Components
- Online Services: Redis-based recommendation serving
- Elasticsearch: Vector similarity search capabilities
- Frontend APIs: Recommendation interfaces for different platforms
- Monitoring: Performance metrics and error tracking
Last Updated: 2024-12-10
Maintained by: Recommendation System Team
Status: Production-ready with active development