Commit 214eaaa6bd107b5cd170915bd65065ec70a1e009

Authored by tangwang
1 parent acd9b679

docs

Showing 2 changed files with 437 additions and 0 deletions
CLAUDE.md 0 → 100644
... ... @@ -0,0 +1,383 @@
  1 +# CLAUDE.md
  2 +
  3 +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
  4 +
  5 +## Project Overview
  6 +
  7 +This is a **Recommendation System** built for a B2B e-commerce platform. It generates offline recommendation indices, namely item-to-item similarity (i2i) indices and interest aggregation indices, which support the online recommendation services.
  8 +
  9 +**Tech Stack**: Python 3.x, Pandas, NumPy, NetworkX, Gensim, C++ (Swing algorithm), Redis, Elasticsearch, MySQL
  10 +
  11 +## 🏗️ System Architecture
  12 +
  13 +### High-Level Components
  14 +
  15 +```
  16 +recommendation/
  17 +├── offline_tasks/ # Main offline processing engine
  18 +├── graphembedding/ # Graph-based embedding algorithms
  19 +├── refers/ # Reference materials and data
  20 +├── config.py # Global configuration
  21 +└── requirements.txt # Python dependencies
  22 +```
  23 +
  24 +### Core Modules
  25 +
  26 +#### 1. Offline Tasks (`/offline_tasks/`)
  27 +
  28 +**Primary Purpose**: Generate recommendation indices through various ML algorithms
  29 +
  30 +**Key Features**:
  31 +- **4 i2i similarity algorithms**: Swing (C++ & Python), Session W2V, DeepWalk, Content-based
  32 +- **11-dimension interest aggregation**: Platform, client, category, supplier dimensions
  33 +- **Automated pipeline**: One-command execution with memory monitoring
  34 +- **High-performance C++ integration**: 10-100x faster Swing implementation
  35 +
  36 +**Directory Structure**:
  37 +```
  38 +offline_tasks/
  39 +├── scripts/ # All algorithm implementations
  40 +│ ├── fetch_item_attributes.py # Preprocessing: item metadata
  41 +│ ├── generate_session.py # Preprocessing: user sessions
  42 +│ ├── i2i_swing.py # Swing algorithm (Python)
  43 +│ ├── i2i_session_w2v.py # Session Word2Vec
  44 +│ ├── i2i_deepwalk.py # DeepWalk with tag-based walks
  45 +│ ├── i2i_content_similar.py # Content-based similarity
  46 +│ ├── interest_aggregation.py # Multi-dimensional aggregation
  47 +│ └── load_index_to_redis.py # Redis import
  48 +├── collaboration/ # C++ Swing algorithm (high-performance)
  49 +│ ├── src/ # C++ source files
  50 +│ ├── run.sh # Build and execute script
  51 +│ └── output/ # C++ algorithm outputs
  52 +├── config/
  53 +│ └── offline_config.py # Configuration file
  54 +├── doc/ # Comprehensive documentation
  55 +├── output/ # Generated indices
  56 +├── logs/ # Execution logs
  57 +├── run.sh # Main execution script (⭐ RECOMMENDED)
  58 +└── README.md # Module documentation
  59 +```
  60 +
  61 +#### 2. Graph Embedding (`/graphembedding/`)
  62 +
  63 +**Purpose**: Advanced graph-based embedding algorithms for content-aware recommendations
  64 +
  65 +**Components**:
  66 +- **DeepWalk**: Enhanced with tag-based random walks for diversity
  67 +- **Session W2V**: Session-based word embeddings
  68 +- **Improvements**: Tag-based walks, Softmax sampling, multi-process support
  69 +
  70 +#### 3. Configuration (`/config.py`)
  71 +
  72 +**Global Settings**:
  73 +- **Elasticsearch**: Host, credentials, index configuration
  74 +- **Redis**: Cache configuration, timeouts, expiration policies
  75 +- **Database**: External database connection parameters
  76 +
  77 +## 🚀 Development Workflow
  78 +
  79 +### Quick Start
  80 +
  81 +```bash
  82 +# 1. Install dependencies
  83 +cd /data/tw/recommendation/offline_tasks
  84 +bash install.sh
  85 +
  86 +# 2. Test connections
  87 +python3 test_connection.py
  88 +
  89 +# 3. Run full pipeline (recommended)
  90 +bash run.sh
  91 +
  92 +# 4. Run individual algorithms
  93 +python3 scripts/i2i_swing.py --lookback_days 730 --debug
  94 +python3 scripts/interest_aggregation.py --lookback_days 730 --top_n 1000
  95 +```
  96 +
  97 +### Common Development Commands
  98 +
  99 +**Setup and Installation:**
  100 +```bash
  101 +# Install Python dependencies
  102 +pip install -r requirements.txt
  103 +
  104 +# Build C++ Swing algorithm
  105 +cd offline_tasks/collaboration && make
  106 +
  107 +# Activate conda environment (required)
  108 +conda activate tw
  109 +```
  110 +
  111 +**Testing:**
  112 +```bash
  113 +# Test database and Redis connections
  114 +python3 offline_tasks/test_connection.py
  115 +
  116 +# Test Elasticsearch connection
  117 +python3 offline_tasks/scripts/test_es_connection.py
  118 +```
  119 +
  120 +**Build and Compilation:**
  121 +```bash
  122 +# Build C++ algorithms
  123 +cd offline_tasks/collaboration
  124 +make clean && make
  125 +
  126 +# Clean build artifacts
  127 +make clean
  128 +```
  129 +
  130 +**Running Individual Components:**
  131 +```bash
  132 +# Generate session data
  133 +python3 offline_tasks/scripts/generate_session.py --lookback_days 730 --debug
  134 +
  135 +# Run C++ Swing algorithm
  136 +cd offline_tasks/collaboration && bash run.sh
  137 +
  138 +# Load indices to Redis
  139 +python3 offline_tasks/scripts/load_index_to_redis.py --redis-host localhost --redis-port 6379
  140 +```
  141 +
  142 +### Algorithm Execution Order
  143 +
  144 +The system follows this optimized execution pipeline:
  145 +
  146 +1. **Preprocessing Tasks** (run once per pipeline run, before the algorithms)
  147 + - `fetch_item_attributes.py` → Item metadata mapping
  148 + - `generate_session.py` → User behavior sessions
  149 +
  150 +2. **Core Algorithms**
  151 + - C++ Swing (`collaboration/run.sh`) → High-performance similarity
  152 + - Python Swing (`i2i_swing.py`) → Time-aware similarity
  153 + - Session W2V (`i2i_session_w2v.py`) → Sequence-based similarity
  154 + - DeepWalk (`i2i_deepwalk.py`) → Graph-based embeddings
  155 + - Content Similarity (`i2i_content_similar.py`) → Attribute-based
  156 +
  157 +3. **Post-processing**
  158 + - Interest Aggregation (`interest_aggregation.py`) → Multi-dimensional indices
  159 + - Redis Import (`load_index_to_redis.py`) → Online serving
  160 +
  161 +### Key Configuration Files
  162 +
  163 +#### Main Configuration (`offline_config.py`)
  164 +```python
  165 +# Critical settings
  166 +DEFAULT_LOOKBACK_DAYS = 730 # Historical data window
  167 +DEFAULT_I2I_TOP_N = 50 # Similar items per product
  168 +DEFAULT_INTEREST_TOP_N = 1000 # Aggregated items per dimension
  169 +
  170 +# Algorithm parameters
  171 +I2I_CONFIG = {
  172 + 'swing': {'alpha': 0.5, 'threshold1': 0.5, 'threshold2': 0.5},
  173 + 'session_w2v': {'vector_size': 128, 'window_size': 5},
  174 + 'deepwalk': {'num_walks': 10, 'walk_length': 40}
  175 +}
  176 +
  177 +# Behavior weights for different user actions
  178 +behavior_weights = {
  179 + 'purchase': 10.0,
  180 + 'contactFactory': 5.0,
  181 + 'addToCart': 3.0,
  182 + 'addToPool': 2.0
  183 +}
  184 +```
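As a rough illustration of how such weights might be applied, the sketch below sums the weight of every event per item. The `score_items` helper and the `(item_id, behavior)` event shape are assumptions for this example; the real scripts may also apply time decay or normalization.

```python
from collections import defaultdict

# Weights copied from the behavior_weights block above.
BEHAVIOR_WEIGHTS = {
    "purchase": 10.0,
    "contactFactory": 5.0,
    "addToCart": 3.0,
    "addToPool": 2.0,
}


def score_items(events, weights=BEHAVIOR_WEIGHTS):
    """Sum behavior weights per item; behaviors without a weight contribute 0."""
    scores = defaultdict(float)
    for item_id, behavior in events:
        scores[item_id] += weights.get(behavior, 0.0)
    return dict(scores)


# Hypothetical events: (item_id, behavior)
events = [
    ("sku1", "purchase"),
    ("sku1", "addToCart"),
    ("sku2", "addToPool"),
    ("sku2", "click"),  # no weight configured, so it is ignored
]
scores = score_items(events)
```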
  185 +
  186 +#### Database Configuration (`config.py`)
  187 +```python
  188 +# External database
  189 +DB_CONFIG = {
  190 + 'host': 'selectdb-cn-wuf3vsokg05-public.selectdbfe.rds.aliyuncs.com',
  191 + 'port': '9030',
  192 + 'database': 'datacenter',
  193 + 'username': 'readonly',
  194 + 'password': 'essa1234'
  195 +}
  196 +
  197 +# Redis for online serving
  198 +REDIS_CONFIG = {
  199 + 'host': 'localhost',
  200 + 'port': 6379,
  201 + 'cache_expire_days': 180
  202 +}
  203 +```
  204 +
  205 +## 🔧 Key Algorithms & Features
  206 +
  207 +### 1. Swing Algorithm (Dual Implementation)
  208 +
  209 +**C++ Version** (Production):
  210 +- **Performance**: 10-100x faster than Python
  211 +- **Use Case**: Large-scale production processing
  212 +- **Output**: Raw similarity scores
  213 +- **Location**: `collaboration/`
  214 +
  215 +**Python Version** (Development/Enhanced):
  216 +- **Features**: Time decay, daily session support
  217 +- **Use Case**: Development, debugging, parameter tuning
  218 +- **Output**: Normalized scores with readable names
  219 +- **Location**: `scripts/i2i_swing.py`
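For orientation, here is a minimal sketch of the basic Swing score: for each item pair, it sums `1 / (alpha + overlap)` over every pair of users who interacted with both items, where `overlap` is the number of items the two users share. This is the commonly cited basic form, not the repository's implementation; the time decay, session handling, and thresholds mentioned above are left out, and `swing_similarity` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def swing_similarity(user_items, alpha=0.5):
    """Return {(i, j): score} for item pairs with i < j.

    user_items maps each user to the set of items they interacted with.
    """
    scores = defaultdict(float)
    for u, v in combinations(sorted(user_items), 2):
        common = user_items[u] & user_items[v]
        if len(common) < 2:
            continue  # this user pair cannot support any item pair
        weight = 1.0 / (alpha + len(common))
        for i, j in combinations(sorted(common), 2):
            scores[(i, j)] += weight
    return dict(scores)


# Toy data, made up for the example.
user_items = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
}
sims = swing_similarity(user_items, alpha=0.5)
```

A user pair with a large overlap contributes less per item pair, which is how Swing discounts users who co-click many of the same items. Iterating over user pairs is quadratic in the number of users, which helps explain why a compiled implementation matters at production scale.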
  220 +
  221 +### 2. DeepWalk with Tag Enhancement
  222 +
  223 +**Innovative Features**:
  224 +- **Tag-based walks**: 20% probability of content-guided walks
  225 +- **Softmax sampling**: Temperature-controlled diversity
  226 +- **Multi-process**: Parallel walk generation
  227 +- **Purpose**: Solves recommendation homogeneity issues
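A minimal sketch of one such walk, under stated assumptions: with probability `p_tag` the walker jumps to an item that shares a tag with the current item, otherwise it picks a graph neighbor by temperature-scaled softmax over edge weights. The function names, the dict-based graph layout, and the exact jump rule are hypothetical; the code in `graphembedding/` may differ.

```python
import math
import random


def softmax_choice(candidates, weights, temperature, rng):
    """Sample one candidate with probability proportional to exp(weight / temperature)."""
    top = max(weights)  # subtract the max for numerical stability
    exps = [math.exp((w - top) / temperature) for w in weights]
    return rng.choices(candidates, weights=exps, k=1)[0]


def tag_walk(graph, item_tags, tag_items, start, walk_length,
             p_tag=0.2, temperature=1.0, rng=None):
    """Generate one random walk of at most walk_length nodes."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        tags = item_tags.get(current, [])
        if tags and rng.random() < p_tag:
            # Content-guided step: jump to an item that shares a tag.
            tag = rng.choice(tags)
            walk.append(rng.choice(tag_items[tag]))
            continue
        neighbors = graph.get(current, {})
        if not neighbors:
            break  # dead end
        nodes = list(neighbors)
        weights = [neighbors[n] for n in nodes]
        walk.append(softmax_choice(nodes, weights, temperature, rng))
    return walk


# Toy graph: node -> {neighbor: edge weight}
graph = {"a": {"b": 1.0, "c": 2.0}, "b": {"a": 1.0}, "c": {"a": 2.0}}
item_tags = {"a": ["t1"], "c": ["t1"]}
tag_items = {"t1": ["a", "c"]}
walk = tag_walk(graph, item_tags, tag_items, "a", walk_length=10, rng=random.Random(0))
```

A higher `temperature` flattens the neighbor distribution (more diversity); a lower one concentrates the walk on the heaviest edges.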
  228 +
  229 +### 3. Interest Aggregation
  230 +
  231 +**Multi-dimensional Support**:
  232 +- **7 single dimensions**: platform, client_platform, supplier, category_level1-4
  233 +- **4 combined dimensions**: platform_client, platform_category2/3, client_category2
  234 +- **List types**: hot (popular), cart (cart additions), and new (recent), plus a global (overall) list
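Spelled out, the 7 single and 4 combined dimensions add up to the 11 mentioned earlier. The sketch below enumerates them and builds a key in the `interest:hot:platform:pc` style shown under Output Formats; the constant names and the `interest_key` helper are hypothetical, and treating hot/cart/new as per-dimension lists with a separate global list is an assumption.

```python
SINGLE_DIMENSIONS = [
    "platform", "client_platform", "supplier",
    "category_level1", "category_level2", "category_level3", "category_level4",
]
COMBINED_DIMENSIONS = [
    "platform_client", "platform_category2", "platform_category3", "client_category2",
]
# Assumed to be produced per dimension; the global list is assumed to stand alone.
PER_DIMENSION_LIST_TYPES = ["hot", "cart", "new"]


def interest_key(list_type, dimension, value):
    """Build a key such as interest:hot:platform:pc."""
    return f"interest:{list_type}:{dimension}:{value}"


all_dimensions = SINGLE_DIMENSIONS + COMBINED_DIMENSIONS
example_key = interest_key("hot", "platform", "pc")
```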
  235 +
  236 +## 📊 Data Pipeline
  237 +
  238 +### Input Data Sources
  239 +- **User Behavior**: Purchase, contact, cart, pool interactions
  240 +- **Item Metadata**: Categories, suppliers, attributes
  241 +- **Session Data**: Time-weighted user behavior sequences
  242 +
  243 +### Output Formats
  244 +```
  245 +# i2i Similarity (item-to-item)
  246 +item_id \t similar_id1:score1,similar_id2:score2,...
  247 +
  248 +# Interest Aggregation
  249 +dimension:value \t item_id1,item_id2,item_id3,...
  250 +
  251 +# Redis Keys
  252 +item:similar:swing_cpp:12345
  253 +interest:hot:platform:pc
  254 +```
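To make the formats concrete, the sketch below parses one i2i line and turns it into a Redis key/value pair. It assumes the separator is a real tab character and that the Redis value reuses the same `id:score` string form; both helper names are hypothetical, and `load_index_to_redis.py` may serialize values differently.

```python
def parse_i2i_line(line):
    """Split 'item_id<TAB>id1:score1,id2:score2' into (item_id, [(id, score), ...])."""
    item_id, _, rest = line.rstrip("\n").partition("\t")
    similar = []
    for pair in filter(None, rest.split(",")):
        sim_id, _, score = pair.rpartition(":")
        similar.append((sim_id, float(score)))
    return item_id, similar


def to_redis_entry(algorithm, item_id, similar, top_n=50):
    """Build the key and a comma-joined value holding the top_n highest scores."""
    key = f"item:similar:{algorithm}:{item_id}"
    top = sorted(similar, key=lambda pair: pair[1], reverse=True)[:top_n]
    value = ",".join(f"{sim_id}:{score}" for sim_id, score in top)
    return key, value


item_id, similar = parse_i2i_line("12345\t111:0.9,222:0.5\n")
key, value = to_redis_entry("swing_cpp", item_id, similar, top_n=1)
```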
  255 +
  256 +### Storage Architecture
  257 +- **Redis**: Fast online serving (400MB memory footprint)
  258 +- **Elasticsearch**: Vector similarity search
  259 +- **Local Files**: Raw algorithm outputs for debugging
  260 +
  261 +## 🐛 Development Guidelines
  262 +
  263 +### Adding New Algorithms
  264 +
  265 +1. **Create script in `scripts/`**:
  266 + ```python
  267 + # Import the shared helpers from db_service, config.offline_config, and debug_utils
  268 + # Then follow the existing pattern: fetch_data → process → save_output
  269 + ```
  270 +
  271 +2. **Update configuration** in `offline_config.py`:
  272 + ```python
  273 + NEW_ALGORITHM_CONFIG = {
  274 + 'param1': value1,
  275 + 'param2': value2
  276 + }
  277 + ```
  278 +
  279 +3. **Add to execution pipeline** in `run.sh` or `run_all.py`
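The three steps above could take roughly the following shape. This is a skeleton under stated assumptions: `fetch_data` returns canned sessions instead of calling `db_service` (whose API is not shown here), the co-occurrence count is a toy stand-in for a real algorithm, and the `--top_n` and `--output` flags are hypothetical.

```python
import argparse
from collections import Counter
from itertools import combinations


def fetch_data(lookback_days):
    """Placeholder: a real script would load behavior data via db_service."""
    return [["sku1", "sku2"], ["sku1", "sku2", "sku3"]]


def process(sessions, top_n):
    """Toy similarity: count how often two items share a session."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    result = {}
    for (a, b), count in counts.items():
        result.setdefault(a, []).append((b, float(count)))
    # Highest score first; ties broken by item id for stable output.
    return {a: sorted(v, key=lambda p: (-p[1], p[0]))[:top_n] for a, v in result.items()}


def save_output(result, path):
    """Write lines in the i2i format: item_id<TAB>id1:score1,id2:score2."""
    with open(path, "w", encoding="utf-8") as f:
        for item in sorted(result):
            pairs = ",".join(f"{sim}:{score}" for sim, score in result[item])
            f.write(f"{item}\t{pairs}\n")


def main(argv=None):
    parser = argparse.ArgumentParser(description="Skeleton i2i algorithm")
    parser.add_argument("--lookback_days", type=int, default=730)
    parser.add_argument("--top_n", type=int, default=50)
    parser.add_argument("--output", default="i2i_example.txt")
    args = parser.parse_args(argv)
    save_output(process(fetch_data(args.lookback_days), args.top_n), args.output)
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard, and the defaults would come from `offline_config.py` rather than being hard-coded.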
  280 +
  281 +### Debugging Practices
  282 +
  283 +- **Use debug mode**: `--debug` flag for readable outputs
  284 +- **Check logs**: `logs/run_all_YYYYMMDD.log`
  285 +- **Validate data**: `debug_utils.py` provides data validation
  286 +- **Monitor memory**: System includes memory monitoring
  287 +
  288 +### Performance Optimization
  289 +
  290 +- **Database optimization**: Preprocessing reduces queries by 80-90%
  291 +- **C++ integration**: Critical for production performance
  292 +- **Parallel processing**: Multi-threaded algorithms
  293 +- **Memory management**: Configurable thresholds and monitoring
  294 +
  295 +### Code Quality
  296 +
  297 +This codebase does not have formal linting or testing frameworks configured. When making changes:
  298 +
  299 +- **Python**: Follow PEP 8 style guidelines
  300 +- **C++**: Use the existing coding style in collaboration/src/
  301 +- **No formal unit tests**: Test functionality manually using the debug modes
  302 +- **Manual testing**: Use `--debug` flags for readable outputs during development
  303 +
  304 +## 🔄 Maintenance & Operations
  305 +
  306 +### Daily Execution
  307 +```bash
  308 +# Recommended production crontab entry (runs daily at 02:00)
  309 +0 2 * * * cd /data/tw/recommendation/offline_tasks && bash run.sh
  310 +```
  311 +
  312 +### Monitoring
  313 +- **Logs**: `logs/` directory with date-based rotation
  314 +- **Memory**: Built-in memory monitoring with kill thresholds
  315 +- **Output Validation**: Automated data quality checks
  316 +- **Error Handling**: Comprehensive logging and recovery
  317 +
  318 +### Backup Strategy
  319 +- **Output files**: Daily snapshots in `output/`
  320 +- **Configuration**: Version-controlled configs
  321 +- **Logs**: 180-day retention with cleanup
  322 +
  323 +## 🎯 Key Architecture Decisions
  324 +
  325 +### 1. Hybrid Algorithm Approach
  326 +- **Problem**: Python Swing too slow for production (can take hours)
  327 +- **Solution**: C++ core for performance + Python for flexibility and debugging
  328 +- **Benefit**: C++ version is 10-100x faster, Python version provides enhanced features and readability
  329 +
  330 +### 2. Preprocessing Optimization
  331 +- **Problem**: Repeated database queries across algorithms
  332 +- **Solution**: Centralized metadata and session generation via `fetch_item_attributes.py` and `generate_session.py`
  333 +- **Benefit**: 80-90% reduction in database load
  334 +
  335 +### 3. Multi-dimensional Interest Aggregation
  336 +- **Problem**: Need for flexible recommendation personalization
  337 +- **Solution**: 11 dimensions with 3 list types each
  338 +- **Benefit**: Supports diverse business scenarios
  339 +
  340 +### 4. Tag-enhanced DeepWalk
  341 +- **Problem**: Recommendation homogeneity
  342 +- **Solution**: Content-aware random walks
  343 +- **Benefit**: Improved diversity and serendipity
  344 +
  345 +### 5. Environment Management
  346 +- **Problem**: Dependency isolation and reproducibility
  347 +- **Solution**: Conda environment named `tw`
  348 +- **Benefit**: Consistent Python environment across development and production
  349 +
  350 +## 📚 Documentation Resources
  351 +
  352 +### Core Documentation
  353 +- **[offline_tasks/doc/详细设计文档.md](offline_tasks/doc/详细设计文档.md)** - Complete system architecture
  354 +- **[offline_tasks/doc/离线索引数据规范.md](offline_tasks/doc/离线索引数据规范.md)** - Data format specifications
  355 +- **[offline_tasks/doc/Redis数据规范.md](offline_tasks/doc/Redis数据规范.md)** - Redis integration guide
  356 +- **[offline_tasks/README.md](offline_tasks/README.md)** - Quick start guide
  357 +
  358 +### Algorithm Documentation
  359 +- **[graphembedding/deepwalk/README.md](graphembedding/deepwalk/README.md)** - DeepWalk with tag enhancements
  360 +- **[offline_tasks/collaboration/README.md](offline_tasks/collaboration/README.md)** - C++ Swing algorithm
  361 +- **[offline_tasks/collaboration/Swing快速开始.md](offline_tasks/collaboration/Swing快速开始.md)** - Swing implementation guide
  362 +
  363 +## 🚨 Important Notes for Development
  364 +
  365 +1. **Environment**: Uses Conda environment `tw` - activate before running
  366 +2. **Database**: Read-only access to external database
  367 +3. **Redis**: Local instance for development, configurable for production
  368 +4. **Memory**: Algorithms are memory-intensive - monitor usage
  369 +5. **Output**: All files include date stamps for versioning
  370 +6. **Testing**: Always test with small datasets before production runs
  371 +
  372 +## 🔗 Related Components
  373 +
  374 +- **Online Services**: Redis-based recommendation serving
  375 +- **Elasticsearch**: Vector similarity search capabilities
  376 +- **Frontend APIs**: Recommendation interfaces for different platforms
  377 +- **Monitoring**: Performance metrics and error tracking
  378 +
  379 +---
  380 +
  381 +**Last Updated**: 2024-12-10
  382 +**Maintained by**: Recommendation System Team
  383 +**Status**: Production-ready with active development
索引缺失问题排查.md 0 → 100644
... ... @@ -0,0 +1,54 @@
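  1 +Please focus on which index dimensions exist and on where each one's data comes from and where it goes.
  2 +The online recommendation service currently returns empty results, and a check showed that the offline index data is empty. Please go through the index checklist below (including user-feature-related and fallback dimensions) so that each item can be verified or backfilled on the offline side. Default Redis DBs: recommendations use db3; user profiles use snapshot_db (the Redis snapshot_db in app_config, default 0).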
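  3 +I. Home page "Guess You Like" /recommendation/home
  4 +1) User-behavior I2I (requires behavior data in the user profile; without a profile these recall paths are empty)
  5 +item:similar:swing_cpp:{sku} (key_name=user_behavior_click/purchase)
  6 +item:similar:swing:{sku} (same as above)
  7 +item:similar:session_w2v:{sku} (same as above)
  8 +item:similar:deepwalk:{sku} (same as above)
  9 +item:similar:content_name:{sku} (same as above)
  10 +item:similar:content_pic:{sku} (same as above)
  11 +2) User brand preference (requires brands in the profile; empty without a profile)
  12 +interest:hot:{brand_id} (key_name=user_brand_preference; template interest:hot:{key}, where key is the brand ID as a string)
  13 +3) User category preference, ES recall (requires categories in the profile; empty without a profile)
  14 +Queries ES sale_category_all directly; does not depend on Redis indices.
  15 +4) User behavior category, ES recall (requires behavior categories in the profile; empty without a profile)
  16 +Same as above: queries ES and does not depend on Redis indices.
  17 +5) Fallback interest aggregation (should return results even without a profile, depending on whether these keys exist)
  18 +Platform hot: interest:hot:platform:{platform}, e.g. interest:hot:platform:essaone
  19 +Global hot: interest:global:global (template interest:global:{key}, with key=global here)
  20 +If a client dimension exists (the current configuration does not use client_key directly, but you can check whether interest:hot:client_platform:{client} and interest:hot:platform_client:{platform}_{client} exist for future extension)
  21 +II. Detail page "Everyone Is Viewing" /recommendation/detail
  22 +1) Current-item I2I (usable without a profile, as long as these keys exist)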
  23 +item:similar:swing_cpp:{sku_id}
  24 +item:similar:swing:{sku_id}
  25 +item:similar:session_w2v:{sku_id}
  26 +item:similar:deepwalk:{sku_id}
  27 +item:similar:content_name:{sku_id}
  28 +item:similar:content_pic:{sku_id}
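  29 +2) User-behavior I2I (requires behavior data in the user profile; empty without a profile)
  30 +Same as the home page behavior I2I path; depends on item:similar:swing_cpp|swing|session_w2v|deepwalk|content_name|content_pic:{sku}
  31 +III. Redis key templates (defined in one place)
  32 +I2I: item:similar:swing_cpp:{key} / swing / w2v / session_w2v / deepwalk / content_name / content_pic
  33 +Interest aggregation: interest:hot:{key}, interest:cart:{key}, interest:new:{key}, interest:global:{key}
  34 +IV. Key indices to check or backfill
  35 +Required (fallback and main paths):
  36 +interest:hot:platform:essaone (or the corresponding platform) - home page fallback
  37 +interest:global:global - home page fallback
  38 +item:similar:swing_cpp:{sku} - main recall path for the detail page; behavior recall for the home page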
  39 +item:similar:swing:{sku}
  40 +Recommended (coverage/diversity):
  41 +item:similar:session_w2v:{sku}
  42 +item:similar:deepwalk:{sku}
  43 +item:similar:content_name:{sku}
  44 +item:similar:content_pic:{sku}
  45 +Optional/extended fallback (if produced):
  46 +interest:hot:client_platform:{client}
  47 +interest:hot:platform_client:{platform}_{client}
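  48 +Category-level hot lists: interest:hot:category_level2:{id} and similar (not used directly by the current configuration, but available for extension if the data exists)
  49 +Other interest lists: interest:cart:*, interest:new:*
  50 +V. User profile data
  51 +Keys look like user_profile:{user_id} (in snapshot_db, default db0). A missing profile leaves behavior/preference recall empty, but the fallback hot lists still work as long as the interest and global hot keys above exist.
  52 +Based on this, you can check item by item, in the offline output and in Redis db3/snapshot_db, which keys are missing; first make sure the fallback hot lists and the current-item I2I keys exist.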
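The checklist could be automated along these lines. This is a sketch, not part of the repository: `audit_keys` and `FakeClient` are hypothetical names, and the only assumption about the client is an `exists(key)` method returning a truthy count, which is how redis-py's `Redis.exists` behaves.

```python
I2I_REQUIRED = ["swing_cpp", "swing"]
I2I_RECOMMENDED = ["session_w2v", "deepwalk", "content_name", "content_pic"]


def audit_keys(client, sample_skus, platform="essaone"):
    """Return {key: present} for the fallback keys and the I2I keys of some SKUs."""
    keys = [f"interest:hot:platform:{platform}", "interest:global:global"]
    for sku in sample_skus:
        for algorithm in I2I_REQUIRED + I2I_RECOMMENDED:
            keys.append(f"item:similar:{algorithm}:{sku}")
    return {key: bool(client.exists(key)) for key in keys}


class FakeClient:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self, present):
        self.present = set(present)

    def exists(self, key):
        return int(key in self.present)


report = audit_keys(FakeClient({"interest:global:global"}), sample_skus=["123"])
missing = [key for key, present in report.items() if not present]
```

Against a real instance, the fake could be replaced with `redis.Redis(host="localhost", port=6379, db=3)`, using a handful of SKU IDs known to have recent traffic.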
  53 +
  54 +Please analyze the logs, explain why there are no results, and propose a complete fix.