From 214eaaa6bd107b5cd170915bd65065ec70a1e009 Mon Sep 17 00:00:00 2001 From: tangwang Date: Wed, 10 Dec 2025 22:00:11 +0800 Subject: [PATCH] docs --- CLAUDE.md | 383 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 索引缺失问题排查.md | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 437 insertions(+), 0 deletions(-) create mode 100644 CLAUDE.md create mode 100644 索引缺失问题排查.md diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..361f386 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,383 @@ +# CLAUDE.md + +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. + +## Project Overview + +This is a comprehensive **Recommendation System** built for a B2B e-commerce platform. The system generates offline recommendation indices including item-to-item similarity (i2i) and interest aggregation indices, supporting online recommendation services with high-performance algorithms. + +**Tech Stack**: Python 3.x, Pandas, NumPy, NetworkX, Gensim, C++ (Swing algorithm), Redis, Elasticsearch, MySQL + +## 🏗️ System Architecture + +### High-Level Components + +``` +recommendation/ +├── offline_tasks/ # Main offline processing engine +├── graphembedding/ # Graph-based embedding algorithms +├── refers/ # Reference materials and data +├── config.py # Global configuration +└── requirements.txt # Python dependencies +``` + +### Core Modules + +#### 1. Offline Tasks (`/offline_tasks/`) + +**Primary Purpose**: Generate recommendation indices through various ML algorithms + +**Key Features**: +- **4 i2i similarity algorithms**: Swing (C++ & Python), Session W2V, DeepWalk, Content-based +- **11-dimension interest aggregation**: Platform, client, category, supplier dimensions +- **Automated pipeline**: One-command execution with memory monitoring +- **High-performance C++ integration**: 10-100x faster Swing implementation + +**Directory Structure**: +``` +offline_tasks/ +├── scripts/ # All algorithm implementations +│ ├── fetch_item_attributes.py # Preprocessing: item metadata +│ ├── generate_session.py # Preprocessing: user sessions +│ ├── i2i_swing.py # Swing algorithm (Python) +│ ├── i2i_session_w2v.py # Session Word2Vec +│ ├── i2i_deepwalk.py # DeepWalk with tag-based walks +│ ├── i2i_content_similar.py # Content-based similarity +│ ├── interest_aggregation.py # Multi-dimensional aggregation +│ └── load_index_to_redis.py # Redis import +├── collaboration/ # C++ Swing algorithm (high-performance) +│ ├── src/ # C++ source files +│ ├── run.sh # Build and execute script +│ └── output/ # C++ algorithm outputs +├── config/ +│ └── offline_config.py # Configuration file +├── doc/ # Comprehensive documentation +├── output/ # Generated indices +├── logs/ # Execution logs +├── run.sh # Main execution script (⭐ RECOMMENDED) +└── README.md # Module documentation +``` + +#### 2. 
Graph Embedding (`/graphembedding/`) + +**Purpose**: Advanced graph-based embedding algorithms for content-aware recommendations + +**Components**: +- **DeepWalk**: Enhanced with tag-based random walks for diversity +- **Session W2V**: Session-based word embeddings +- **Improvements**: Tag-based walks, Softmax sampling, multi-process support + +#### 3. Configuration (`/config.py`) + +**Global Settings**: +- **Elasticsearch**: Host, credentials, index configuration +- **Redis**: Cache configuration, timeouts, expiration policies +- **Database**: External database connection parameters + +## 🚀 Development Workflow + +### Quick Start + +```bash +# 1. Install dependencies +cd /data/tw/recommendation/offline_tasks +bash install.sh + +# 2. Test connections +python3 test_connection.py + +# 3. Run full pipeline (recommended) +bash run.sh + +# 4. Run individual algorithms +python3 scripts/i2i_swing.py --lookback_days 730 --debug +python3 scripts/interest_aggregation.py --lookback_days 730 --top_n 1000 +``` + +### Common Development Commands + +**Setup and Installation:** +```bash +# Install Python dependencies +pip install -r requirements.txt + +# Build C++ Swing algorithm +cd offline_tasks/collaboration && make + +# Activate conda environment (required) +conda activate tw +``` + +**Testing:** +```bash +# Test database and Redis connections +python3 offline_tasks/test_connection.py + +# Test Elasticsearch connection +python3 offline_tasks/scripts/test_es_connection.py +``` + +**Build and Compilation:** +```bash +# Build C++ algorithms +cd offline_tasks/collaboration +make clean && make + +# Clean build artifacts +make clean +``` + +**Running Individual Components:** +```bash +# Generate session data +python3 offline_tasks/scripts/generate_session.py --lookback_days 730 --debug + +# Run C++ Swing algorithm +cd offline_tasks/collaboration && bash run.sh + +# Load indices to Redis +python3 offline_tasks/scripts/load_index_to_redis.py --redis-host localhost --redis-port 6379 +``` + +### Algorithm Execution Order + +The system follows this optimized execution pipeline: + +1. **Preprocessing Tasks** (Run once per session) + - `fetch_item_attributes.py` → Item metadata mapping + - `generate_session.py` → User behavior sessions + +2. **Core Algorithms** + - C++ Swing (`collaboration/run.sh`) → High-performance similarity + - Python Swing (`i2i_swing.py`) → Time-aware similarity + - Session W2V (`i2i_session_w2v.py`) → Sequence-based similarity + - DeepWalk (`i2i_deepwalk.py`) → Graph-based embeddings + - Content Similarity (`i2i_content_similar.py`) → Attribute-based + +3. 
**Post-processing**
+   - Interest Aggregation (`interest_aggregation.py`) → Multi-dimensional indices
+   - Redis Import (`load_index_to_redis.py`) → Online serving
+
+### Key Configuration Files
+
+#### Main Configuration (`offline_config.py`)
+```python
+# Critical settings
+DEFAULT_LOOKBACK_DAYS = 730    # Historical data window
+DEFAULT_I2I_TOP_N = 50         # Similar items per product
+DEFAULT_INTEREST_TOP_N = 1000  # Aggregated items per dimension
+
+# Algorithm parameters
+I2I_CONFIG = {
+    'swing': {'alpha': 0.5, 'threshold1': 0.5, 'threshold2': 0.5},
+    'session_w2v': {'vector_size': 128, 'window_size': 5},
+    'deepwalk': {'num_walks': 10, 'walk_length': 40}
+}
+
+# Behavior weights for different user actions
+behavior_weights = {
+    'purchase': 10.0,
+    'contactFactory': 5.0,
+    'addToCart': 3.0,
+    'addToPool': 2.0
+}
+```
+
+#### Database Configuration (`config.py`)
+```python
+# External database
+DB_CONFIG = {
+    'host': 'selectdb-cn-wuf3vsokg05-public.selectdbfe.rds.aliyuncs.com',
+    'port': '9030',
+    'database': 'datacenter',
+    'username': 'readonly',
+    'password': 'essa1234'
+}
+
+# Redis for online serving
+REDIS_CONFIG = {
+    'host': 'localhost',
+    'port': 6379,
+    'cache_expire_days': 180
+}
+```
+
+## 🔧 Key Algorithms & Features
+
+### 1. Swing Algorithm (Dual Implementation)
+
+**C++ Version** (Production):
+- **Performance**: 10-100x faster than Python
+- **Use Case**: Large-scale production processing
+- **Output**: Raw similarity scores
+- **Location**: `collaboration/`
+
+**Python Version** (Development/Enhanced):
+- **Features**: Time decay, daily session support
+- **Use Case**: Development, debugging, parameter tuning
+- **Output**: Normalized scores with readable names
+- **Location**: `scripts/i2i_swing.py`
+
+### 2. DeepWalk with Tag Enhancement
+
+**Innovative Features**:
+- **Tag-based walks**: 20% probability of content-guided walks
+- **Softmax sampling**: Temperature-controlled diversity
+- **Multi-process**: Parallel walk generation
+- **Purpose**: Solves recommendation homogeneity issues
+
+### 3. Interest Aggregation
+
+**Multi-dimensional Support**:
+- **7 single dimensions**: platform, client_platform, supplier, category_level1-4
+- **4 combined dimensions**: platform_client, platform_category2/3, client_category2
+- **3 list types per dimension**: hot (popular), cart (cart additions), new (recent), plus a global (overall) list
+
+## 📊 Data Pipeline
+
+### Input Data Sources
+- **User Behavior**: Purchase, contact, cart, pool interactions
+- **Item Metadata**: Categories, suppliers, attributes
+- **Session Data**: Time-weighted user behavior sequences
+
+### Output Formats
+```
+# i2i Similarity (item-to-item)
+item_id \t similar_id1:score1,similar_id2:score2,...
+
+# Interest Aggregation
+dimension:value \t item_id1,item_id2,item_id3,...
+
+# Redis Keys
+item:similar:swing_cpp:12345
+interest:hot:platform:pc
+```
+
+### Storage Architecture
+- **Redis**: Fast online serving (400MB memory footprint)
+- **Elasticsearch**: Vector similarity search
+- **Local Files**: Raw algorithm outputs for debugging
+
+## 🐛 Development Guidelines
+
+### Adding New Algorithms
+
+1. **Create script in `scripts/`** following the existing pattern (see the sketch after this list):
+   ```python
+   # Reuse the shared helpers: db_service, config.offline_config, debug_utils
+   # Follow the existing flow: fetch_data → process → save_output
+   ```
+
+2. **Update configuration** in `offline_config.py`:
+   ```python
+   NEW_ALGORITHM_CONFIG = {
+       'param1': value1,
+       'param2': value2
+   }
+   ```
+
+3. **Add to execution pipeline** in `run.sh` or `run_all.py`
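+
+For step 1, a minimal sketch of what such a script can look like is shown below. It is illustrative only: the function bodies are placeholders, the file name is hypothetical, and it does not use the real `db_service`/`debug_utils` APIs (follow an existing script such as `scripts/i2i_swing.py` for those). Only the CLI flags, the default values, and the i2i output format are taken from this document.
+
+```python
+# scripts/i2i_new_algorithm.py -- hypothetical file name, skeleton only
+import argparse
+
+
+def fetch_data(lookback_days):
+    """Placeholder: load behavior sessions (the real scripts go through db_service)."""
+    raise NotImplementedError
+
+
+def process(sessions, top_n):
+    """Placeholder: return {item_id: [(similar_id, score), ...]} for the new algorithm."""
+    raise NotImplementedError
+
+
+def save_output(index, path):
+    """Write the documented i2i format: item_id <TAB> sim1:score1,sim2:score2,..."""
+    with open(path, "w", encoding="utf-8") as f:
+        for item_id, neighbors in index.items():
+            pairs = ",".join(f"{sim}:{score:.4f}" for sim, score in neighbors)
+            f.write(f"{item_id}\t{pairs}\n")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--lookback_days", type=int, default=730)
+    parser.add_argument("--top_n", type=int, default=50)
+    parser.add_argument("--debug", action="store_true")
+    args = parser.parse_args()
+
+    sessions = fetch_data(args.lookback_days)
+    index = process(sessions, args.top_n)
+    save_output(index, "output/i2i_new_algorithm.txt")
+```
+
+The `--lookback_days`, `--top_n`, and `--debug` flags mirror the defaults used by the existing scripts (`DEFAULT_LOOKBACK_DAYS`, `DEFAULT_I2I_TOP_N`).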
+
+### Debugging Practices
+
+- **Use debug mode**: `--debug` flag for readable outputs
+- **Check logs**: `logs/run_all_YYYYMMDD.log`
+- **Validate data**: `debug_utils.py` provides data validation
+- **Monitor memory**: the system includes built-in memory monitoring
+
+### Performance Optimization
+
+- **Database optimization**: Preprocessing reduces queries by 80-90%
+- **C++ integration**: Critical for production performance
+- **Parallel processing**: Multi-threaded algorithms
+- **Memory management**: Configurable thresholds and monitoring
+
+### Code Quality
+
+This codebase does not have formal linting or testing frameworks configured. When making changes:
+
+- **Python**: Follow PEP 8 style guidelines
+- **C++**: Use the existing coding style in `collaboration/src/`
+- **No formal unit tests**: Test functionality manually using the debug modes
+- **Manual testing**: Use `--debug` flags for readable outputs during development
+
+## 🔄 Maintenance & Operations
+
+### Daily Execution
+```bash
+# Recommended crontab entry (runs the full pipeline daily at 02:00)
+0 2 * * * cd /data/tw/recommendation/offline_tasks && bash run.sh
+```
+
+### Monitoring
+- **Logs**: `logs/` directory with date-based rotation
+- **Memory**: Built-in memory monitoring with kill thresholds
+- **Output Validation**: Automated data quality checks
+- **Error Handling**: Comprehensive logging and recovery
+
+### Backup Strategy
+- **Output files**: Daily snapshots in `output/`
+- **Configuration**: Version-controlled configs
+- **Logs**: 180-day retention with cleanup
+
+## 🎯 Key Architecture Decisions
+
+### 1. Hybrid Algorithm Approach
+- **Problem**: The Python Swing implementation is too slow for production (runs can take hours)
+- **Solution**: C++ core for performance, Python for flexibility and debugging
+- **Benefit**: The C++ version is 10-100x faster; the Python version adds enhanced features and readability
+
+### 2. Preprocessing Optimization
+- **Problem**: Repeated database queries across algorithms
+- **Solution**: Centralized metadata and session generation via `fetch_item_attributes.py` and `generate_session.py`
+- **Benefit**: 80-90% reduction in database load
+
+### 3. Multi-dimensional Interest Aggregation
+- **Problem**: Need for flexible recommendation personalization
+- **Solution**: 11 dimensions with 3 list types each
+- **Benefit**: Supports diverse business scenarios
+
+### 4. Tag-enhanced DeepWalk
+- **Problem**: Recommendation homogeneity
+- **Solution**: Content-aware (tag-guided) random walks, sketched below
+- **Benefit**: Improved diversity and serendipity
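+
+To make decision 4 concrete, here is a rough sketch, not the project's implementation (see `graphembedding/` and `scripts/i2i_deepwalk.py` for that), of how a tag-guided step can be mixed into a random walk with temperature-controlled softmax sampling. The 20% tag probability and the default walk length come from this document; the `graph`, `item_tags`, and `tag_to_items` structures are assumed for illustration.
+
+```python
+import math
+import random
+
+
+def softmax_sample(candidates, scores, temperature=1.0):
+    """Sample one candidate with probability proportional to softmax(score / temperature)."""
+    m = max(scores)
+    weights = [math.exp((s - m) / temperature) for s in scores]  # subtract max for numerical stability
+    r = random.uniform(0, sum(weights))
+    acc = 0.0
+    for candidate, w in zip(candidates, weights):
+        acc += w
+        if r <= acc:
+            return candidate
+    return candidates[-1]
+
+
+def next_node(graph, item_tags, tag_to_items, current, tag_walk_prob=0.2, temperature=1.0):
+    """One walk step: with probability tag_walk_prob jump via a shared tag, else follow a weighted edge."""
+    if item_tags.get(current) and random.random() < tag_walk_prob:
+        tag = random.choice(item_tags[current])      # pick one of the current item's tags
+        return random.choice(tag_to_items[tag])      # jump to another item carrying that tag
+    neighbors = list(graph[current])                 # graph: {item: {neighbor: co-occurrence weight}}
+    if not neighbors:
+        return current
+    weights = [graph[current][n] for n in neighbors]
+    return softmax_sample(neighbors, weights, temperature)
+
+
+def random_walk(graph, item_tags, tag_to_items, start, walk_length=40):
+    """Generate one tag-enhanced walk; repeated per node to build the training corpus."""
+    walk = [start]
+    for _ in range(walk_length - 1):
+        walk.append(next_node(graph, item_tags, tag_to_items, walk[-1]))
+    return walk
+```
+
+Walks produced this way are then fed to Word2Vec-style training (the tech stack lists Gensim) to learn the item embeddings behind the DeepWalk i2i index.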
+### 5. Environment Management
+- **Problem**: Dependency isolation and reproducibility
+- **Solution**: Conda environment named `tw`
+- **Benefit**: Consistent Python environment across development and production
+
+## 📚 Documentation Resources
+
+### Core Documentation
+- **[offline_tasks/doc/详细设计文档.md](offline_tasks/doc/详细设计文档.md)** - Complete system architecture
+- **[offline_tasks/doc/离线索引数据规范.md](offline_tasks/doc/离线索引数据规范.md)** - Data format specifications
+- **[offline_tasks/doc/Redis数据规范.md](offline_tasks/doc/Redis数据规范.md)** - Redis integration guide
+- **[offline_tasks/README.md](offline_tasks/README.md)** - Quick start guide
+
+### Algorithm Documentation
+- **[graphembedding/deepwalk/README.md](graphembedding/deepwalk/README.md)** - DeepWalk with tag enhancements
+- **[collaboration/README.md](collaboration/README.md)** - C++ Swing algorithm
+- **[collaboration/Swing快速开始.md](collaboration/Swing快速开始.md)** - Swing implementation guide
+
+## 🚨 Important Notes for Development
+
+1. **Environment**: Uses the Conda environment `tw` - activate it before running
+2. **Database**: Read-only access to the external database
+3. **Redis**: Local instance for development, configurable for production
+4. **Memory**: Algorithms are memory-intensive - monitor usage
+5. **Output**: All files include date stamps for versioning
+6. **Testing**: Always test with small datasets before production runs
+
+## 🔗 Related Components
+
+- **Online Services**: Redis-based recommendation serving
+- **Elasticsearch**: Vector similarity search capabilities
+- **Frontend APIs**: Recommendation interfaces for different platforms
+- **Monitoring**: Performance metrics and error tracking
+
+---
+
+**Last Updated**: 2025-12-10
+**Maintained by**: Recommendation System Team
+**Status**: Production-ready with active development
diff --git a/索引缺失问题排查.md b/索引缺失问题排查.md
new file mode 100644
index 0000000..081a8bd
--- /dev/null
+++ b/索引缺失问题排查.md
@@ -0,0 +1,54 @@
+Please focus on which index dimensions exist and where each one's data comes from.
+The online recommendation service is currently returning empty results for everything; on inspection, the offline index data is empty. Please go through the index checklist below (including the user-feature and fallback dimensions) so each item can be verified/backfilled on the offline side. Redis DB defaults: recommendations use db3, user profiles use snapshot_db (Redis snapshot_db in app_config, default 0).
+I. Homepage "Guess You Like" /recommendation/home
+1) User-behavior I2I (requires behavior data from the user profile; without a profile these paths are empty)
+item:similar:swing_cpp:{sku} (key_name=user_behavior_click/purchase)
+item:similar:swing:{sku} (same as above)
+item:similar:session_w2v:{sku} (same as above)
+item:similar:deepwalk:{sku} (same as above)
+item:similar:content_name:{sku} (same as above)
+item:similar:content_pic:{sku} (same as above)
+2) User brand preference (requires brand data from the profile; empty without a profile)
+interest:hot:{brand_id} (key_name=user_brand_preference, template interest:hot:{key}, where key is the brand ID string)
+3) User category-preference ES recall (requires category data from the profile; empty without a profile)
+Queries ES sale_category_all directly; does not depend on Redis indices.
+4) User behavior-category ES recall (requires behavior categories from the profile; empty without a profile)
+Same as above: queries ES, no Redis index dependency.
+5) Fallback interest aggregation (should return results even without a profile, provided these keys exist)
+Platform hot list: interest:hot:platform:{platform}, e.g. interest:hot:platform:essaone
+Global hot list: interest:global:global (template interest:global:{key}, here key=global)
+If a client dimension is needed (the current config does not use client_key directly, but check whether interest:hot:client_platform:{client} and interest:hot:platform_client:{platform}_{client} exist for future extension)
+II. Detail page "Others Are Also Viewing" /recommendation/detail
+1) Current-item I2I (usable without a profile, as long as these keys exist)
+item:similar:swing_cpp:{sku_id}
+item:similar:swing:{sku_id}
+item:similar:session_w2v:{sku_id}
+item:similar:deepwalk:{sku_id}
+item:similar:content_name:{sku_id}
+item:similar:content_pic:{sku_id}
+2) User-behavior I2I (requires profile behavior; empty without a profile)
+Same paths as the homepage behavior I2I; depends on item:similar:swing_cpp|swing|session_w2v|deepwalk|content_name|content_pic:{sku}
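+
+A quick way to check these keys directly in Redis is a small script like the sketch below (an illustrative addition, not an existing tool in the repo). It uses redis-py; the localhost:6379 connection, db3 for recommendation indices, and snapshot_db=0 for profiles follow the defaults above, and the user id is a made-up example.
+
+```python
+import redis
+
+# Connection defaults from this document; adjust to the real environment.
+rec = redis.Redis(host="localhost", port=6379, db=3, decode_responses=True)       # recommendation indices
+profiles = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)  # snapshot_db (user profiles)
+
+# Fallback keys that must exist even for users without a profile.
+for key in ["interest:hot:platform:essaone", "interest:global:global"]:
+    print(f"{key}: {'OK' if rec.exists(key) else 'MISSING'}")
+
+# Rough coverage count per i2i index family (SCAN, not KEYS, to avoid blocking Redis).
+for algo in ["swing_cpp", "swing", "session_w2v", "deepwalk", "content_name", "content_pic"]:
+    count = sum(1 for _ in rec.scan_iter(match=f"item:similar:{algo}:*", count=1000))
+    print(f"item:similar:{algo}:* -> {count} keys")
+
+# Spot-check whether a given user has a profile (behavior/preference recall depends on it).
+user_id = "123456"  # hypothetical user id
+print("profile:", "OK" if profiles.exists(f"user_profile:{user_id}") else "MISSING")
+```
+
+If the fallback keys are missing or the item:similar counts are zero, the first places to look are the corresponding offline jobs (interest aggregation and the i2i algorithms) and the Redis import step (load_index_to_redis.py).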
+III. Redis key templates (unified definitions)
+I2I: item:similar:swing_cpp:{key} / swing / w2v / session_w2v / deepwalk / content_name / content_pic
+Interest aggregation: interest:hot:{key}, interest:cart:{key}, interest:new:{key}, interest:global:{key}
+IV. Checklist of key indices to verify/backfill
+Must have (fallback and main path):
+interest:hot:platform:essaone (or the corresponding platform), the homepage fallback
+interest:global:global, the homepage fallback
+item:similar:swing_cpp:{sku}, the detail-page main path and homepage behavior recall
+item:similar:swing:{sku}
+Recommended (coverage/diversity):
+item:similar:session_w2v:{sku}
+item:similar:deepwalk:{sku}
+item:similar:content_name:{sku}
+item:similar:content_pic:{sku}
+Optional/extended fallback (if produced):
+interest:hot:client_platform:{client}
+interest:hot:platform_client:{platform}_{client}
+Category-dimension hot lists: interest:hot:category_level2:{id} etc. (not used directly in the current config, but can be extended where data exists)
+Other interest lists: interest:cart:*, interest:new:*
+V. User profile data
+Keys look like user_profile:{user_id} (in snapshot_db, default db0). A missing profile makes behavior/preference recall empty, but the fallback hot lists still work as long as the interest/global hot keys above exist.
+Use this list to check, item by item, which keys are missing from the offline outputs and from Redis db3/snapshot_db; prioritize making sure the fallback hot lists and the current-item I2I exist.
+
+Please analyze the logs, explain why there are no results, and propose a complete fix.
-- libgit2 0.21.2