# Recommendation System Offline Tasks

The offline index-generation module of the recommendation system, covering multiple algorithms and data-processing tasks.

## 🚀 Quick Start

### Run All Tasks

```bash
cd /home/tw/recommendation/offline_tasks

# ⭐ Recommended: use run.sh (full pipeline, including Redis loading)
bash run.sh

# Fallback: use run_all.py (simplified; no C++ Swing or Redis)
python3 run_all.py --debug
```
  
**Notes**:
- `run.sh`: the main entry script; runs the full pipeline with memory monitoring and automatic Redis loading
- `run_all.py`: a simplified Python version that runs only the Python algorithm tasks

### Task Execution Order

```
Prerequisite tasks:
1. fetch_item_attributes.py     → fetch item attribute mappings
2. generate_session.py          → generate user-behavior sessions
3. collaboration/run.sh         → C++ Swing (high performance)

Core algorithm tasks:
4. i2i_swing.py                → Python Swing (supports the date dimension)
5. i2i_session_w2v.py          → Session W2V
6. i2i_deepwalk.py             → DeepWalk
7. i2i_content_similar.py      → content similarity
8. interest_aggregation.py     → interest aggregation
```
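
The ordering above can be sketched as a small sequential runner. This is only an illustration, not how `run.sh` or `run_all.py` are actually implemented: the `PIPELINE` list and `run_pipeline` function are hypothetical names, and the flags are the ones this README uses when running each task on its own.

```python
import subprocess

# (argv, working directory) pairs, in the order listed above.
# Flags are taken from the per-task commands in this README.
PIPELINE = [
    (["python3", "scripts/fetch_item_attributes.py"], "."),
    (["python3", "scripts/generate_session.py", "--lookback_days", "730"], "."),
    (["bash", "run.sh"], "collaboration"),
    (["python3", "scripts/i2i_swing.py", "--lookback_days", "730",
      "--use_daily_session", "--debug"], "."),
    (["python3", "scripts/i2i_session_w2v.py", "--lookback_days", "730", "--debug"], "."),
    (["python3", "scripts/i2i_deepwalk.py", "--lookback_days", "730", "--debug"], "."),
    (["python3", "scripts/i2i_content_similar.py"], "."),
    (["python3", "scripts/interest_aggregation.py", "--lookback_days", "730", "--debug"], "."),
]


def run_pipeline(steps=PIPELINE, runner=subprocess.run):
    """Run each step in order; with check=True, subprocess.run raises
    CalledProcessError on a non-zero exit, so later steps are skipped."""
    for argv, cwd in steps:
        runner(argv, cwd=cwd, check=True)
```

Passing the runner as a parameter keeps the ordering logic testable without launching any process.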
  
## 📚 Documentation

All documentation lives in the **`doc/`** directory:

- **[doc/快速开始.md](./doc/快速开始.md)** - Getting started for newcomers
- **[doc/Swing算法使用指南.md](./doc/Swing算法使用指南.md)** - Detailed Swing algorithm usage guide
- **[doc/系统改进总结-20241017.md](./doc/系统改进总结-20241017.md)** - Latest improvements
- **[doc/README.md](./doc/README.md)** - Full documentation index
  
## 🔧 Core Features

### 1. Prerequisite Task Optimizations

- **Item attribute caching**: fetched once and reused everywhere, cutting database queries by 80-90%
- **Session file reuse**: generated once and shared across algorithms
- **C++ Swing integration**: runs automatically for high-performance computation
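
The caching idea can be illustrated as follows: downstream tasks read the mapping file written by `fetch_item_attributes.py` instead of querying the database again. This is a minimal sketch, assuming the file is plain JSON; its internal structure is not documented here, and `load_item_mappings` is a hypothetical helper, not a function from the project.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)
def load_item_mappings(path="output/item_attributes_mappings.json"):
    """Parse the mapping file once per process.

    Later calls with the same path return the already parsed object,
    so neither the file nor the database is touched again.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Because the cached object is shared, callers should treat it as read-only.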

### 2. Algorithm Enhancements

- **Dual-dimension Swing**: considers both a user's overall behavior and their single-day behavior
- **Time decay**: optional time-weighted decay
- **Debug mode**: automatically generates a human-readable version (ID + name)
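
As an illustration of time decay, here is one common form: an exponential weight with a half-life. The README does not state which decay function the scripts use, so both the formula and the `half_life_days` parameter are assumptions made for this sketch.

```python
def time_decay_weight(days_ago, half_life_days=30.0):
    """Weight of a behavior that happened `days_ago` days in the past.

    The weight is 1.0 for today and halves every `half_life_days` days.
    """
    if days_ago < 0:
        raise ValueError("days_ago must be non-negative")
    return 0.5 ** (days_ago / half_life_days)
```

With the default half-life, a click from 30 days ago counts half as much as a click from today.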

### 3. Automated Pipeline

```bash
# Run every Python task with one command
# (use "bash run.sh" for the full pipeline, including C++ Swing and Redis loading)
python3 run_all.py --debug
```
  
Output files:
- `output/item_attributes_mappings.json` - ID mappings
- `output/session.txt.YYYYMMDD` - user sessions
- `collaboration/output/swing_similar.txt` - C++ Swing results
- `output/i2i_swing_YYYYMMDD.txt` - Python Swing results
- ... outputs of the other algorithms
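
The `YYYYMMDD` suffix in the file names above can be produced with `strftime`. The helper below is a hypothetical convenience for locating a given day's outputs; it covers only the two dated names listed here and is not part of the project.

```python
from datetime import date


def dated_output_paths(day):
    """Return the dated output paths for `day`, following the naming above."""
    stamp = day.strftime("%Y%m%d")
    return {
        "session": f"output/session.txt.{stamp}",
        "i2i_swing": f"output/i2i_swing_{stamp}.txt",
    }
```

For example, `dated_output_paths(date(2024, 10, 17))["session"]` is `output/session.txt.20241017`.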

## 📊 Performance Comparison

| Task | Before | After | Improvement |
|------|--------|-------|-------------|
| Database queries | 5-10 queries | 1 query | 80-90% ↓ |
| Swing performance | Python | C++ | 10-100x ↑ |
| Task management | Manual, step by step | Automated pipeline | Fully automated |

## 🛠️ Running Tasks Individually

### 1. Fetch Item Attributes

```bash
python3 scripts/fetch_item_attributes.py
```
  
### 2. Generate Sessions

```bash
python3 scripts/generate_session.py --lookback_days 730
```
  
### 3. C++ Swing

```bash
cd collaboration
bash run.sh
```

### 4. Python Swing (Supports the Date Dimension)

```bash
python3 scripts/i2i_swing.py --lookback_days 730 --use_daily_session --debug
```
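
For readers new to Swing, here is a minimal sketch of the textbook score: every pair of users who both interacted with items i and j contributes 1 / (alpha + number of items the two users share). This is not necessarily what `i2i_swing.py` or the C++ version computes; it ignores the daily-session dimension and time decay, and `swing_similarity` and `alpha` are names chosen for this example.

```python
from collections import defaultdict
from itertools import combinations


def swing_similarity(user_items, alpha=1.0):
    """Compute basic Swing scores.

    user_items: dict mapping a user id to the set of item ids they interacted with.
    Returns a dict mapping (item_i, item_j) with item_i < item_j to a score.
    Each unordered user pair is counted once.
    """
    scores = defaultdict(float)
    for u, v in combinations(sorted(user_items), 2):
        common = user_items[u] & user_items[v]
        if len(common) < 2:
            continue  # a user pair needs two shared items to link an item pair
        weight = 1.0 / (alpha + len(common))
        for i, j in combinations(sorted(common), 2):
            scores[(i, j)] += weight
    return dict(scores)
```

User pairs with a large overlap contribute less per item pair, which dampens the influence of very active users. The loop is quadratic in the number of users, so it only suits small inputs.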
  
### 5. Other Algorithms

```bash
# Session W2V
python3 scripts/i2i_session_w2v.py --lookback_days 730 --debug

# DeepWalk
python3 scripts/i2i_deepwalk.py --lookback_days 730 --debug

# Content similarity
python3 scripts/i2i_content_similar.py

# Interest aggregation
python3 scripts/interest_aggregation.py --lookback_days 730 --debug
```

## 📁 Project Structure

```
offline_tasks/
├── scripts/              # All task scripts
│   ├── fetch_item_attributes.py
│   ├── generate_session.py
│   ├── i2i_swing.py
│   ├── i2i_session_w2v.py
│   ├── i2i_deepwalk.py
│   ├── i2i_content_similar.py
│   ├── interest_aggregation.py
│   ├── add_names_to_swing.py
│   └── debug_utils.py
├── collaboration/        # C++ Swing algorithm
│   ├── src/
│   ├── bin/
│   ├── run.sh
│   └── output/
├── config/               # Configuration
│   └── offline_config.py
├── doc/                  # Documentation
│   ├── README.md
│   ├── 快速开始.md
│   ├── Swing算法使用指南.md
│   └── ...
├── output/               # Output directory
│   ├── item_attributes_mappings.json
│   ├── session.txt.*
│   └── *.txt
├── logs/                 # Log directory
├── run.sh                # Main entry script (recommended)
├── run_all.py            # Python version (simplified)
└── README.md             # This file
```
  
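## ⚙️ Configuration

Configuration file: `config/offline_config.py`

Main parameters:
```python
DEFAULT_LOOKBACK_DAYS = 730    # Lookback window in days
DEFAULT_I2I_TOP_N = 50         # Number of i2i recommendations
DEFAULT_INTEREST_TOP_N = 1000  # Number of interest-aggregation results

# Database settings
DB_CONFIG = {...}

# Algorithm parameters
I2I_CONFIG = {...}
```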
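
One way such defaults typically reach a script is as `argparse` defaults, so a command-line flag overrides the config value. The sketch below is an assumption about the wiring, not the project's actual code: `--lookback_days` and `--debug` appear in this README, while `--top_n` and `build_parser` are hypothetical, and the constants are copied inline to keep the example self-contained.

```python
import argparse

# Copied from the values shown above; a real script would import them
# from config/offline_config.py instead.
DEFAULT_LOOKBACK_DAYS = 730
DEFAULT_I2I_TOP_N = 50


def build_parser():
    """Build a CLI parser whose defaults come from the config constants."""
    parser = argparse.ArgumentParser(description="Example i2i task (illustrative)")
    parser.add_argument("--lookback_days", type=int, default=DEFAULT_LOOKBACK_DAYS,
                        help="how many days of behavior data to read")
    # Hypothetical flag, used here only to show a second config default.
    parser.add_argument("--top_n", type=int, default=DEFAULT_I2I_TOP_N,
                        help="number of similar items to keep")
    parser.add_argument("--debug", action="store_true",
                        help="also write human-readable output")
    return parser
```

Calling `build_parser().parse_args(["--lookback_days", "30"])` would override only the lookback window and leave the other defaults in place.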
  
## 🐛 Troubleshooting

### Common Issues

**1. Mapping file not found**
```bash
# Run the prerequisite task first
python3 scripts/fetch_item_attributes.py
```
  
**2. Session file not found**
```bash
# Generate the session file
python3 scripts/generate_session.py
```

**3. C++ Swing fails to compile**
```bash
cd collaboration
make clean
make
```
  
See [doc/故障排查指南.md](./doc/故障排查指南.md) for details.
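
The first two issues can be caught before any algorithm task starts with a quick pre-flight check. This is a minimal sketch: `missing_prerequisites` is a hypothetical helper, and it only tests for the mapping file and at least one `session.txt.*` file, using the paths listed in this README.

```python
from pathlib import Path


def missing_prerequisites(base_dir="."):
    """Return the prerequisite outputs that are missing under base_dir."""
    output_dir = Path(base_dir) / "output"
    missing = []
    if not (output_dir / "item_attributes_mappings.json").is_file():
        missing.append("output/item_attributes_mappings.json")
    if not (output_dir.is_dir() and any(output_dir.glob("session.txt.*"))):
        missing.append("output/session.txt.*")
    return missing
```

An empty list means both prerequisite tasks have produced their files; otherwise, run the matching command shown above.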

## 📝 Logs

Log locations:
- Main log: `logs/run_all_YYYYMMDD.log`
- Debug logs: `logs/debug/*.log`

Follow today's log:
```bash
tail -f logs/run_all_$(date +%Y%m%d).log
```
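
If no run has happened today, the command above points at a file that does not exist yet. A small helper can pick the most recent main log instead. This is a hypothetical sketch that relies only on the `run_all_YYYYMMDD.log` naming: because the date is zero-padded, sorting by name is the same as sorting by date.

```python
from pathlib import Path


def latest_run_log(log_dir="logs"):
    """Return the newest run_all_YYYYMMDD.log in log_dir, or None if there is none."""
    logs = sorted(Path(log_dir).glob("run_all_*.log"))
    return logs[-1] if logs else None
```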
  
## 🔗 Related Projects

- **Collaboration**: `../collaboration/` - C++ collaborative filtering
- **GraphEmbedding**: `../graphembedding/` - graph embeddings
- **Hot**: `../hot/` - popular-item recommendations
- **Frontend**: `../frontend/` - recommendation API

## 📞 More Information

- **Full documentation**: [doc/README.md](./doc/README.md)
- **Improvement summary**: [doc/系统改进总结-20241017.md](./doc/系统改进总结-20241017.md)
- **Troubleshooting**: [doc/故障排查指南.md](./doc/故障排查指南.md)

---

**Last updated**: 2024-10-17  
**Status**: ✅ Production-ready