# Recommendation System Offline Tasks

Offline index generation module for the recommendation system. It bundles several recommendation algorithms and data-processing tasks.

## 🚀 Quick Start

### Run all tasks (recommended)

```bash
cd /home/tw/recommendation/offline_tasks

# Run every offline task (including C++ Swing)
python3 run_all.py

# Enable debug mode (verbose logs + human-readable output files)
python3 run_all.py --debug
```
  
### Task execution order

```
Prerequisite tasks:
1. fetch_item_attributes.py  → fetch the item attribute mappings
2. generate_session.py       → generate user behavior sessions
3. C++ Swing                 → high-performance i2i similarity computation

Core algorithm tasks:
4. Python Swing              → i2i with date-dimension support
5. Session W2V               → sequence-based embeddings
6. DeepWalk                  → graph-structure embeddings
7. Content similarity        → based on ES vectors
8. Interest aggregation      → multi-dimensional item aggregation
```
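The order above can be sketched as a small sequential runner. This is a minimal illustration, not the actual `run_all.py`: the `TASKS` table, the `run_pipeline` helper, its `runner` parameter, and the stop-on-first-failure policy are all assumptions. Only the script paths and flags are taken from this README.

```python
import subprocess

# Hypothetical task table: (name, command, working directory).
# Commands mirror the ones shown in this README; the real run_all.py may differ.
TASKS = [
    ("fetch_item_attributes", ["python3", "scripts/fetch_item_attributes.py"], None),
    ("generate_session", ["python3", "scripts/generate_session.py", "--lookback_days", "730"], None),
    ("cpp_swing", ["bash", "run.sh"], "../collaboration"),
    ("python_swing", ["python3", "scripts/i2i_swing.py", "--lookback_days", "730"], None),
    ("session_w2v", ["python3", "scripts/i2i_session_w2v.py", "--lookback_days", "730"], None),
    ("deepwalk", ["python3", "scripts/i2i_deepwalk.py", "--lookback_days", "730"], None),
    ("content_similar", ["python3", "scripts/i2i_content_similar.py"], None),
    ("interest_aggregation", ["python3", "scripts/interest_aggregation.py", "--lookback_days", "730"], None),
]


def run_pipeline(tasks, runner=subprocess.call):
    """Run tasks in order and stop at the first non-zero exit code.

    Returns the names of the tasks that finished successfully.
    """
    completed = []
    for name, command, cwd in tasks:
        exit_code = runner(command, cwd=cwd)
        if exit_code != 0:
            # Later tasks depend on earlier outputs, so this sketch aborts here.
            print(f"[FAILED] {name} exited with code {exit_code}")
            break
        completed.append(name)
    return completed
```

Calling `run_pipeline(TASKS)` from the `offline_tasks/` directory would execute the eight steps in the listed order. The `runner` argument exists only so the ordering logic can be exercised without launching real processes.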
  
## 📚 Documentation

All documentation lives in the **`doc/`** directory:

- **[doc/快速开始.md](./doc/快速开始.md)** - getting started for newcomers
- **[doc/Swing算法使用指南.md](./doc/Swing算法使用指南.md)** - detailed Swing usage guide
- **[doc/系统改进总结-20241017.md](./doc/系统改进总结-20241017.md)** - summary of the latest improvements
- **[doc/README.md](./doc/README.md)** - full documentation index
  
## 🔧 Core Features

### 1. Prerequisite task optimization

- **Item attribute cache**: fetched once and reused by every task, cutting database queries by up to 90%
- **Session file reuse**: generated once and shared by multiple algorithms
- **C++ Swing integration**: runs automatically as part of the pipeline for high-performance computation
  
### 2. Algorithm enhancements

- **Dual-dimension Swing**: considers both a user's overall behavior and their single-day behavior
- **Time decay**: optional time-based weight decay
- **Debug mode**: automatically generates a human-readable version of the output (ID + name)
  
### 3. Automated pipeline

```bash
# One command runs every task
python3 run_all.py --debug
```
  
Output files:
- `output/item_attributes_mappings.json` - ID mappings
- `output/session.txt.YYYYMMDD` - user sessions
- `collaboration/output/swing_similar.txt` - C++ Swing results
- `output/i2i_swing_YYYYMMDD.txt` - Python Swing results
- ... outputs of the other algorithms
  
## 📊 Performance Comparison

| Area | Before | After | Improvement |
|------|--------|-------|-------------|
| Database queries | 5-10 queries | 1 query | 80-90% fewer |
| Swing performance | Python | C++ | 10-100x faster |
| Task management | Manual, step by step | Automated pipeline | One command instead of manual steps |
  
## 🛠️ Running Tasks Individually

### 1. Fetch item attributes

```bash
python3 scripts/fetch_item_attributes.py
```
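Once this step has written `output/item_attributes_mappings.json`, downstream tasks can read that file instead of querying the database again. The sketch below shows one way to load it a single time per process. The `load_mappings` helper is hypothetical, and nothing is assumed about the file's contents beyond it being valid JSON.

```python
import json
from functools import lru_cache
from pathlib import Path

# Path as listed under the output files in this README.
MAPPING_PATH = "output/item_attributes_mappings.json"


@lru_cache(maxsize=None)
def load_mappings(path=MAPPING_PATH):
    """Parse the mapping file once; repeated calls return the cached object."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(
            f"{file_path} not found; run scripts/fetch_item_attributes.py first"
        )
    with file_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Because `lru_cache` does not cache exceptions, a call that fails on a missing file can simply be retried after the prerequisite task has run.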
  
### 2. Generate sessions

```bash
python3 scripts/generate_session.py --lookback_days 730
```
  
### 3. C++ Swing

```bash
cd ../collaboration
bash run.sh
```
  
### 4. Python Swing (with date-dimension support)

```bash
python3 scripts/i2i_swing.py --lookback_days 730 --use_daily_session --debug
```
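For orientation, the core of Swing fits in a few lines. The code below is the textbook form of the score, sim(i, j) = Σ 1 / (α + |I_u ∩ I_v|), summed over pairs of users (u, v) who both interacted with items i and j. It is not taken from `i2i_swing.py`. The `(user_id, item_id, day)` event format, the `by_day` switch (a guess at what `--use_daily_session` does), and `alpha=1.0` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_user_items(events, by_day=False):
    """Group items per user, or per (user, day) when by_day is True."""
    user_items = defaultdict(set)
    for user_id, item_id, day in events:
        key = (user_id, day) if by_day else user_id
        user_items[key].add(item_id)
    return user_items


def swing_similarity(user_items, alpha=1.0):
    """Score item pairs with the basic Swing formula.

    Every pair of users that shares both items contributes
    1 / (alpha + number of items the two users have in common).
    """
    # Invert the mapping: item -> set of users who interacted with it.
    item_users = defaultdict(set)
    for user, items in user_items.items():
        for item in items:
            item_users[item].add(user)

    scores = defaultdict(float)
    for item_i, item_j in combinations(sorted(item_users), 2):
        common_users = item_users[item_i] & item_users[item_j]
        for user_u, user_v in combinations(common_users, 2):
            overlap = len(user_items[user_u] & user_items[user_v])
            scores[(item_i, item_j)] += 1.0 / (alpha + overlap)
    return dict(scores)
```

This naive version compares every item pair and every user pair, so it is only practical on small samples. Pairs with fewer than two common users receive no score at all, which is why grouping by `(user, day)` can produce a sparser result than grouping by user.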
  
### 5. Other algorithms

```bash
# Session W2V
python3 scripts/i2i_session_w2v.py --lookback_days 730 --debug

# DeepWalk
python3 scripts/i2i_deepwalk.py --lookback_days 730 --debug

# Content similarity
python3 scripts/i2i_content_similar.py

# Interest aggregation
python3 scripts/interest_aggregation.py --lookback_days 730 --debug
```
  
## 📁 Project Structure

```
offline_tasks/
├── scripts/              # all task scripts
│   ├── fetch_item_attributes.py
│   ├── generate_session.py
│   ├── i2i_swing.py
│   ├── i2i_session_w2v.py
│   ├── i2i_deepwalk.py
│   ├── i2i_content_similar.py
│   ├── interest_aggregation.py
│   ├── add_names_to_swing.py
│   └── debug_utils.py
├── config/               # configuration
│   └── offline_config.py
├── doc/                  # documentation
│   ├── README.md
│   ├── 快速开始.md
│   ├── Swing算法使用指南.md
│   └── ...
├── output/               # output directory
│   ├── item_attributes_mappings.json
│   ├── session.txt.*
│   └── *.txt
├── logs/                 # log directory
├── run_all.py            # single entry point
└── README.md             # this file
```
  
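## ⚙️ Configuration

Configuration file: `config/offline_config.py`

Main parameters:
```python
DEFAULT_LOOKBACK_DAYS = 730    # lookback window in days
DEFAULT_I2I_TOP_N = 50         # number of i2i recommendations
DEFAULT_INTEREST_TOP_N = 1000  # number of interest-aggregation results

# Database settings
DB_CONFIG = {...}

# Algorithm parameters
I2I_CONFIG = {...}
```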
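A task script would typically expose these defaults as command-line flags. The sketch below assumes that pattern and has not been checked against the real scripts. `--lookback_days` and `--debug` appear in this README, while `--top_n` and the `parse_args` helper are hypothetical.

```python
import argparse

# Values copied from the parameters listed above; in the project they would
# presumably be imported from config/offline_config.py instead.
DEFAULT_LOOKBACK_DAYS = 730
DEFAULT_I2I_TOP_N = 50


def parse_args(argv=None):
    """Parse task options, falling back to the configured defaults."""
    parser = argparse.ArgumentParser(description="Example offline task")
    parser.add_argument("--lookback_days", type=int, default=DEFAULT_LOOKBACK_DAYS,
                        help="number of days of behavior data to read")
    # Hypothetical flag, shown only to illustrate wiring a second default.
    parser.add_argument("--top_n", type=int, default=DEFAULT_I2I_TOP_N,
                        help="number of similar items to keep")
    parser.add_argument("--debug", action="store_true",
                        help="enable verbose logs and readable output")
    return parser.parse_args(argv)
```

With this layout, changing a value in the config file changes the default for every script, and a flag such as `--lookback_days 30` overrides it for a single run.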
  
## 🐛 Troubleshooting

### Common issues

**1. Mapping file does not exist**
```bash
# Run the prerequisite task first
python3 scripts/fetch_item_attributes.py
```

**2. Session file not found**
```bash
# Generate the session file
python3 scripts/generate_session.py
```

**3. C++ Swing fails to compile**
```bash
cd ../collaboration
make clean
make
```
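The first two checks can be automated before launching an algorithm task. This is a sketch only: `missing_prerequisites` is a hypothetical helper, and the file names are the ones listed under the output files in this README.

```python
from pathlib import Path


def missing_prerequisites(base_dir="."):
    """Return (description, fix command) pairs for absent prerequisite files."""
    output_dir = Path(base_dir) / "output"
    problems = []

    if not (output_dir / "item_attributes_mappings.json").is_file():
        problems.append(
            ("item attribute mapping file", "python3 scripts/fetch_item_attributes.py")
        )

    # Session files carry a date suffix: session.txt.YYYYMMDD
    if not any(output_dir.glob("session.txt.*")):
        problems.append(("session file", "python3 scripts/generate_session.py"))

    return problems
```

An empty list means both prerequisite outputs are present. Otherwise each entry names what is missing and the command from this README that regenerates it.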
  
For details, see [doc/故障排查指南.md](./doc/故障排查指南.md).
  
## 📝 Logs

Log locations:
- Main log: `logs/run_all_YYYYMMDD.log`
- Debug logs: `logs/debug/*.log`

Follow today's main log:
```bash
tail -f logs/run_all_$(date +%Y%m%d).log
```
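If no run has happened today, the `tail` command above points at a file that does not exist yet. A small helper can pick the most recent main log instead. `latest_run_log` is a hypothetical name, and the sketch relies only on the `run_all_YYYYMMDD.log` naming shown above.

```python
from pathlib import Path


def latest_run_log(log_dir="logs"):
    """Return the newest run_all_YYYYMMDD.log, or None if there is none.

    The date is zero-padded, so sorting the file names alphabetically
    also sorts them chronologically.
    """
    candidates = sorted(Path(log_dir).glob("run_all_*.log"))
    return candidates[-1] if candidates else None
```

The returned path can then be passed to `tail -f` or opened directly from Python.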
  
## 🔗 Related Projects

- **Collaboration**: `../collaboration/` - C++ collaborative filtering
- **GraphEmbedding**: `../graphembedding/` - graph embeddings
- **Hot**: `../hot/` - popular-item recommendations
- **Frontend**: `../frontend/` - recommendation API
  
## 📞 More Information

- **Full documentation**: [doc/README.md](./doc/README.md)
- **Improvement summary**: [doc/系统改进总结-20241017.md](./doc/系统改进总结-20241017.md)
- **Troubleshooting**: [doc/故障排查指南.md](./doc/故障排查指南.md)
  
---

**Last updated**: 2024-10-17  
**Status**: ✅ Production-ready