Commit b57c6eb4c77fd9b8af3fffbb4557a8ed30047d71
Parent: 1d82b555
9 changed files, 1697 additions, 362 deletions
offline_tasks/CHANGES_SUMMARY.md
@@ -0,0 +1,326 @@
# Offline Task Update Summary

## Update Date
2025-10-17

## Changes

### 1. Rewrote the content-similarity index (`i2i_content_similar.py`)

#### What changed
- **From**: similarity computed from database item attributes (TF-IDF + cosine similarity)
- **To**: similarity computed from Elasticsearch vectors (KNN queries)

#### Simplifications
- **Removed**: all command-line arguments (`--method`, `--top_n`, `--output`, `--debug`)
- **Kept**: none; configuration now lives in the code
- **Generates**: two index files (name vectors + image vectors)

### 2. Simplified the runner script (`run_all.py`)

#### Removed arguments
- `--skip-i2i` - skip i2i tasks
- `--skip-interest` - skip interest aggregation
- `--only-swing` - run only Swing
- `--only-w2v` - run only W2V
- `--only-deepwalk` - run only DeepWalk
- `--only-content` - run only content similarity
- `--only-interest` - run only interest aggregation
- `--lookback_days` - lookback window in days
- `--top_n` - number of top items

#### Remaining argument
- `--debug` - debug mode (the only argument)

#### Usage
```bash
# Before
python run_all.py --lookback_days 30 --top_n 50 --skip-interest

# Now
python run_all.py
# or
python run_all.py --debug
```
| 45 | + | |
| 46 | +### 3. 更新Redis数据规范 (`REDIS_DATA_SPEC.md`) | |
| 47 | + | |
| 48 | +#### 新增索引 | |
| 49 | +- `i2i_content_name`: 基于名称向量的相似索引 | |
| 50 | +- `i2i_content_pic`: 基于图片向量的相似索引 | |
| 51 | + | |
| 52 | +#### 更新统计 | |
| 53 | +- Key数量: 245,000 → 270,000 | |
| 54 | +- 总内存: ~135MB → ~160MB | |
| 55 | + | |
| 56 | +### 4. 更新索引加载器 (`load_index_to_redis.py`) | |
| 57 | + | |
| 58 | +#### 更新 | |
| 59 | +- 添加 `content_name` 到i2i索引类型列表 | |
| 60 | +- 添加 `content_pic` 到i2i索引类型列表 | |
| 61 | +- 自动加载两个新的内容相似索引 | |
| 62 | + | |
| 63 | +### 5. 更新依赖 (`requirements.txt`) | |
| 64 | + | |
| 65 | +#### 新增 | |
| 66 | +``` | |
| 67 | +elasticsearch>=8.0.0 | |
| 68 | +``` | |
| 69 | + | |
| 70 | +### 6. 新增文档 | |
| 71 | + | |
| 72 | +#### ES向量相似度说明 (`ES_VECTOR_SIMILARITY.md`) | |
| 73 | +- ES配置说明 | |
| 74 | +- 工作流程详解 | |
| 75 | +- 性能说明和优化建议 | |
| 76 | +- 故障排查指南 | |
| 77 | + | |
| 78 | +#### 更新说明 (`CONTENT_SIMILARITY_UPDATE.md`) | |
| 79 | +- 更新概述 | |
| 80 | +- 主要变化 | |
| 81 | +- 使用指南 | |
| 82 | +- 技术细节 | |
| 83 | +- 性能说明 | |
| 84 | +- 与其他算法对比 | |
| 85 | + | |
| 86 | +#### 本文档 (`CHANGES_SUMMARY.md`) | |
| 87 | +- 所有变更的简要总结 | |
| 88 | + | |
| 89 | +### 7. 新增测试脚本 (`test_es_connection.py`) | |
| 90 | + | |
| 91 | +测试ES连接和向量查询功能: | |
| 92 | +- 测试ES连接 | |
| 93 | +- 测试索引是否存在 | |
| 94 | +- 测试向量字段映射 | |
| 95 | +- 测试查询商品向量 | |
| 96 | +- 测试KNN向量查询 | |
| 97 | + | |
| 98 | +## 文件清单 | |
| 99 | + | |
| 100 | +### 修改的文件 | |
| 101 | +1. ✅ `offline_tasks/scripts/i2i_content_similar.py` - 完全重写 | |
| 102 | +2. ✅ `offline_tasks/run_all.py` - 简化参数 | |
| 103 | +3. ✅ `offline_tasks/REDIS_DATA_SPEC.md` - 更新规范 | |
| 104 | +4. ✅ `offline_tasks/scripts/load_index_to_redis.py` - 添加新索引类型 | |
| 105 | +5. ✅ `requirements.txt` - 添加elasticsearch依赖 | |
| 106 | + | |
| 107 | +### 新增的文件 | |
| 108 | +6. ✅ `offline_tasks/scripts/ES_VECTOR_SIMILARITY.md` - ES向量说明 | |
| 109 | +7. ✅ `offline_tasks/CONTENT_SIMILARITY_UPDATE.md` - 更新说明 | |
| 110 | +8. ✅ `offline_tasks/CHANGES_SUMMARY.md` - 本文档 | |
| 111 | +9. ✅ `offline_tasks/scripts/test_es_connection.py` - ES测试脚本 | |
| 112 | + | |
| 113 | +## 使用指南 | |
| 114 | + | |
| 115 | +### 1. 安装依赖 | |
| 116 | + | |
| 117 | +```bash | |
| 118 | +pip install -r requirements.txt | |
| 119 | +``` | |
| 120 | + | |
| 121 | +### 2. 测试ES连接 | |
| 122 | + | |
| 123 | +```bash | |
| 124 | +cd /home/tw/recommendation/offline_tasks | |
| 125 | +python scripts/test_es_connection.py | |
| 126 | +``` | |
| 127 | + | |
| 128 | +### 3. 运行内容相似索引生成 | |
| 129 | + | |
| 130 | +```bash | |
| 131 | +# 单独运行 | |
| 132 | +python scripts/i2i_content_similar.py | |
| 133 | + | |
| 134 | +# 或通过run_all运行所有任务 | |
| 135 | +python run_all.py | |
| 136 | +``` | |
| 137 | + | |
| 138 | +### 4. 加载到Redis | |
| 139 | + | |
| 140 | +```bash | |
| 141 | +python scripts/load_index_to_redis.py | |
| 142 | +``` | |
| 143 | + | |
## Output Files

### New output files
- `output/i2i_content_name_YYYYMMDD.txt` - name-vector similarity index
- `output/i2i_content_pic_YYYYMMDD.txt` - image-vector similarity index

### File format
```
item_id \t item_name \t similar_id1:score1,similar_id2:score2,...
```

### Example
```
3302275	dried banana	3302276:0.9234,3302277:0.8756,3302278:0.8432
```
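A line in this format splits apart with a couple of `split` calls; a minimal sketch (the helper name `parse_index_line` is ours, not part of the repo):

```python
def parse_index_line(line):
    """Split one tab-separated index line into (item_id, item_name, [(similar_id, score), ...])."""
    item_id, item_name, similars = line.rstrip("\n").split("\t")
    pairs = [(sim_id, float(score))
             for sim_id, score in (entry.split(":") for entry in similars.split(","))]
    return item_id, item_name, pairs

item_id, name, pairs = parse_index_line("3302275\tdried banana\t3302276:0.9234,3302277:0.8756")
print(item_id, pairs[0])  # → 3302275 ('3302276', 0.9234)
```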

## Redis Keys

### New key formats
```
item:similar:content_name:{item_id}
item:similar:content_pic:{item_id}
```

### Value format
```json
[[similar_id1,score1],[similar_id2,score2],...]
```

### Query example
```python
import redis
import json

r = redis.Redis(host='localhost', port=6379, db=0)

# Name-vector similarity
similar = json.loads(r.get('item:similar:content_name:3302275'))
# Returns: [[3302276, 0.9234], [3302277, 0.8756], ...]

# Image-vector similarity
similar = json.loads(r.get('item:similar:content_pic:3302275'))
# Returns: [[4503826, 0.8123], [4503827, 0.7856], ...]
```

## Performance

### Content-similarity index generation
- Active items: ~50,000
- Runtime: 50-60 minutes
- Memory usage: < 2GB

### Redis storage
- New keys: ~100,000 (50,000 per index)
- Additional memory: ~50MB
- TTL: 30 days

## Compatibility

### Backward compatible
- ✅ The other i2i algorithms (Swing, W2V, DeepWalk) are unaffected
- ✅ Interest aggregation is unaffected
- ✅ The Redis loader is backward compatible
- ✅ The online query API is unaffected

### Breaking changes
- ❌ All command-line arguments of `i2i_content_similar.py` have changed
- ❌ The old `i2i_content_hybrid_*.txt` files are no longer generated

## Migration Guide

### If you previously used the content-similarity index

1. **Update script invocations**
   ```bash
   # Old version
   python i2i_content_similar.py --top_n 50 --method hybrid

   # New version
   python i2i_content_similar.py  # no arguments needed
   ```

2. **Update Redis keys**
   ```python
   # Old version
   r.get('item:similar:content:{item_id}')

   # New version (two options)
   r.get('item:similar:content_name:{item_id}')  # name similarity
   r.get('item:similar:content_pic:{item_id}')   # image similarity
   ```

3. **Update the online API**
   - If the API used the `content` algorithm, switch it to `content_name` or `content_pic`
   - Consider supporting both and letting the frontend choose, or blending them

## Tech Stack

### Newly introduced
- **Elasticsearch**: vector storage and KNN queries
- **KNN search**: vector-based similarity computation

### ES configuration
```python
ES_CONFIG = {
    'host': 'http://localhost:9200',
    'index_name': 'spu',
    'username': 'essa',
    'password': '4hOaLaf41y2VuI8y'
}
```

### Vector fields
- `embedding_name_zh`: name text vector (1024 dims, dot_product)
- `embedding_pic_h14.vector`: image vector (1024 dims, dot_product)
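
ES's `dot_product` similarity assumes vectors are indexed at unit length; under that assumption, the dot product equals cosine similarity. A quick illustrative sketch (pure Python, not from the repo):

```python
import math

def normalize(vec):
    """Scale a vector to unit length, as ES's dot_product similarity expects."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = normalize([3.0, 4.0]), normalize([4.0, 3.0])
print(round(dot(a, b), 2))  # → 0.96 (equals cosine similarity for unit vectors)
```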

## Advantages

### Compared with the old version
1. **Simpler**: no arguments to configure
2. **Faster**: ES KNN queries outperform the TF-IDF pipeline
3. **More accurate**: learned embeddings beat hand-crafted features
4. **More dimensions**: both name and image similarity

### Use cases
- **Name vectors**: semantically similar recommendations (same category, different brand)
- **Image vectors**: visually similar recommendations (similar-looking items)

## Notes

1. **ES dependency**: the Elasticsearch service must be reachable
2. **Vector data**: items must have vectors stored in ES
3. **Network latency**: ES query time depends on the network
4. **First run**: may be slow; test the connection first

## Troubleshooting

### ES connection failures
```bash
# Check whether ES is reachable
curl -u essa:4hOaLaf41y2VuI8y http://localhost:9200

# Run the test script
python scripts/test_es_connection.py
```

### Missing vector fields
```bash
# Inspect the ES mapping
curl -u essa:4hOaLaf41y2VuI8y http://localhost:9200/spu/_mapping
```

### Query timeouts
- Increase the `request_timeout` parameter
- Check the network connection
- Lower the `KNN_CANDIDATES` parameter

## Future Improvements

1. **Batch queries**: fetch vectors in bulk with `_mget`
2. **Concurrency**: multithreading to speed up queries
3. **Incremental updates**: only process changed items
4. **Vector caching**: avoid repeated ES lookups
5. **Monitoring**: add performance metrics and alerting

## Related Documents

- `ES_VECTOR_SIMILARITY.md` - detailed ES vector notes
- `CONTENT_SIMILARITY_UPDATE.md` - detailed update notes
- `REDIS_DATA_SPEC.md` - Redis data spec
- `README.md` - project overview

## Summary

This update greatly simplifies the content-similarity index: similarity is now computed from deep-learning vectors instead of item attributes, giving more accurate, multi-dimensional similar-item recommendations, while the pared-down configuration lowers usage and maintenance costs.

---

**Changes**: 9 files (5 modified, 4 added)
**Impact**: how the content-similarity index is generated and consumed
**Breaking**: moderate (API compatible, but arguments and key formats changed)
**Priority**: high (update soon)
offline_tasks/CONTENT_SIMILARITY_UPDATE.md
@@ -0,0 +1,243 @@
# Content-Similarity Index Update Notes

## 📋 Overview

`i2i_content_similar.py` was rewritten to compute similarity from Elasticsearch vectors instead of database attributes, producing two content-similarity indexes.

## 🔄 Main Changes

### 1. Algorithm change

**Before (old version):**
- Based on item attributes (category, supplier, packaging, etc.)
- TF-IDF + cosine similarity
- `--method` parameter to choose: tfidf / category / hybrid
- Complex parameter configuration

**Now (new version):**
- Vector similarity via Elasticsearch
- KNN vector queries
- **No arguments at all; works out of the box**
- Generates both indexes automatically

### 2. Generated indexes

| Index | Vector source | File name | Redis key | TTL |
|-------|---------------|-----------|-----------|-----|
| **Name-vector similarity** | `embedding_name_zh` | `i2i_content_name_YYYYMMDD.txt` | `item:similar:content_name:{item_id}` | 30 days |
| **Image-vector similarity** | `embedding_pic_h14` | `i2i_content_pic_YYYYMMDD.txt` | `item:similar:content_pic:{item_id}` | 30 days |

### 3. Simplified arguments

**Before:**
```bash
python i2i_content_similar.py --top_n 50 --method hybrid --debug
```

**Now:**
```bash
python i2i_content_similar.py
# That's it - no arguments needed
```

All settings are preset in the code:
- `TOP_N = 50`: return the top 50 similar items per item
- `KNN_K = 100`: KNN query returns 100 candidates
- `KNN_CANDIDATES = 200`: candidate pool size of 200

## 📝 Updated Files

### Core code
1. ✅ `offline_tasks/scripts/i2i_content_similar.py` - fully rewritten
   - Connects to Elasticsearch
   - Queries items active within the last year
   - Fetches vectors and computes similarity
   - Writes both indexes

### Configuration docs
2. ✅ `offline_tasks/REDIS_DATA_SPEC.md` - Redis data spec updated
   - Added the `i2i_content_name` spec
   - Added the `i2i_content_pic` spec
   - Updated memory estimates (270,000 keys, ~160MB)

### Scheduler script
3. ✅ `offline_tasks/run_all.py` - arguments simplified
   - Removed the `--only-*` arguments
   - Removed the `--skip-*` arguments
   - Removed `--lookback_days` and `--top_n`
   - Only `--debug` remains
   - Added the content-similarity task

4. ✅ `offline_tasks/scripts/load_index_to_redis.py` - loading logic updated
   - Added `content_name` and `content_pic` to the index type list

### New docs
5. ✅ `offline_tasks/scripts/ES_VECTOR_SIMILARITY.md` - ES vector similarity notes
   - ES configuration
   - Workflow walkthrough
   - Performance notes and tuning suggestions
   - Troubleshooting guide

6. ✅ `offline_tasks/CONTENT_SIMILARITY_UPDATE.md` - this document

## 🚀 Usage Guide

### Install dependencies

The Elasticsearch client is required:
```bash
pip install elasticsearch
```

### Configure the ES connection

Adjust the ES settings in `i2i_content_similar.py` if needed:
```python
ES_CONFIG = {
    'host': 'http://localhost:9200',
    'index_name': 'spu',
    'username': 'essa',
    'password': '4hOaLaf41y2VuI8y'
}
```

### Run the script

#### Standalone
```bash
cd /home/tw/recommendation/offline_tasks
python scripts/i2i_content_similar.py
```

#### Via run_all
```bash
python run_all.py
```

### Load into Redis
```bash
python scripts/load_index_to_redis.py --date 20251017
```

## 📊 Output Examples

### File format
```
3302275	dried banana	3302276:0.9234,3302277:0.8756,3302278:0.8432
```

### Redis storage
```
# Name-vector similarity
GET item:similar:content_name:3302275
# Returns: [[3302276,0.9234],[3302277,0.8756],[3302278,0.8432]]

# Image-vector similarity
GET item:similar:content_pic:3302275
# Returns: [[4503826,0.8123],[4503827,0.7856],[4503828,0.7645]]
```

## 🔍 Technical Details

### Data sources

1. **Active item list**
   - Source: the `sensors_events` database table
   - Filter: any activity within the last year
   - Event types: click, contactFactory, addToPool, addToCart, purchase

2. **Vector data**
   - Source: the Elasticsearch `spu` index
   - Fields:
     - `embedding_name_zh`: name text vector (1024 dims)
     - `embedding_pic_h14.vector`: image vector (1024 dims)
     - `name_zh`: item name in Chinese

### ES queries

#### 1. Fetch an item's vectors
```json
{
  "query": {
    "term": {"_id": "ITEM_ID"}
  },
  "_source": {
    "includes": ["_id", "name_zh", "embedding_name_zh", "embedding_pic_h14"]
  }
}
```

#### 2. KNN similarity query
```json
{
  "knn": {
    "field": "embedding_name_zh",
    "query_vector": [VECTOR],
    "k": 100,
    "num_candidates": 200
  },
  "_source": ["_id", "name_zh"],
  "size": 100
}
```
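
The KNN request body above can be built programmatically; a hedged sketch (the helper name is ours, and the commented client call assumes the `elasticsearch>=8` package from `requirements.txt`):

```python
def build_knn_query(field, query_vector, k=100, num_candidates=200):
    """Build the KNN search body shown above (field names follow this doc)."""
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["_id", "name_zh"],
        "size": k,
    }

body = build_knn_query("embedding_name_zh", [0.1] * 1024)
# With the official client this would presumably be sent as, e.g.:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(ES_CONFIG['host'],
#                      basic_auth=(ES_CONFIG['username'], ES_CONFIG['password']))
#   resp = es.search(index=ES_CONFIG['index_name'], body=body)
```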

## ⚡ Performance

### Runtime
- Active items: ~50,000
- Vector fetches: ~50,000 × 10ms ≈ 8-10 minutes
- KNN queries: ~50,000 × 50ms ≈ 40-50 minutes
- **Total: roughly 50-60 minutes**

### Tuning suggestions
1. Batch queries: fetch vectors in bulk with `_mget`
2. Concurrency: multithreading / async IO
3. Incremental updates: only process changed items
4. Vector caching: avoid repeated lookups

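The `_mget` suggestion could look roughly like this: batch doc IDs and fetch each batch in one round trip instead of one term query per item. An untested sketch (helper names and the chunk size are illustrative, not from the repo):

```python
def chunked(seq, size):
    """Yield fixed-size chunks so each _mget request stays small."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def build_mget_body(item_ids):
    """Request several docs' vectors in one round trip."""
    return {
        "docs": [
            {"_id": item_id,
             "_source": ["name_zh", "embedding_name_zh", "embedding_pic_h14"]}
            for item_id in item_ids
        ]
    }

batches = [build_mget_body(batch) for batch in chunked(["1", "2", "3"], 2)]
# Each body would presumably be sent as: es.mget(index="spu", body=body)
```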
## 🆚 Comparison with Other Algorithms

| Algorithm | Data source | Computation | Strength | Update cadence |
|-----------|-------------|-------------|----------|----------------|
| **Swing** | user behavior | co-occurrence | captures real interactions | daily |
| **W2V** | user sessions | sequence learning | captures sequential relations | daily |
| **DeepWalk** | behavior graph | random walks | finds deeper associations | daily |
| **Name vectors** | ES vectors | KNN query | semantic similarity | weekly |
| **Image vectors** | ES vectors | KNN query | visual similarity | weekly |

## 📋 To-Do

- [x] Rewrite `i2i_content_similar.py`
- [x] Update `REDIS_DATA_SPEC.md`
- [x] Simplify `run_all.py` arguments
- [x] Update `load_index_to_redis.py`
- [x] Write the technical docs
- [ ] Add unit tests
- [ ] Performance optimization (batch queries)
- [ ] Add monitoring and alerting

## ⚠️ Notes

1. **ES connection**: make sure the ES server is reachable
2. **Missing vectors**: items without vectors are skipped
3. **Network latency**: ES queries depend on the network; prefer an intranet deployment
4. **Memory usage**: watch memory when processing large item sets
5. **Dependencies**: the `elasticsearch` Python package must be installed

## 🔗 Related Documents

- `ES_VECTOR_SIMILARITY.md` - detailed ES vector similarity notes
- `REDIS_DATA_SPEC.md` - Redis data spec
- `OFFLINE_INDEX_SPEC.md` - offline index spec
- `QUICKSTART.md` - quick start guide

## 📞 Contact

Questions and suggestions: contact the development team.

---

**Updated**: 2025-10-17
**Version**: v2.0
**Author**: AI Assistant
@@ -0,0 +1,321 @@
# Quick Start - New Version

## 🚀 Up and running in 5 minutes

### 1. Install dependencies

```bash
cd /home/tw/recommendation
pip install -r requirements.txt
```

**New dependency**: `elasticsearch>=8.0.0`

### 2. Test the ES connection

```bash
cd offline_tasks
python scripts/test_es_connection.py
```

A ✓ means the test passed.

### 3. Run all tasks

```bash
python run_all.py
```

That's it - no arguments needed.

### 4. Load into Redis

```bash
python scripts/load_index_to_redis.py
```

## 📋 Running Individual Tasks

### i2i similarity indexes

```bash
# Swing
python scripts/i2i_swing.py --lookback_days 30 --top_n 50 --time_decay

# Session W2V
python scripts/i2i_session_w2v.py --lookback_days 30 --top_n 50 --save_model

# DeepWalk
python scripts/i2i_deepwalk.py --lookback_days 30 --top_n 50 --save_model

# Content similarity (ES vectors) - no arguments needed
python scripts/i2i_content_similar.py
```

### Interest aggregation

```bash
python scripts/interest_aggregation.py --lookback_days 30 --top_n 1000
```

## 🎯 Main Changes

### Simplification everywhere

#### Before (v1.0)
```bash
python run_all.py \
    --lookback_days 30 \
    --top_n 50 \
    --skip-interest \
    --only-content \
    --debug
```

#### Now (v2.0)
```bash
python run_all.py
# or
python run_all.py --debug  # enable debug mode
```

### Content-similarity index

#### Before
- 1 index: `i2i_content_hybrid_*.txt`
- Based on: item attributes (category, supplier, etc.)
- Arguments: `--method hybrid --top_n 50`

#### Now
- **2 indexes**:
  - `i2i_content_name_*.txt` (name vectors)
  - `i2i_content_pic_*.txt` (image vectors)
- Based on: Elasticsearch deep-learning vectors
- Arguments: **none**

## 📊 Output Files

### File locations
```
offline_tasks/output/
├── i2i_swing_20251017.txt                   # Swing similarity index
├── i2i_session_w2v_20251017.txt             # Session W2V similarity index
├── i2i_deepwalk_20251017.txt                # DeepWalk similarity index
├── i2i_content_name_20251017.txt            # name-vector similarity index ⭐ new
├── i2i_content_pic_20251017.txt             # image-vector similarity index ⭐ new
├── interest_aggregation_hot_20251017.txt    # hot items
├── interest_aggregation_cart_20251017.txt   # add-to-cart items
├── interest_aggregation_new_20251017.txt    # new items
└── interest_aggregation_global_20251017.txt # global hot items
```

### File format
```
item_id \t item_name \t similar_id1:score1,similar_id2:score2,...
```

## 🔍 Query Examples

### Python

```python
import redis
import json

# Connect to Redis
r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

# 1. Swing-similar items
similar = json.loads(r.get('item:similar:swing:123456'))
# Returns: [[234567, 0.8523], [345678, 0.7842], ...]

# 2. Name-vector-similar items ⭐ new
similar = json.loads(r.get('item:similar:content_name:123456'))
# Returns: [[234567, 0.9234], [345678, 0.8756], ...]

# 3. Image-vector-similar items ⭐ new
similar = json.loads(r.get('item:similar:content_pic:123456'))
# Returns: [[567890, 0.8123], [678901, 0.7856], ...]

# 4. Hot items
hot_items = json.loads(r.get('interest:hot:platform:PC'))
# Returns: [123456, 234567, 345678, ...]
```

### Redis CLI

```bash
# Connect to Redis
redis-cli

# Swing-similar items
GET item:similar:swing:123456

# Name-vector-similar items ⭐ new
GET item:similar:content_name:123456

# Image-vector-similar items ⭐ new
GET item:similar:content_pic:123456

# Hot items
GET interest:hot:platform:PC
```

## ⚙️ Configuration

### ES settings (i2i_content_similar.py)

```python
ES_CONFIG = {
    'host': 'http://localhost:9200',
    'index_name': 'spu',
    'username': 'essa',
    'password': '4hOaLaf41y2VuI8y'
}
```

### Algorithm settings (i2i_content_similar.py)

```python
TOP_N = 50            # similar items returned per item
KNN_K = 100           # candidates returned by each KNN query
KNN_CANDIDATES = 200  # candidate pool size
```

### Global settings (offline_config.py)

```python
DEFAULT_LOOKBACK_DAYS = 30     # lookback window in days
DEFAULT_I2I_TOP_N = 50         # i2i top N
DEFAULT_INTEREST_TOP_N = 1000  # interest aggregation top N
```

| 194 | + | |
| 195 | +### ES连接失败 | |
| 196 | + | |
| 197 | +```bash | |
| 198 | +# 1. 检查ES是否运行 | |
| 199 | +curl -u essa:4hOaLaf41y2VuI8y http://localhost:9200 | |
| 200 | + | |
| 201 | +# 2. 运行测试脚本 | |
| 202 | +python scripts/test_es_connection.py | |
| 203 | + | |
| 204 | +# 3. 检查配置 | |
| 205 | +# 编辑 scripts/i2i_content_similar.py 中的 ES_CONFIG | |
| 206 | +``` | |
| 207 | + | |
| 208 | +### 商品ID不存在 | |
| 209 | + | |
| 210 | +测试脚本默认使用 `item_id = "3302275"`,如果不存在: | |
| 211 | + | |
| 212 | +```python | |
| 213 | +# 编辑 test_es_connection.py | |
| 214 | +test_item_id = "你的商品ID" | |
| 215 | +``` | |
| 216 | + | |
| 217 | +### Redis连接失败 | |
| 218 | + | |
| 219 | +```bash | |
| 220 | +# 检查Redis配置 | |
| 221 | +cat offline_tasks/config/offline_config.py | grep REDIS | |
| 222 | + | |
| 223 | +# 测试Redis连接 | |
| 224 | +redis-cli ping | |
| 225 | +``` | |
| 226 | + | |
| 227 | +### 文件不存在 | |
| 228 | + | |
| 229 | +```bash | |
| 230 | +# 检查output目录 | |
| 231 | +ls -lh offline_tasks/output/ | |
| 232 | + | |
| 233 | +# 查看最新生成的文件 | |
| 234 | +ls -lht offline_tasks/output/ | head -10 | |
| 235 | +``` | |
| 236 | + | |
| 237 | +## 📚 详细文档 | |
| 238 | + | |
| 239 | +- **ES向量相似度**: `scripts/ES_VECTOR_SIMILARITY.md` | |
| 240 | +- **更新说明**: `CONTENT_SIMILARITY_UPDATE.md` | |
| 241 | +- **变更总结**: `CHANGES_SUMMARY.md` | |
| 242 | +- **Redis规范**: `REDIS_DATA_SPEC.md` | |
| 243 | + | |
## 🎓 Learning Path

### New users
1. Read this document ✓
2. Run `test_es_connection.py`
3. Run `run_all.py`
4. Inspect the `output/` directory
5. Load into Redis and query

### Advanced use
1. Read `ES_VECTOR_SIMILARITY.md`
2. Understand how vector similarity works
3. Tune ES query performance
4. Customize the algorithm parameters

### Developers
1. Read `CONTENT_SIMILARITY_UPDATE.md`
2. Understand the architecture
3. Read the source-code comments
4. Contribute improvements

## 🚨 Caveats

### ⚠️ Breaking changes

1. **All `i2i_content_similar.py` arguments changed**
   - Old: `--method`, `--top_n`, `--debug`
   - New: no arguments

2. **Redis key format changed**
   - Old: `item:similar:content:{item_id}`
   - New: `item:similar:content_name:{item_id}` and `item:similar:content_pic:{item_id}`

3. **Output files changed**
   - Old: `i2i_content_hybrid_*.txt`
   - New: `i2i_content_name_*.txt` and `i2i_content_pic_*.txt`

### ✅ Backward compatible

- Swing, W2V, and DeepWalk are unaffected
- Interest aggregation is unaffected
- The Redis loader is backward compatible
- The other i2i indexes keep working

## 💡 Best Practices

### Run cadence
- **Behavioral similarity** (Swing, W2V, DeepWalk): daily
- **Content similarity** (name vectors, image vectors): weekly
- **Interest aggregation**: daily

### Redis TTLs
- **Behavioral similarity**: 7 days
- **Content similarity**: 30 days
- **Interest aggregation**: 3-7 days

### Performance tips
1. Use `--debug` mode while debugging
2. Test on a small dataset first
3. Clean up expired data regularly
4. Monitor ES query performance

## 🎉 Summary

The new version is much simpler to use. Main improvements:

1. ✅ **No arguments**: `run_all.py` and `i2i_content_similar.py` run without arguments
2. ✅ **More capable**: deep-learning vectors give more accurate results
3. ✅ **Multi-dimensional**: both name and image similarity
4. ✅ **Faster**: ES KNN queries perform well
5. ✅ **Easier to maintain**: concise code, clear configuration

Switch to the new version and enjoy a simpler, more capable recommendation pipeline!

---

**Feedback**: if you run into problems, check the detailed docs or contact the development team
offline_tasks/REDIS_DATA_SPEC.md
@@ -23,7 +23,8 @@
 | **i2i_swing** | `output/i2i_swing_YYYYMMDD.txt` | `item_id\titem_name\tsimilar_id1:score1,...` | `item:similar:swing:{item_id}` | `[[similar_id1,score1],[similar_id2,score2],...]` | 7 days |
 | **i2i_session_w2v** | `output/i2i_session_w2v_YYYYMMDD.txt` | `item_id\titem_name\tsimilar_id1:score1,...` | `item:similar:w2v:{item_id}` | `[[similar_id1,score1],[similar_id2,score2],...]` | 7 days |
 | **i2i_deepwalk** | `output/i2i_deepwalk_YYYYMMDD.txt` | `item_id\titem_name\tsimilar_id1:score1,...` | `item:similar:deepwalk:{item_id}` | `[[similar_id1,score1],[similar_id2,score2],...]` | 7 days |
-| **i2i_content** | `output/i2i_content_hybrid_YYYYMMDD.txt` | `item_id\titem_name\tsimilar_id1:score1,...` | `item:similar:content:{item_id}` | `[[similar_id1,score1],[similar_id2,score2],...]` | 30 days |
+| **i2i_content_name** | `output/i2i_content_name_YYYYMMDD.txt` | `item_id\titem_name\tsimilar_id1:score1,...` | `item:similar:content_name:{item_id}` | `[[similar_id1,score1],[similar_id2,score2],...]` | 30 days |
+| **i2i_content_pic** | `output/i2i_content_pic_YYYYMMDD.txt` | `item_id\titem_name\tsimilar_id1:score1,...` | `item:similar:content_pic:{item_id}` | `[[similar_id1,score1],[similar_id2,score2],...]` | 30 days |
 | **interest_hot** | `output/interest_aggregation_hot_YYYYMMDD.txt` | `dimension_key\titem_id1,item_id2,...` | `interest:hot:{dimension_key}` | `[item_id1,item_id2,item_id3,...]` | 3 days |
 | **interest_cart** | `output/interest_aggregation_cart_YYYYMMDD.txt` | `dimension_key\titem_id1,item_id2,...` | `interest:cart:{dimension_key}` | `[item_id1,item_id2,item_id3,...]` | 3 days |
 | **interest_new** | `output/interest_aggregation_new_YYYYMMDD.txt` | `dimension_key\titem_id1,item_id2,...` | `interest:new:{dimension_key}` | `[item_id1,item_id2,item_id3,...]` | 3 days |
@@ -246,12 +247,13 @@ TTL item:similar:swing:12345
 | i2i_swing | 50,000 | ~500B | ~25MB |
 | i2i_w2v | 50,000 | ~500B | ~25MB |
 | i2i_deepwalk | 50,000 | ~500B | ~25MB |
-| i2i_content | 50,000 | ~500B | ~25MB |
+| i2i_content_name | 50,000 | ~500B | ~25MB |
+| i2i_content_pic | 50,000 | ~500B | ~25MB |
 | interest_hot | 10,000 | ~1KB | ~10MB |
 | interest_cart | 10,000 | ~1KB | ~10MB |
 | interest_new | 5,000 | ~1KB | ~5MB |
 | interest_global | 10,000 | ~1KB | ~10MB |
-| **Total** | **245,000** | - | **~135MB** |
+| **Total** | **270,000** | - | **~160MB** |

 ### Expiration Policy
offline_tasks/run_all.py
@@ -81,17 +81,6 @@ def run_script(script_name, args=None):

 def main():
     parser = argparse.ArgumentParser(description='Run all offline recommendation tasks')
-    parser.add_argument('--skip-i2i', action='store_true', help='Skip i2i tasks')
-    parser.add_argument('--skip-interest', action='store_true', help='Skip interest aggregation')
-    parser.add_argument('--only-swing', action='store_true', help='Run only Swing algorithm')
-    parser.add_argument('--only-w2v', action='store_true', help='Run only Session W2V')
-    parser.add_argument('--only-deepwalk', action='store_true', help='Run only DeepWalk')
-    parser.add_argument('--only-content', action='store_true', help='Run only Content-based similarity')
-    parser.add_argument('--only-interest', action='store_true', help='Run only interest aggregation')
-    parser.add_argument('--lookback_days', type=int, default=DEFAULT_LOOKBACK_DAYS,
-                        help=f'Lookback days (default: {DEFAULT_LOOKBACK_DAYS}, adjust in offline_config.py)')
-    parser.add_argument('--top_n', type=int, default=DEFAULT_I2I_TOP_N,
-                        help=f'Top N similar items (default: {DEFAULT_I2I_TOP_N})')
     parser.add_argument('--debug', action='store_true',
                         help='Enable debug mode for all tasks (detailed logs + readable output files)')

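After these removals, the script's entire CLI surface is `--debug`. A minimal sketch of the resulting parser (reconstructed from the kept lines above; the `parse_args` wrapper is ours, not the actual file):

```python
import argparse

def parse_args(argv=None):
    """Build the simplified run_all CLI: only --debug remains."""
    parser = argparse.ArgumentParser(description='Run all offline recommendation tasks')
    parser.add_argument('--debug', action='store_true',
                        help='Enable debug mode for all tasks (detailed logs + readable output files)')
    return parser.parse_args(argv)

args = parse_args([])
print(args.debug)  # → False
```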
| ... | ... | @@ -107,86 +96,72 @@ def main(): |
| 107 | 96 | total_count = 0 |
| 108 | 97 | |
| 109 | 98 | # i2i 行为相似任务 |
| 110 | - if not args.skip_i2i: | |
| 111 | - # 1. Swing算法 | |
| 112 | - if not args.only_w2v and not args.only_deepwalk and not args.only_interest and not args.only_content: | |
| 113 | - logger.info("\n" + "="*80) | |
| 114 | - logger.info("Task 1: Running Swing algorithm for i2i similarity") | |
| 115 | - logger.info("="*80) | |
| 116 | - total_count += 1 | |
| 117 | - script_args = [ | |
| 118 | - '--lookback_days', str(args.lookback_days), | |
| 119 | - '--top_n', str(args.top_n), | |
| 120 | - '--time_decay' | |
| 121 | - ] | |
| 122 | - if args.debug: | |
| 123 | - script_args.append('--debug') | |
| 124 | - if run_script('i2i_swing.py', script_args): | |
| 125 | - success_count += 1 | |
| 126 | - | |
| 127 | - # 2. Session W2V | |
| 128 | - if not args.only_swing and not args.only_deepwalk and not args.only_interest and not args.only_content: | |
| 129 | - logger.info("\n" + "="*80) | |
| 130 | - logger.info("Task 2: Running Session Word2Vec for i2i similarity") | |
| 131 | - logger.info("="*80) | |
| 132 | - total_count += 1 | |
| 133 | - script_args = [ | |
| 134 | - '--lookback_days', str(args.lookback_days), | |
| 135 | - '--top_n', str(args.top_n), | |
| 136 | - '--save_model' | |
| 137 | - ] | |
| 138 | - if args.debug: | |
| 139 | - script_args.append('--debug') | |
| 140 | - if run_script('i2i_session_w2v.py', script_args): | |
| 141 | - success_count += 1 | |
| 142 | - | |
| 143 | - # 3. DeepWalk | |
| 144 | - if not args.only_swing and not args.only_w2v and not args.only_interest and not args.only_content: | |
| 145 | - logger.info("\n" + "="*80) | |
| 146 | - logger.info("Task 3: Running DeepWalk for i2i similarity") | |
| 147 | - logger.info("="*80) | |
| 148 | - total_count += 1 | |
| 149 | - script_args = [ | |
| 150 | - '--lookback_days', str(args.lookback_days), | |
| 151 | - '--top_n', str(args.top_n), | |
| 152 | - '--save_model', | |
| 153 | - '--save_graph' | |
| 154 | - ] | |
| 155 | - if args.debug: | |
| 156 | - script_args.append('--debug') | |
| 157 | - if run_script('i2i_deepwalk.py', script_args): | |
| 158 | - success_count += 1 | |
| 159 | - | |
| 160 | - # 4. Content-based similarity | |
| 161 | -# if not args.only_swing and not args.only_w2v and not args.only_deepwalk and not args.only_interest: | |
| 162 | -# logger.info("\n" + "="*80) | |
| 163 | -# logger.info("Task 4: Running Content-based similarity") | |
| 164 | -# logger.info("="*80) | |
| 165 | -# total_count += 1 | |
| 166 | -# script_args = [ | |
| 167 | -# '--top_n', str(args.top_n), | |
| 168 | -# '--method', 'hybrid' | |
| 169 | -# ] | |
| 170 | -# if args.debug: | |
| 171 | -# script_args.append('--debug') | |
| 172 | -# if run_script('i2i_content_similar.py', script_args): | |
| 173 | -# success_count += 1 | |
| 174 | - | |
| 175 | -    # interest aggregation task | |
| 176 | - if not args.skip_interest: | |
| 177 | - if not args.only_swing and not args.only_w2v and not args.only_deepwalk and not args.only_content: | |
| 178 | - logger.info("\n" + "="*80) | |
| 179 | - logger.info("Task 5: Running interest aggregation") | |
| 180 | - logger.info("="*80) | |
| 181 | - total_count += 1 | |
| 182 | - script_args = [ | |
| 183 | - '--lookback_days', str(args.lookback_days), | |
| 184 | - '--top_n', str(DEFAULT_INTEREST_TOP_N) | |
| 185 | - ] | |
| 186 | - if args.debug: | |
| 187 | - script_args.append('--debug') | |
| 188 | - if run_script('interest_aggregation.py', script_args): | |
| 189 | - success_count += 1 | |
| 99 | + logger.info("\n" + "="*80) | |
| 100 | + logger.info("Task 1: Running Swing algorithm for i2i similarity") | |
| 101 | + logger.info("="*80) | |
| 102 | + total_count += 1 | |
| 103 | + script_args = [ | |
| 104 | + '--lookback_days', str(DEFAULT_LOOKBACK_DAYS), | |
| 105 | + '--top_n', str(DEFAULT_I2I_TOP_N), | |
| 106 | + '--time_decay' | |
| 107 | + ] | |
| 108 | + if args.debug: | |
| 109 | + script_args.append('--debug') | |
| 110 | + if run_script('i2i_swing.py', script_args): | |
| 111 | + success_count += 1 | |
| 112 | + | |
| 113 | + # 2. Session W2V | |
| 114 | + logger.info("\n" + "="*80) | |
| 115 | + logger.info("Task 2: Running Session Word2Vec for i2i similarity") | |
| 116 | + logger.info("="*80) | |
| 117 | + total_count += 1 | |
| 118 | + script_args = [ | |
| 119 | + '--lookback_days', str(DEFAULT_LOOKBACK_DAYS), | |
| 120 | + '--top_n', str(DEFAULT_I2I_TOP_N), | |
| 121 | + '--save_model' | |
| 122 | + ] | |
| 123 | + if args.debug: | |
| 124 | + script_args.append('--debug') | |
| 125 | + if run_script('i2i_session_w2v.py', script_args): | |
| 126 | + success_count += 1 | |
| 127 | + | |
| 128 | + # 3. DeepWalk | |
| 129 | + logger.info("\n" + "="*80) | |
| 130 | + logger.info("Task 3: Running DeepWalk for i2i similarity") | |
| 131 | + logger.info("="*80) | |
| 132 | + total_count += 1 | |
| 133 | + script_args = [ | |
| 134 | + '--lookback_days', str(DEFAULT_LOOKBACK_DAYS), | |
| 135 | + '--top_n', str(DEFAULT_I2I_TOP_N), | |
| 136 | + '--save_model', | |
| 137 | + '--save_graph' | |
| 138 | + ] | |
| 139 | + if args.debug: | |
| 140 | + script_args.append('--debug') | |
| 141 | + if run_script('i2i_deepwalk.py', script_args): | |
| 142 | + success_count += 1 | |
| 143 | + | |
| 144 | + # 4. Content-based similarity (ES vectors) | |
| 145 | + logger.info("\n" + "="*80) | |
| 146 | + logger.info("Task 4: Running Content-based similarity (ES vectors)") | |
| 147 | + logger.info("="*80) | |
| 148 | + total_count += 1 | |
| 149 | + if run_script('i2i_content_similar.py', []): | |
| 150 | + success_count += 1 | |
| 151 | + | |
| 152 | +    # 5. Interest aggregation task | |
| 153 | + logger.info("\n" + "="*80) | |
| 154 | + logger.info("Task 5: Running interest aggregation") | |
| 155 | + logger.info("="*80) | |
| 156 | + total_count += 1 | |
| 157 | + script_args = [ | |
| 158 | + '--lookback_days', str(DEFAULT_LOOKBACK_DAYS), | |
| 159 | + '--top_n', str(DEFAULT_INTEREST_TOP_N) | |
| 160 | + ] | |
| 161 | + if args.debug: | |
| 162 | + script_args.append('--debug') | |
| 163 | + if run_script('interest_aggregation.py', script_args): | |
| 164 | + success_count += 1 | |
| 190 | 165 | |
| 191 | 166 | # summary
| 192 | 167 | logger.info("\n" + "="*80) | ... | ... |
| ... | ... | @@ -0,0 +1,252 @@ |
| 1 | +# ES Vector Similarity Index Generation | |
| 2 | + | |
| 3 | +## Overview | |
| 4 | + | |
| 5 | +The `i2i_content_similar.py` script fetches item vectors from Elasticsearch and builds two content-similarity indexes: | |
| 6 | + | |
| 7 | +1. **Similarity based on name text vectors** (`i2i_content_name`) | |
| 8 | +2. **Similarity based on image vectors** (`i2i_content_pic`) | |
| 9 | + | |
| 10 | +## Usage | |
| 11 | + | |
| 12 | +### Running the script | |
| 13 | + | |
| 14 | +```bash | |
| 15 | +cd /home/tw/recommendation/offline_tasks | |
| 16 | +python scripts/i2i_content_similar.py | |
| 17 | +``` | |
| 18 | + | |
| 19 | +The script takes no arguments; all configuration is set in the code. | |
| 20 | + | |
| 21 | +### Configuration | |
| 22 | + | |
| 23 | +Built-in configuration (at the top of `i2i_content_similar.py`): | |
| 24 | + | |
| 25 | +```python | |
| 26 | +# ES settings | |
| 27 | +ES_CONFIG = { | |
| 28 | +    'host': 'http://localhost:9200', | |
| 29 | +    'index_name': 'spu', | |
| 30 | +    'username': 'essa', | |
| 31 | +    'password': '4hOaLaf41y2VuI8y' | |
| 32 | +} | |
| 33 | + | |
| 34 | +# Algorithm parameters | |
| 35 | +TOP_N = 50  # number of similar items returned per item | |
| 36 | +KNN_K = 100  # number of candidates returned by the kNN query | |
| 37 | +KNN_CANDIDATES = 200  # size of the kNN candidate pool | |
| 38 | +``` | |
| 39 | + | |
| 40 | +## Workflow | |
| 41 | + | |
| 42 | +### 1. Fetch active items | |
| 43 | + | |
| 44 | +Query the database for items with user activity in the past year: | |
| 45 | + | |
| 46 | +```sql | |
| 47 | +SELECT DISTINCT item_id | |
| 48 | +FROM sensors_events | |
| 49 | +WHERE event IN ('click', 'contactFactory', 'addToPool', 'addToCart', 'purchase') | |
| 50 | +  AND create_time >= '<one year ago>' | |
| 51 | +  AND item_id IS NOT NULL | |
| 52 | +``` | |
| 53 | + | |
| 54 | +### 2. Fetch vectors from ES | |
| 55 | + | |
| 56 | +For each active item, query Elasticsearch: | |
| 57 | + | |
| 58 | +```json | |
| 59 | +{ | |
| 60 | +  "query": { | |
| 61 | +    "term": { | |
| 62 | +      "_id": "<item_id>" | |
| 63 | +    } | |
| 64 | +  }, | |
| 65 | +  "_source": { | |
| 66 | +    "includes": ["_id", "name_zh", "embedding_name_zh", "embedding_pic_h14"] | |
| 67 | +  } | |
| 68 | +} | |
| 69 | +``` | |
| 70 | + | |
| 71 | +Returned fields: | |
| 72 | +- `_id`: item ID | |
| 73 | +- `name_zh`: Chinese name (used for debug output) | |
| 74 | +- `embedding_name_zh`: name text vector (1024-dim) | |
| 75 | +- `embedding_pic_h14`: list of image vectors, each element containing: | |
| 76 | +  - `vector`: the vector (1024-dim) | |
| 77 | +  - `url`: the image URL | |
| 78 | + | |
| 79 | +### 3. kNN vector-similarity query | |
| 80 | + | |
| 81 | +Query for similar items using the item's own vector: | |
| 82 | + | |
| 83 | +**Name-vector query:** | |
| 84 | +```json | |
| 85 | +{ | |
| 86 | +  "knn": { | |
| 87 | +    "field": "embedding_name_zh", | |
| 88 | +    "query_vector": [vector values], | |
| 89 | +    "k": 100, | |
| 90 | +    "num_candidates": 200 | |
| 91 | +  }, | |
| 92 | +  "_source": ["_id", "name_zh"], | |
| 93 | +  "size": 100 | |
| 94 | +} | |
| 95 | +``` | |
| 96 | + | |
| 97 | +**Image-vector query:** | |
| 98 | +```json | |
| 99 | +{ | |
| 100 | +  "knn": { | |
| 101 | +    "field": "embedding_pic_h14.vector", | |
| 102 | +    "query_vector": [vector values], | |
| 103 | +    "k": 100, | |
| 104 | +    "num_candidates": 200 | |
| 105 | +  }, | |
| 106 | +  "_source": ["_id", "name_zh"], | |
| 107 | +  "size": 100 | |
| 108 | +} | |
| 109 | +``` | |
| 110 | + | |
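The two request bodies above differ only in the target vector field, so a single helper can assemble either one. This is an illustrative sketch; `build_knn_query` is not a function in the script:

```python
def build_knn_query(field, vector, k=100, num_candidates=200):
    """Build an ES kNN request body of the shape shown above."""
    return {
        "knn": {
            "field": field,  # "embedding_name_zh" or "embedding_pic_h14.vector"
            "query_vector": vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["_id", "name_zh"],
        "size": k,
    }

# The same helper covers both the name-vector and image-vector queries.
body = build_knn_query("embedding_pic_h14.vector", [0.1] * 1024)
print(body["knn"]["field"])  # embedding_pic_h14.vector
```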
| 111 | +### 4. Generate the index files | |
| 112 | + | |
| 113 | +Two files are written to the `output/` directory: | |
| 114 | + | |
| 115 | +- `i2i_content_name_YYYYMMDD.txt`: similarity index based on name vectors | |
| 116 | +- `i2i_content_pic_YYYYMMDD.txt`: similarity index based on image vectors | |
| 117 | + | |
| 118 | +**File format:** | |
| 119 | +``` | |
| 120 | +item_id\titem_name\tsimilar_id1:score1,similar_id2:score2,... | |
| 121 | +``` | |
| 122 | + | |
| 123 | +**Example:** | |
| 124 | +``` | |
| 125 | +123456	香蕉干	234567:0.9234,345678:0.8756,456789:0.8432 | |
| 126 | +``` | |
| 127 | + | |
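Since each line is tab-separated with a comma-joined `id:score` list, the file can be read back with a few lines of Python. This reader is a sketch for consumers of the file, not part of the pipeline:

```python
def parse_index_line(line):
    """Parse one index line: item_id \t item_name \t id1:score1,id2:score2,..."""
    item_id, item_name, sims = line.rstrip("\n").split("\t")
    similar = []
    for entry in sims.split(","):
        # rsplit guards against a ':' ever appearing inside an ID
        sim_id, score = entry.rsplit(":", 1)
        similar.append((sim_id, float(score)))
    return item_id, item_name, similar

item_id, name, similar = parse_index_line("123456\t香蕉干\t234567:0.9234,345678:0.8756")
print(similar[0])  # ('234567', 0.9234)
```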
| 128 | +## Output | |
| 129 | + | |
| 130 | +### Redis key format | |
| 131 | + | |
| 132 | +#### Name-vector similarity | |
| 133 | +- **Key**: `item:similar:content_name:{item_id}` | |
| 134 | +- **Value**: `[[similar_id1,score1],[similar_id2,score2],...]` | |
| 135 | +- **TTL**: 30 days | |
| 136 | + | |
| 137 | +#### Image-vector similarity | |
| 138 | +- **Key**: `item:similar:content_pic:{item_id}` | |
| 139 | +- **Value**: `[[similar_id1,score1],[similar_id2,score2],...]` | |
| 140 | +- **TTL**: 30 days | |
| 141 | + | |
| 142 | +### Usage example | |
| 143 | + | |
| 144 | +```python | |
| 145 | +import redis | |
| 146 | +import json | |
| 147 | + | |
| 148 | +r = redis.Redis(host='localhost', port=6379, db=0) | |
| 149 | + | |
| 150 | +# fetch similar items by name vector | |
| 151 | +similar_items = json.loads(r.get('item:similar:content_name:123456')) | |
| 152 | +# returns: [[234567, 0.9234], [345678, 0.8756], ...] | |
| 153 | + | |
| 154 | +# fetch similar items by image vector | |
| 155 | +similar_items = json.loads(r.get('item:similar:content_pic:123456')) | |
| 156 | +# returns: [[567890, 0.8123], [678901, 0.7856], ...] | |
| 157 | +``` | |
| 158 | + | |
| 159 | +## Performance | |
| 160 | + | |
| 161 | +### Estimated runtime | |
| 162 | + | |
| 163 | +Assuming 50,000 active items: | |
| 164 | + | |
| 165 | +- ES vector fetches: ~50,000 queries at ~10ms each = 8-10 minutes | |
| 166 | +- kNN similarity queries: ~50,000 queries at ~50ms each = 40-50 minutes | |
| 167 | +- Total: roughly 50-60 minutes | |
| 168 | + | |
| 169 | +### Optimization tips | |
| 170 | + | |
| 171 | +If this is too slow: | |
| 172 | + | |
| 173 | +1. **Batching**: fetch vectors in bulk with ES `_mget` | |
| 174 | +2. **Concurrency**: use threads or async IO to parallelize queries | |
| 175 | +3. **Incremental updates**: only process new/changed items | |
| 176 | +4. **Caching**: cache ES vectors locally to avoid repeated queries | |
| 177 | + | |
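Tip 1 could be sketched as below, assuming the elasticsearch-py client (whose `mget` endpoint accepts an `ids` list); `fetch_vectors_bulk` and the batch size are illustrative and not part of the current script:

```python
from itertools import islice

def chunked(ids, size=500):
    """Split an iterable of item IDs into fixed-size batches."""
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def fetch_vectors_bulk(es, index, ids):
    """Fetch name vectors with one _mget round-trip per batch.

    `es` is an elasticsearch-py client; this function is only a sketch and
    is not invoked here, since it needs a live cluster.
    """
    vectors = {}
    for batch in chunked(ids):
        resp = es.mget(index=index, ids=[str(i) for i in batch])
        for doc in resp["docs"]:
            if doc.get("found"):
                vectors[doc["_id"]] = doc["_source"].get("embedding_name_zh")
    return vectors

print([len(b) for b in chunked(range(1050))])  # [500, 500, 50]
```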
| 178 | +## ES vector fields | |
| 179 | + | |
| 180 | +### embedding_name_zh | |
| 181 | + | |
| 182 | +- **Type**: `dense_vector` | |
| 183 | +- **Dimensions**: 1024 | |
| 184 | +- **Similarity**: `dot_product` | |
| 185 | +- **Purpose**: semantic vector of the item name | |
| 186 | + | |
| 187 | +### embedding_pic_h14 | |
| 188 | + | |
| 189 | +- **Type**: `nested` | |
| 190 | +- **Structure**: | |
| 191 | +  ```json | |
| 192 | +  [ | |
| 193 | +    { | |
| 194 | +      "vector": [1024-dim vector], | |
| 195 | +      "url": "<image URL>" | |
| 196 | +    } | |
| 197 | +  ] | |
| 198 | +  ``` | |
| 199 | +- **Similarity**: `dot_product` | |
| 200 | +- **Purpose**: visual vector of the item images | |
| 201 | + | |
| 202 | +## Notes | |
| 203 | + | |
| 204 | +1. **Network**: make sure the ES server is reachable | |
| 205 | +2. **Permissions**: the ES user needs query access | |
| 206 | +3. **Missing vectors**: items without vectors are skipped | |
| 207 | +4. **Vector format**: image vectors are nested; the first image's vector is used | |
| 208 | +5. **Self exclusion**: kNN results exclude the item itself | |
| 209 | + | |
| 210 | +## Troubleshooting | |
| 211 | + | |
| 212 | +### Cannot connect to ES | |
| 213 | + | |
| 214 | +```bash | |
| 215 | +# check the ES configuration | |
| 216 | +curl -u essa:4hOaLaf41y2VuI8y http://localhost:9200/_cat/indices | |
| 217 | +``` | |
| 218 | + | |
| 219 | +### Query timeouts | |
| 220 | + | |
| 221 | +Raise the timeout: | |
| 222 | +```python | |
| 223 | +es = Elasticsearch( | |
| 224 | +    [ES_CONFIG['host']], | |
| 225 | +    basic_auth=(ES_CONFIG['username'], ES_CONFIG['password']), | |
| 226 | +    request_timeout=60  # increase to 60 seconds | |
| 227 | +) | |
| 228 | +``` | |
| 229 | + | |
| 230 | +### Vector field missing | |
| 231 | + | |
| 232 | +Check the ES mapping: | |
| 233 | +```bash | |
| 234 | +curl -u essa:4hOaLaf41y2VuI8y http://localhost:9200/spu/_mapping | |
| 235 | +``` | |
| 236 | + | |
| 237 | +## 与其他相似度算法的对比 | |
| 238 | + | |
| 239 | +| 算法 | 数据源 | 优势 | 适用场景 | | |
| 240 | +|------|--------|------|---------| | |
| 241 | +| **Swing** | 用户行为 | 捕获真实交互关系 | 行为相似推荐 | | |
| 242 | +| **W2V** | 用户会话 | 捕获序列关系 | 下一个商品推荐 | | |
| 243 | +| **DeepWalk** | 行为图 | 发现深层关联 | 潜在兴趣挖掘 | | |
| 244 | +| **名称向量** | ES语义向量 | 语义理解强 | 文本相似推荐 | | |
| 245 | +| **图片向量** | ES视觉向量 | 视觉相似性强 | 外观相似推荐 | | |
| 246 | + | |
| 247 | +## 更新频率建议 | |
| 248 | + | |
| 249 | +- **名称向量相似**: 每周更新(商品名称变化少) | |
| 250 | +- **图片向量相似**: 每周更新(商品图片变化少) | |
| 251 | +- **Redis TTL**: 30天(内容相似度变化慢) | |
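Both indexes come out of the same script, so one scheduled run covers them. A possible crontab entry for the weekly refresh (illustrative; the log path is an assumption):

```
# Every Monday at 03:00: regenerate both content-similarity indexes
0 3 * * 1  cd /home/tw/recommendation/offline_tasks && python scripts/i2i_content_similar.py >> /tmp/i2i_content.log 2>&1
```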
| 252 | + | ... | ... |
offline_tasks/scripts/i2i_content_similar.py
| 1 | 1 | """ |
| 2 | -i2i - content-similarity index | |
| 3 | -Computes item similarity from item attributes (category, supplier, attributes, etc.) | |
| 2 | +i2i - content-similarity index based on ES vectors | |
| 3 | +Fetches item vectors from Elasticsearch and computes two similarities: | |
| 4 | +1. similarity based on name text vectors | |
| 5 | +2. similarity based on image vectors | |
| 4 | 6 | """ |
| 5 | 7 | import sys |
| 6 | 8 | import os |
| 7 | 9 | sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))) |
| 8 | 10 | |
| 11 | +import json | |
| 9 | 12 | import pandas as pd |
| 10 | -import numpy as np | |
| 11 | -import argparse | |
| 12 | -from datetime import datetime | |
| 13 | -from collections import defaultdict | |
| 14 | -from sklearn.feature_extraction.text import TfidfVectorizer | |
| 15 | -from sklearn.metrics.pairwise import cosine_similarity | |
| 13 | +from datetime import datetime, timedelta | |
| 14 | +from elasticsearch import Elasticsearch | |
| 16 | 15 | from db_service import create_db_connection |
| 17 | -from offline_tasks.config.offline_config import ( | |
| 18 | - DB_CONFIG, OUTPUT_DIR, DEFAULT_I2I_TOP_N | |
| 19 | -) | |
| 20 | -from offline_tasks.scripts.debug_utils import ( | |
| 21 | - setup_debug_logger, log_dataframe_info, log_dict_stats, | |
| 22 | - save_readable_index, fetch_name_mappings, log_algorithm_params, | |
| 23 | - log_processing_step | |
| 24 | -) | |
| 16 | +from offline_tasks.config.offline_config import DB_CONFIG, OUTPUT_DIR | |
| 17 | +from offline_tasks.scripts.debug_utils import setup_debug_logger, log_processing_step | |
| 25 | 18 | |
| 19 | +# ES settings | |
| 20 | +ES_CONFIG = { | |
| 21 | + 'host': 'http://localhost:9200', | |
| 22 | + 'index_name': 'spu', | |
| 23 | + 'username': 'essa', | |
| 24 | + 'password': '4hOaLaf41y2VuI8y' | |
| 25 | +} | |
| 26 | 26 | |
| 27 | -def fetch_product_features(engine): | |
| 27 | +# Algorithm parameters | |
| 28 | +TOP_N = 50  # number of similar items returned per item | |
| 29 | +KNN_K = 100  # number of candidates returned by the kNN query | |
| 30 | +KNN_CANDIDATES = 200  # size of the kNN candidate pool | |
| 31 | + | |
| 32 | + | |
| 33 | +def get_active_items(engine): | |
| 28 | 34 | """ |
| 29 | -    Fetch item feature data | |
| 35 | +    Return the list of items with user activity in the past year | |
| 30 | 36 | """ |
| 31 | - sql_query = """ | |
| 32 | - SELECT | |
| 33 | - pgs.id as item_id, | |
| 34 | - pgs.name as item_name, | |
| 35 | - pg.supplier_id, | |
| 36 | - ss.name as supplier_name, | |
| 37 | - pg.category_id, | |
| 38 | - pc_1.id as category_level1_id, | |
| 39 | - pc_1.name as category_level1, | |
| 40 | - pc_2.id as category_level2_id, | |
| 41 | - pc_2.name as category_level2, | |
| 42 | - pc_3.id as category_level3_id, | |
| 43 | - pc_3.name as category_level3, | |
| 44 | - pc_4.id as category_level4_id, | |
| 45 | - pc_4.name as category_level4, | |
| 46 | - pgs.capacity, | |
| 47 | - pgs.factory_no, | |
| 48 | - po.name as package_type, | |
| 49 | - po2.name as package_mode, | |
| 50 | - pgs.fir_on_sell_time, | |
| 51 | - pgs.status | |
| 52 | - FROM prd_goods_sku pgs | |
| 53 | - INNER JOIN prd_goods pg ON pg.id = pgs.goods_id | |
| 54 | - INNER JOIN sup_supplier ss ON ss.id = pg.supplier_id | |
| 55 | - LEFT JOIN prd_category as pc ON pc.id = pg.category_id | |
| 56 | - LEFT JOIN prd_category AS pc_1 ON pc_1.id = SUBSTRING_INDEX(SUBSTRING_INDEX(pc.path, '.', 2), '.', -1) | |
| 57 | - LEFT JOIN prd_category AS pc_2 ON pc_2.id = SUBSTRING_INDEX(SUBSTRING_INDEX(pc.path, '.', 3), '.', -1) | |
| 58 | - LEFT JOIN prd_category AS pc_3 ON pc_3.id = SUBSTRING_INDEX(SUBSTRING_INDEX(pc.path, '.', 4), '.', -1) | |
| 59 | - LEFT JOIN prd_category AS pc_4 ON pc_4.id = SUBSTRING_INDEX(SUBSTRING_INDEX(pc.path, '.', 5), '.', -1) | |
| 60 | - LEFT JOIN prd_goods_sku_attribute pgsa ON pgs.id = pgsa.goods_sku_id | |
| 61 | - AND pgsa.attribute_id = (SELECT id FROM prd_attribute WHERE code = 'PKG' LIMIT 1) | |
| 62 | - LEFT JOIN prd_option po ON po.id = pgsa.option_id | |
| 63 | - LEFT JOIN prd_goods_sku_attribute pgsa2 ON pgs.id = pgsa2.goods_sku_id | |
| 64 | - AND pgsa2.attribute_id = (SELECT id FROM prd_attribute WHERE code = 'pkg_mode' LIMIT 1) | |
| 65 | - LEFT JOIN prd_option po2 ON po2.id = pgsa2.option_id | |
| 66 | - WHERE pgs.status IN (2, 4, 5) | |
| 67 | - AND pgs.is_delete = 0 | |
| 37 | + one_year_ago = (datetime.now() - timedelta(days=365)).strftime('%Y-%m-%d') | |
| 38 | + | |
| 39 | + sql_query = f""" | |
| 40 | + SELECT DISTINCT | |
| 41 | + se.item_id | |
| 42 | + FROM | |
| 43 | + sensors_events se | |
| 44 | + WHERE | |
| 45 | + se.event IN ('click', 'contactFactory', 'addToPool', 'addToCart', 'purchase') | |
| 46 | + AND se.create_time >= '{one_year_ago}' | |
| 47 | + AND se.item_id IS NOT NULL | |
| 68 | 48 | """ |
| 69 | 49 | |
| 70 | - print("Executing SQL query...") | |
| 71 | 50 | df = pd.read_sql(sql_query, engine) |
| 72 | - print(f"Fetched {len(df)} products") | |
| 73 | - return df | |
| 51 | + return df['item_id'].tolist() | |
| 74 | 52 | |
| 75 | 53 | |
| 76 | -def build_feature_text(row): | |
| 77 | - """ | |
| 78 | - 构建商品的特征文本 | |
| 54 | +def connect_es(): | |
| 55 | +    """Connect to Elasticsearch""" | |
| 56 | + es = Elasticsearch( | |
| 57 | + [ES_CONFIG['host']], | |
| 58 | + basic_auth=(ES_CONFIG['username'], ES_CONFIG['password']), | |
| 59 | + verify_certs=False, | |
| 60 | + request_timeout=30 | |
| 61 | + ) | |
| 62 | + return es | |
| 63 | + | |
| 64 | + | |
| 65 | +def get_item_vectors(es, item_id): | |
| 79 | 66 | """ |
| 80 | - features = [] | |
| 81 | - | |
| 82 | - # 添加分类信息(权重最高,重复多次) | |
| 83 | - if pd.notna(row['category_level1']): | |
| 84 | - features.extend([str(row['category_level1'])] * 5) | |
| 85 | - if pd.notna(row['category_level2']): | |
| 86 | - features.extend([str(row['category_level2'])] * 4) | |
| 87 | - if pd.notna(row['category_level3']): | |
| 88 | - features.extend([str(row['category_level3'])] * 3) | |
| 89 | - if pd.notna(row['category_level4']): | |
| 90 | - features.extend([str(row['category_level4'])] * 2) | |
| 67 | +    Fetch an item's vector data from ES | |
| 91 | 68 | |
| 92 | - # 添加供应商信息 | |
| 93 | - if pd.notna(row['supplier_name']): | |
| 94 | - features.extend([str(row['supplier_name'])] * 2) | |
| 95 | - | |
| 96 | - # 添加包装信息 | |
| 97 | - if pd.notna(row['package_type']): | |
| 98 | - features.append(str(row['package_type'])) | |
| 99 | - if pd.notna(row['package_mode']): | |
| 100 | - features.append(str(row['package_mode'])) | |
| 101 | - | |
| 102 | - # 添加商品名称的关键词(简单分词) | |
| 103 | - if pd.notna(row['item_name']): | |
| 104 | - name_words = str(row['item_name']).split() | |
| 105 | - features.extend(name_words) | |
| 106 | - | |
| 107 | - return ' '.join(features) | |
| 69 | + Returns: | |
| 70 | + dict with keys: _id, name_zh, embedding_name_zh, embedding_pic_h14 | |
| 71 | +        or None if not found | |
| 72 | + """ | |
| 73 | + try: | |
| 74 | + response = es.search( | |
| 75 | + index=ES_CONFIG['index_name'], | |
| 76 | + body={ | |
| 77 | + "query": { | |
| 78 | + "term": { | |
| 79 | + "_id": str(item_id) | |
| 80 | + } | |
| 81 | + }, | |
| 82 | + "_source": { | |
| 83 | + "includes": ["_id", "name_zh", "embedding_name_zh", "embedding_pic_h14"] | |
| 84 | + } | |
| 85 | + } | |
| 86 | + ) | |
| 87 | + | |
| 88 | + if response['hits']['hits']: | |
| 89 | + hit = response['hits']['hits'][0] | |
| 90 | + return { | |
| 91 | + '_id': hit['_id'], | |
| 92 | + 'name_zh': hit['_source'].get('name_zh', ''), | |
| 93 | + 'embedding_name_zh': hit['_source'].get('embedding_name_zh'), | |
| 94 | + 'embedding_pic_h14': hit['_source'].get('embedding_pic_h14') | |
| 95 | + } | |
| 96 | + return None | |
| 97 | +    except Exception:  # swallow lookup errors; treat the item as having no vectors | |
| 98 | +        return None | |
| 108 | 99 | |
| 109 | 100 | |
| 110 | -def calculate_content_similarity(df, top_n=50, logger=None): | |
| 111 | - """ | |
| 112 | - 基于内容计算相似度(内存优化版) | |
| 101 | +def find_similar_by_vector(es, vector, field_name, k=KNN_K, num_candidates=KNN_CANDIDATES): | |
| 113 | 102 | """ |
| 103 | +    Find similar items with a kNN query | |
| 114 | 104 | |
| 115 | - if logger: | |
| 116 | - logger.info("构建特征文本...") | |
| 117 | - else: | |
| 118 | - print("Building feature texts...") | |
| 119 | - df['feature_text'] = df.apply(build_feature_text, axis=1) | |
| 105 | +    Args: | |
| 106 | +        es: Elasticsearch client | |
| 107 | +        vector: query vector | |
| 108 | +        field_name: vector field name (embedding_name_zh or embedding_pic_h14.vector) | |
| 109 | +        k: number of results to return | |
| 110 | +        num_candidates: size of the candidate pool | |
| 120 | 111 | |
| 121 | - if logger: | |
| 122 | - logger.info("计算 TF-IDF...") | |
| 123 | - else: | |
| 124 | - print("Calculating TF-IDF...") | |
| 125 | - vectorizer = TfidfVectorizer(max_features=1000) | |
| 126 | - tfidf_matrix = vectorizer.fit_transform(df['feature_text']) | |
| 127 | - | |
| 128 | - if logger: | |
| 129 | - logger.info(f"TF-IDF 矩阵形状: {tfidf_matrix.shape}") | |
| 130 | - logger.info("开始计算余弦相似度(内存优化模式)...") | |
| 131 | - else: | |
| 132 | - print("Calculating cosine similarity...") | |
| 133 | - | |
| 134 | - batch_size = 1000 | |
| 135 | - result = {} | |
| 136 | - | |
| 137 | - for i in range(0, len(df), batch_size): | |
| 138 | - end_i = min(i + batch_size, len(df)) | |
| 139 | - | |
| 140 | - # 分批计算相似度 | |
| 141 | - batch_similarity = cosine_similarity(tfidf_matrix[i:end_i], tfidf_matrix) | |
| 112 | + Returns: | |
| 113 | + List of (item_id, score) tuples | |
| 114 | + """ | |
| 115 | + try: | |
| 116 | + response = es.search( | |
| 117 | + index=ES_CONFIG['index_name'], | |
| 118 | + body={ | |
| 119 | + "knn": { | |
| 120 | + "field": field_name, | |
| 121 | + "query_vector": vector, | |
| 122 | + "k": k, | |
| 123 | + "num_candidates": num_candidates | |
| 124 | + }, | |
| 125 | + "_source": ["_id", "name_zh"], | |
| 126 | + "size": k | |
| 127 | + } | |
| 128 | + ) | |
| 142 | 129 | |
| 143 | - for j, idx in enumerate(range(i, end_i)): | |
| 144 | - item_id = df.iloc[idx]['item_id'] | |
| 145 | - similarities = batch_similarity[j] | |
| 146 | - | |
| 147 | - # 获取最相似的top_n个(排除自己) | |
| 148 | - similar_indices = np.argsort(similarities)[::-1][1:top_n+1] | |
| 149 | - similar_items = [] | |
| 150 | - | |
| 151 | - for sim_idx in similar_indices: | |
| 152 | - if similarities[sim_idx] > 0: # 只保留有相似度的 | |
| 153 | - similar_items.append(( | |
| 154 | - df.iloc[sim_idx]['item_id'], | |
| 155 | - float(similarities[sim_idx]) | |
| 156 | - )) | |
| 157 | - | |
| 158 | - if similar_items: | |
| 159 | - result[item_id] = similar_items | |
| 160 | - | |
| 161 | - if logger: | |
| 162 | - logger.info(f"已处理 {end_i}/{len(df)} 个商品...") | |
| 163 | - else: | |
| 164 | - print(f"Processed {end_i}/{len(df)} products...") | |
| 165 | - | |
| 166 | - return result | |
| 130 | + results = [] | |
| 131 | + for hit in response['hits']['hits']: | |
| 132 | + results.append(( | |
| 133 | + hit['_id'], | |
| 134 | + hit['_score'], | |
| 135 | + hit['_source'].get('name_zh', '') | |
| 136 | + )) | |
| 137 | + return results | |
| 138 | +    except Exception:  # on query failure, fall back to no similar items | |
| 139 | +        return [] | |
| 167 | 140 | |
| 168 | 141 | |
| 169 | -def calculate_category_based_similarity(df): | |
| 142 | +def generate_similarity_index(es, active_items, vector_field, field_name, logger): | |
| 170 | 143 | """ |
| 171 | -    Category-based similarity (items within the same category) | |
| 144 | +    Build a similarity index for one vector type | |
| 145 | + | |
| 146 | +    Args: | |
| 147 | +        es: Elasticsearch client | |
| 148 | +        active_items: list of active item IDs | |
| 149 | +        vector_field: vector field name (embedding_name_zh or embedding_pic_h14) | |
| 150 | +        field_name: short field label (name or pic) | |
| 151 | +        logger: logger instance | |
| 152 | + | |
| 153 | + Returns: | |
| 154 | + dict: {item_id: [(similar_id, score, name), ...]} | |
| 172 | 155 | """ |
| 173 | - result = defaultdict(list) | |
| 156 | + result = {} | |
| 157 | + total = len(active_items) | |
| 174 | 158 | |
| 175 | - # 按四级类目分组 | |
| 176 | - for cat4_id, group in df.groupby('category_level4_id'): | |
| 177 | - if pd.isna(cat4_id) or len(group) < 2: | |
| 159 | + for idx, item_id in enumerate(active_items): | |
| 160 | + if (idx + 1) % 100 == 0: | |
| 161 | +            logger.info(f"Progress: {idx + 1}/{total} ({(idx + 1) / total * 100:.1f}%)") | |
| 162 | + | |
| 163 | +        # fetch this item's vectors | |
| 164 | + item_data = get_item_vectors(es, item_id) | |
| 165 | + if not item_data: | |
| 178 | 166 | continue |
| 179 | 167 | |
| 180 | - items = group['item_id'].tolist() | |
| 181 | - for item_id in items: | |
| 182 | - other_items = [x for x in items if x != item_id] | |
| 183 | - # 同四级类目的商品相似度设为0.9 | |
| 184 | - result[item_id].extend([(x, 0.9) for x in other_items[:50]]) | |
| 185 | - | |
| 186 | - # 按三级类目分组(补充) | |
| 187 | - for cat3_id, group in df.groupby('category_level3_id'): | |
| 188 | - if pd.isna(cat3_id) or len(group) < 2: | |
| 168 | +        # extract the query vector | |
| 169 | + if vector_field == 'embedding_name_zh': | |
| 170 | + query_vector = item_data.get('embedding_name_zh') | |
| 171 | + elif vector_field == 'embedding_pic_h14': | |
| 172 | + pic_data = item_data.get('embedding_pic_h14') | |
| 173 | + if pic_data and isinstance(pic_data, list) and len(pic_data) > 0: | |
| 174 | + query_vector = pic_data[0].get('vector') if isinstance(pic_data[0], dict) else None | |
| 175 | + else: | |
| 176 | + query_vector = None | |
| 177 | + else: | |
| 178 | + query_vector = None | |
| 179 | + | |
| 180 | + if not query_vector: | |
| 189 | 181 | continue |
| 190 | 182 | |
| 191 | - items = group['item_id'].tolist() | |
| 192 | - for item_id in items: | |
| 193 | - if item_id not in result or len(result[item_id]) < 50: | |
| 194 | - other_items = [x for x in items if x != item_id] | |
| 195 | - # 同三级类目的商品相似度设为0.7 | |
| 196 | - existing = {x[0] for x in result[item_id]} | |
| 197 | - new_items = [(x, 0.7) for x in other_items if x not in existing] | |
| 198 | - result[item_id].extend(new_items[:50 - len(result[item_id])]) | |
| 183 | +        # kNN query for similar items (the item itself must be excluded) | |
| 184 | + knn_field = f"{vector_field}.vector" if vector_field == 'embedding_pic_h14' else vector_field | |
| 185 | + similar_items = find_similar_by_vector(es, query_vector, knn_field) | |
| 186 | + | |
| 187 | +        # drop the item itself and keep only the top N | |
| 188 | + filtered_items = [] | |
| 189 | + for sim_id, score, name in similar_items: | |
| 190 | + if sim_id != str(item_id): | |
| 191 | + filtered_items.append((sim_id, score, name)) | |
| 192 | + if len(filtered_items) >= TOP_N: | |
| 193 | + break | |
| 194 | + | |
| 195 | + if filtered_items: | |
| 196 | + result[item_id] = filtered_items | |
| 199 | 197 | |
| 200 | 198 | return result |
| 201 | 199 | |
| 202 | 200 | |
| 203 | -def merge_similarities(sim1, sim2, weight1=0.7, weight2=0.3): | |
| 201 | +def save_index_file(result, es, output_file, logger): | |
| 204 | 202 | """ |
| 205 | -    Merge two similarity sources | |
| 203 | +    Save the index file | |
| 204 | + | |
| 205 | +    Format: item_id \t item_name \t similar_id1:score1,similar_id2:score2,... | |
| 206 | 206 | """ |
| 207 | - result = {} | |
| 208 | - all_items = set(sim1.keys()) | set(sim2.keys()) | |
| 207 | +    logger.info(f"Saving index to: {output_file}") | |
| 209 | 208 | |
| 210 | - for item_id in all_items: | |
| 211 | - similarities = defaultdict(float) | |
| 212 | - | |
| 213 | - # 添加第一种相似度 | |
| 214 | - if item_id in sim1: | |
| 215 | - for similar_id, score in sim1[item_id]: | |
| 216 | - similarities[similar_id] += score * weight1 | |
| 217 | - | |
| 218 | - # 添加第二种相似度 | |
| 219 | - if item_id in sim2: | |
| 220 | - for similar_id, score in sim2[item_id]: | |
| 221 | - similarities[similar_id] += score * weight2 | |
| 222 | - | |
| 223 | - # 排序并取top N | |
| 224 | - sorted_sims = sorted(similarities.items(), key=lambda x: -x[1])[:50] | |
| 225 | - if sorted_sims: | |
| 226 | - result[item_id] = sorted_sims | |
| 209 | + with open(output_file, 'w', encoding='utf-8') as f: | |
| 210 | + for item_id, similar_items in result.items(): | |
| 211 | + if not similar_items: | |
| 212 | + continue | |
| 213 | + | |
| 214 | +            # look up the current item's name | |
| 215 | + item_data = get_item_vectors(es, item_id) | |
| 216 | + item_name = item_data.get('name_zh', 'Unknown') if item_data else 'Unknown' | |
| 217 | + | |
| 218 | +            # format the similar-item list | |
| 219 | + sim_str = ','.join([f'{sim_id}:{score:.4f}' for sim_id, score, _ in similar_items]) | |
| 220 | + f.write(f'{item_id}\t{item_name}\t{sim_str}\n') | |
| 227 | 221 | |
| 228 | - return result | |
| 222 | +    logger.info(f"Index saved; {len(result)} items in total") | |
| 229 | 223 | |
| 230 | 224 | |
| 231 | 225 | def main(): |
| 232 | - parser = argparse.ArgumentParser(description='Calculate content-based item similarity') | |
| 233 | - parser.add_argument('--top_n', type=int, default=DEFAULT_I2I_TOP_N, | |
| 234 | - help=f'Top N similar items to output (default: {DEFAULT_I2I_TOP_N})') | |
| 235 | - parser.add_argument('--method', type=str, default='hybrid', | |
| 236 | - choices=['tfidf', 'category', 'hybrid'], | |
| 237 | - help='Similarity calculation method') | |
| 238 | - parser.add_argument('--output', type=str, default=None, | |
| 239 | - help='Output file path') | |
| 240 | - parser.add_argument('--debug', action='store_true', | |
| 241 | - help='Enable debug mode with detailed logging and readable output') | |
| 242 | - | |
| 243 | - args = parser.parse_args() | |
| 244 | - | |
| 226 | +    """Entry point""" | |
| 245 | 227 | # set up logger
| 246 | - logger = setup_debug_logger('i2i_content_similar', debug=args.debug) | |
| 228 | + logger = setup_debug_logger('i2i_content_similar', debug=True) | |
| 247 | 229 | |
| 248 | - # 记录算法参数 | |
| 249 | - params = { | |
| 250 | - 'top_n': args.top_n, | |
| 251 | - 'method': args.method, | |
| 252 | - 'debug': args.debug | |
| 253 | - } | |
| 254 | - log_algorithm_params(logger, params) | |
| 230 | + logger.info("="*80) | |
| 231 | +    logger.info("Generating ES-vector content-similarity indexes") | |
| 232 | +    logger.info(f"ES host: {ES_CONFIG['host']}") | |
| 233 | +    logger.info(f"ES index: {ES_CONFIG['index_name']}") | |
| 234 | + logger.info(f"Top N: {TOP_N}") | |
| 235 | + logger.info("="*80) | |
| 255 | 236 | |
| 256 | 237 | # create database connection
| 257 | -    logger.info("Connecting to database...") | |
| 238 | +    log_processing_step(logger, "Connect to database") | |
| 258 | 239 | engine = create_db_connection( |
| 259 | 240 | DB_CONFIG['host'], |
| 260 | 241 | DB_CONFIG['port'], |
| ... | ... | @@ -263,72 +244,41 @@ def main(): |
| 263 | 244 | DB_CONFIG['password'] |
| 264 | 245 | ) |
| 265 | 246 | |
| 266 | - # 获取商品特征 | |
| 267 | - log_processing_step(logger, "获取商品特征") | |
| 268 | - df = fetch_product_features(engine) | |
| 269 | - logger.info(f"获取到 {len(df)} 个商品的特征数据") | |
| 270 | - log_dataframe_info(logger, df, "商品特征数据") | |
| 247 | +    # fetch active items | |
| 248 | +    log_processing_step(logger, "Fetch items with activity in the past year") | |
| 249 | +    active_items = get_active_items(engine) | |
| 250 | +    logger.info(f"Found {len(active_items)} active items") | |
| 271 | 251 | |
| 272 | - # 计算相似度 | |
| 273 | - log_processing_step(logger, f"计算相似度 (方法: {args.method})") | |
| 274 | - if args.method == 'tfidf': | |
| 275 | - logger.info("使用 TF-IDF 方法...") | |
| 276 | - result = calculate_content_similarity(df, args.top_n, logger=logger) | |
| 277 | - elif args.method == 'category': | |
| 278 | - logger.info("使用基于分类的方法...") | |
| 279 | - result = calculate_category_based_similarity(df) | |
| 280 | - else: # hybrid | |
| 281 | - logger.info("使用混合方法 (TF-IDF 70% + 分类 30%)...") | |
| 282 | - tfidf_sim = calculate_content_similarity(df, args.top_n, logger=logger) | |
| 283 | - logger.info("计算基于分类的相似度...") | |
| 284 | - category_sim = calculate_category_based_similarity(df) | |
| 285 | - logger.info("合并相似度...") | |
| 286 | - result = merge_similarities(tfidf_sim, category_sim, weight1=0.7, weight2=0.3) | |
| 252 | +    # connect to ES | |
| 253 | +    log_processing_step(logger, "Connect to Elasticsearch") | |
| 254 | +    es = connect_es() | |
| 255 | +    logger.info("ES connection established") | |
| 287 | 256 | |
| 288 | - logger.info(f"为 {len(result)} 个物品生成了相似度") | |
| 257 | +    # generate both similarity indexes | |
| 258 | + date_str = datetime.now().strftime("%Y%m%d") | |
| 289 | 259 | |
| 290 | - # Write the results | |
| 291 | - log_processing_step(logger, "Save results") | |
| 292 | - output_file = args.output or os.path.join( | |
| 293 | - OUTPUT_DIR, | |
| 294 | - f'i2i_content_{args.method}_{datetime.now().strftime("%Y%m%d")}.txt' | |
| 260 | + # 1. Name-text vectors | |
| 261 | + log_processing_step(logger, "Generate similarity index from name-text vectors") | |
| 262 | + name_result = generate_similarity_index( | |
| 263 | + es, active_items, 'embedding_name_zh', 'name', logger | |
| 295 | 264 | ) |
| 265 | + name_output = os.path.join(OUTPUT_DIR, f'i2i_content_name_{date_str}.txt') | |
| 266 | + save_index_file(name_result, es, name_output, logger) | |
| 296 | 267 | |
| 297 | - # Fetch name mappings | |
| 298 | - name_mappings = {} | |
| 299 | - if args.debug: | |
| 300 | - logger.info("Fetching item name mappings...") | |
| 301 | - name_mappings = fetch_name_mappings(engine, debug=True) | |
| 302 | - | |
| 303 | - logger.info(f"Writing results to {output_file}...") | |
| 304 | - with open(output_file, 'w', encoding='utf-8') as f: | |
| 305 | - for item_id, sims in result.items(): | |
| 306 | - # Look up the name via name_mappings | |
| 307 | - item_name = name_mappings.get(item_id, 'Unknown') | |
| 308 | - if item_name == 'Unknown' and 'item_name' in df.columns: | |
| 309 | - item_name = df[df['item_id'] == item_id]['item_name'].iloc[0] if len(df[df['item_id'] == item_id]) > 0 else 'Unknown' | |
| 310 | - | |
| 311 | - if not sims: | |
| 312 | - continue | |
| 313 | - | |
| 314 | - # Format: item_id \t item_name \t similar_item_id1:score1,similar_item_id2:score2,... | |
| 315 | - sim_str = ','.join([f'{sim_id}:{score:.4f}' for sim_id, score in sims]) | |
| 316 | - f.write(f'{item_id}\t{item_name}\t{sim_str}\n') | |
| 317 | - | |
| 318 | - logger.info(f"Done! Generated content-based similarities for {len(result)} items") | |
| 319 | - logger.info(f"Output saved to: {output_file}") | |
| 268 | + # 2. Picture vectors | |
| 269 | + log_processing_step(logger, "Generate similarity index from picture vectors") | |
| 270 | + pic_result = generate_similarity_index( | |
| 271 | + es, active_items, 'embedding_pic_h14', 'pic', logger | |
| 272 | + ) | |
| 273 | + pic_output = os.path.join(OUTPUT_DIR, f'i2i_content_pic_{date_str}.txt') | |
| 274 | + save_index_file(pic_result, es, pic_output, logger) | |
| 320 | 275 | |
| 321 | - # In debug mode, also save a human-readable version | |
| 322 | - if args.debug: | |
| 323 | - log_processing_step(logger, "Save debug-readable format") | |
| 324 | - save_readable_index( | |
| 325 | - output_file, | |
| 326 | - result, | |
| 327 | - name_mappings, | |
| 328 | - description=f'i2i:content:{args.method}' | |
| 329 | - ) | |
| 276 | + logger.info("="*80) | |
| 277 | + logger.info("Done! Generated two content-similarity indices:") | |
| 278 | + logger.info(f" 1. Name-vector index: {name_output} ({len(name_result)} items)") | |
| 279 | + logger.info(f" 2. Picture-vector index: {pic_output} ({len(pic_result)} items)") | |
| 280 | + logger.info("="*80) | |
| 330 | 281 | |
| 331 | 282 | |
| 332 | 283 | if __name__ == '__main__': |
| 333 | 284 | main() |
| 334 | - | ... | ... |
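The rewritten `main()` above delegates to `generate_similarity_index(es, active_items, field, label, logger)`, whose body falls outside this hunk. Below is a rough sketch of how such a helper could sit on top of ES 8.x KNN search; the internals here are an assumption for illustration, not the committed implementation (the `spu` index name is taken from the test script in this commit, and `build_knn_body` is a hypothetical helper):

```python
# Sketch (assumption): one possible shape of generate_similarity_index on ES KNN search.

def build_knn_body(field, query_vector, k=50, num_candidates=200):
    """Build an ES 8.x KNN search body for one item's vector."""
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": False,  # only _id and _score are needed for the index file
        "size": k,
    }

def generate_similarity_index(es, active_items, field, label, logger, top_n=50):
    """For each active item, KNN-search its own vector and keep the top_n neighbors."""
    result = {}
    for item_id in active_items:
        doc = es.get(index="spu", id=item_id)
        # Note: the nested picture field would need unwrapping to a plain vector
        # and the KNN field name "embedding_pic_h14.vector", as the test script shows.
        vector = doc["_source"].get(field)
        if not vector:
            continue  # item has no vector for this field
        resp = es.search(index="spu", body=build_knn_body(field, vector, k=top_n + 1))
        # Drop the item itself; keep (neighbor_id, score) pairs
        sims = [(h["_id"], h["_score"]) for h in resp["hits"]["hits"] if h["_id"] != item_id]
        result[item_id] = sims[:top_n]
    return result
```

One KNN request per item keeps the code simple; batching via `msearch` would cut round-trips if throughput matters.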
offline_tasks/scripts/load_index_to_redis.py
| ... | ... | @@ -81,7 +81,7 @@ def load_i2i_indices(redis_client, date_str=None, expire_days=7): |
| 81 | 81 | expire_seconds = expire_days * 24 * 3600 if expire_days else None |
| 82 | 82 | |
| 83 | 83 | # i2i index types |
| 84 | - i2i_types = ['swing', 'session_w2v', 'deepwalk'] | |
| 84 | + i2i_types = ['swing', 'session_w2v', 'deepwalk', 'content_name', 'content_pic'] | |
| 85 | 85 | |
| 86 | 86 | for i2i_type in i2i_types: |
| 87 | 87 | file_path = os.path.join(OUTPUT_DIR, f'i2i_{i2i_type}_{date_str}.txt') | ... | ... |
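The loader now picks up five i2i files per date. Assuming the new files keep the tab-separated layout the old writer used (`item_id \t item_name \t sim_id:score,...`, visible in the removed writer code above), parsing one line could be factored into a small pure helper; the function name here is hypothetical:

```python
# Hypothetical helper: parse one line of an i2i index file.
# Assumes the tab-separated format: item_id \t item_name \t sim_id:score,sim_id:score,...

def parse_index_line(line):
    """Return (item_id, item_name, [(sim_id, score), ...]) for one index-file line."""
    item_id, item_name, sim_str = line.rstrip("\n").split("\t")
    sims = []
    for pair in sim_str.split(","):
        sim_id, score = pair.rsplit(":", 1)  # rsplit guards against ':' inside IDs
        sims.append((sim_id, float(score)))
    return item_id, item_name, sims
```

A loader could then write each parsed entry through a Redis pipeline with the configured `expire_seconds`.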
| ... | ... | @@ -0,0 +1,266 @@ |
| 1 | +""" | |
| 2 | +Test the Elasticsearch connection and vector queries | |
| 3 | +Used to verify that the ES config and vector fields are correct | |
| 4 | +""" | |
| 5 | +import sys | |
| 6 | +import os | |
| 7 | +sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))) | |
| 8 | + | |
| 9 | +from elasticsearch import Elasticsearch | |
| 10 | +import json | |
| 11 | + | |
| 12 | +# ES config | |
| 13 | +ES_CONFIG = { | |
| 14 | + 'host': 'http://localhost:9200', | |
| 15 | + 'index_name': 'spu', | |
| 16 | + 'username': 'essa', | |
| 17 | + 'password': '4hOaLaf41y2VuI8y' | |
| 18 | +} | |
| 19 | + | |
| 20 | + | |
| 21 | +def test_connection(): | |
| 22 | + """Test the ES connection""" | |
| 23 | + print("="*80) | |
| 24 | + print("Testing Elasticsearch connection") | |
| 25 | + print("="*80) | |
| 26 | + | |
| 27 | + try: | |
| 28 | + es = Elasticsearch( | |
| 29 | + [ES_CONFIG['host']], | |
| 30 | + basic_auth=(ES_CONFIG['username'], ES_CONFIG['password']), | |
| 31 | + verify_certs=False, | |
| 32 | + request_timeout=30 | |
| 33 | + ) | |
| 34 | + | |
| 35 | + # Probe the cluster | |
| 36 | + info = es.info() | |
| 37 | + print("✓ ES connection succeeded!") | |
| 38 | + print(f" Cluster name: {info['cluster_name']}") | |
| 39 | + print(f" Version: {info['version']['number']}") | |
| 40 | + | |
| 41 | + return es | |
| 42 | + except Exception as e: | |
| 43 | + print(f"✗ ES connection failed: {e}") | |
| 44 | + return None | |
| 45 | + | |
| 46 | + | |
| 47 | +def test_index_exists(es): | |
| 48 | + """Check whether the index exists""" | |
| 49 | + print("\n" + "="*80) | |
| 50 | + print("Checking whether the index exists") | |
| 51 | + print("="*80) | |
| 52 | + | |
| 53 | + try: | |
| 54 | + exists = es.indices.exists(index=ES_CONFIG['index_name']) | |
| 55 | + if exists: | |
| 56 | + print(f"✓ Index '{ES_CONFIG['index_name']}' exists") | |
| 57 | + | |
| 58 | + # Fetch index stats | |
| 59 | + stats = es.count(index=ES_CONFIG['index_name']) | |
| 60 | + print(f" Document count: {stats['count']}") | |
| 61 | + else: | |
| 62 | + print(f"✗ Index '{ES_CONFIG['index_name']}' does not exist") | |
| 63 | + return False | |
| 64 | + | |
| 65 | + return True | |
| 66 | + except Exception as e: | |
| 67 | + print(f"✗ Index check failed: {e}") | |
| 68 | + return False | |
| 69 | + | |
| 70 | + | |
| 71 | +def test_mapping(es): | |
| 72 | + """Check the vector field mappings""" | |
| 73 | + print("\n" + "="*80) | |
| 74 | + print("Checking vector field mappings") | |
| 75 | + print("="*80) | |
| 76 | + | |
| 77 | + try: | |
| 78 | + mapping = es.indices.get_mapping(index=ES_CONFIG['index_name']) | |
| 79 | + properties = mapping[ES_CONFIG['index_name']]['mappings']['properties'] | |
| 80 | + | |
| 81 | + # Check the key fields | |
| 82 | + fields_to_check = ['name_zh', 'embedding_name_zh', 'embedding_pic_h14'] | |
| 83 | + | |
| 84 | + for field in fields_to_check: | |
| 85 | + if field in properties: | |
| 86 | + field_type = properties[field].get('type', properties[field]) | |
| 87 | + print(f"✓ Field '{field}' exists") | |
| 88 | + if isinstance(field_type, dict): | |
| 89 | + print(f" Type: {json.dumps(field_type, indent=2)}") | |
| 90 | + else: | |
| 91 | + print(f" Type: {field_type}") | |
| 92 | + else: | |
| 93 | + print(f"✗ Field '{field}' is missing") | |
| 94 | + | |
| 95 | + return True | |
| 96 | + except Exception as e: | |
| 97 | + print(f"✗ Failed to fetch mapping: {e}") | |
| 98 | + return False | |
| 99 | + | |
| 100 | + | |
| 101 | +def test_query_item(es, item_id="3302275"): | |
| 102 | + """Query an item and inspect its vectors""" | |
| 103 | + print("\n" + "="*80) | |
| 104 | + print(f"Querying item {item_id}") | |
| 105 | + print("="*80) | |
| 106 | + | |
| 107 | + try: | |
| 108 | + response = es.search( | |
| 109 | + index=ES_CONFIG['index_name'], | |
| 110 | + body={ | |
| 111 | + "query": { | |
| 112 | + "term": { | |
| 113 | + "_id": item_id | |
| 114 | + } | |
| 115 | + }, | |
| 116 | + "_source": { | |
| 117 | + "includes": ["_id", "name_zh", "embedding_name_zh", "embedding_pic_h14"] | |
| 118 | + } | |
| 119 | + } | |
| 120 | + ) | |
| 121 | + | |
| 122 | + if response['hits']['hits']: | |
| 123 | + hit = response['hits']['hits'][0] | |
| 124 | + print(f"✓ Found item {item_id}") | |
| 125 | + print(f" Name: {hit['_source'].get('name_zh', 'N/A')}") | |
| 126 | + | |
| 127 | + # Inspect the vectors | |
| 128 | + name_vector = hit['_source'].get('embedding_name_zh') | |
| 129 | + if name_vector: | |
| 130 | + print(f" Name vector dimensions: {len(name_vector)}") | |
| 131 | + print(f" Name vector sample: {name_vector[:5]}...") | |
| 132 | + else: | |
| 133 | + print(" ✗ Name vector is missing") | |
| 134 | + | |
| 135 | + pic_data = hit['_source'].get('embedding_pic_h14') | |
| 136 | + if pic_data and isinstance(pic_data, list) and len(pic_data) > 0: | |
| 137 | + pic_vector = pic_data[0].get('vector') if isinstance(pic_data[0], dict) else None | |
| 138 | + if pic_vector: | |
| 139 | + print(f" Picture vector dimensions: {len(pic_vector)}") | |
| 140 | + print(f" Picture vector sample: {pic_vector[:5]}...") | |
| 141 | + else: | |
| 142 | + print(" ✗ Picture vector is missing") | |
| 143 | + else: | |
| 144 | + print(" ✗ Picture data is missing") | |
| 145 | + | |
| 146 | + return hit['_source'] | |
| 147 | + else: | |
| 148 | + print(f"✗ Item {item_id} not found") | |
| 149 | + return None | |
| 150 | + except Exception as e: | |
| 151 | + print(f"✗ Item query failed: {e}") | |
| 152 | + return None | |
| 153 | + | |
| 154 | + | |
| 155 | +def test_knn_query(es, item_id="3302275"): | |
| 156 | + """Run KNN vector queries""" | |
| 157 | + print("\n" + "="*80) | |
| 158 | + print(f"Testing KNN queries (item {item_id})") | |
| 159 | + print("="*80) | |
| 160 | + | |
| 161 | + # First fetch the item's vectors | |
| 162 | + item_data = test_query_item(es, item_id) | |
| 163 | + if not item_data: | |
| 164 | + print("Could not fetch item vectors; skipping KNN test") | |
| 165 | + return False | |
| 166 | + | |
| 167 | + # KNN query on the name vector | |
| 168 | + name_vector = item_data.get('embedding_name_zh') | |
| 169 | + if name_vector: | |
| 170 | + try: | |
| 171 | + print("\nRunning name-vector KNN query...") | |
| 172 | + response = es.search( | |
| 173 | + index=ES_CONFIG['index_name'], | |
| 174 | + body={ | |
| 175 | + "knn": { | |
| 176 | + "field": "embedding_name_zh", | |
| 177 | + "query_vector": name_vector, | |
| 178 | + "k": 5, | |
| 179 | + "num_candidates": 10 | |
| 180 | + }, | |
| 181 | + "_source": ["_id", "name_zh"], | |
| 182 | + "size": 5 | |
| 183 | + } | |
| 184 | + ) | |
| 185 | + | |
| 186 | + print("✓ Name-vector KNN query succeeded") | |
| 187 | + print(f" Found {len(response['hits']['hits'])} similar items:") | |
| 188 | + for idx, hit in enumerate(response['hits']['hits'], 1): | |
| 189 | + print(f" {idx}. ID: {hit['_id']}, Name: {hit['_source'].get('name_zh', 'N/A')}, Score: {hit['_score']:.4f}") | |
| 190 | + except Exception as e: | |
| 191 | + print(f"✗ Name-vector KNN query failed: {e}") | |
| 192 | + | |
| 193 | + # KNN query on the picture vector | |
| 194 | + pic_data = item_data.get('embedding_pic_h14') | |
| 195 | + if pic_data and isinstance(pic_data, list) and len(pic_data) > 0: | |
| 196 | + pic_vector = pic_data[0].get('vector') if isinstance(pic_data[0], dict) else None | |
| 197 | + if pic_vector: | |
| 198 | + try: | |
| 199 | + print("\nRunning picture-vector KNN query...") | |
| 200 | + response = es.search( | |
| 201 | + index=ES_CONFIG['index_name'], | |
| 202 | + body={ | |
| 203 | + "knn": { | |
| 204 | + "field": "embedding_pic_h14.vector", | |
| 205 | + "query_vector": pic_vector, | |
| 206 | + "k": 5, | |
| 207 | + "num_candidates": 10 | |
| 208 | + }, | |
| 209 | + "_source": ["_id", "name_zh"], | |
| 210 | + "size": 5 | |
| 211 | + } | |
| 212 | + ) | |
| 213 | + | |
| 214 | + print("✓ Picture-vector KNN query succeeded") | |
| 215 | + print(f" Found {len(response['hits']['hits'])} similar items:") | |
| 216 | + for idx, hit in enumerate(response['hits']['hits'], 1): | |
| 217 | + print(f" {idx}. ID: {hit['_id']}, Name: {hit['_source'].get('name_zh', 'N/A')}, Score: {hit['_score']:.4f}") | |
| 218 | + except Exception as e: | |
| 219 | + print(f"✗ Picture-vector KNN query failed: {e}") | |
| 220 | + | |
| 221 | + return True | |
| 222 | + | |
| 223 | + | |
| 224 | +def main(): | |
| 225 | + """主函数""" | |
| 226 | + print("\n" + "="*80) | |
| 227 | + print("Elasticsearch vector query test") | |
| 228 | + print("="*80) | |
| 229 | + | |
| 230 | + # 1. Test the connection | |
| 231 | + es = test_connection() | |
| 232 | + if not es: | |
| 233 | + return 1 | |
| 234 | + | |
| 235 | + # 2. Test the index | |
| 236 | + if not test_index_exists(es): | |
| 237 | + return 1 | |
| 238 | + | |
| 239 | + # 3. Test the mapping | |
| 240 | + test_mapping(es) | |
| 241 | + | |
| 242 | + # 4. Query a sample item | |
| 243 | + # Default test ID; if it does not exist the lookup fails, so change it to a real item ID | |
| 244 | + test_item_id = "3302275" | |
| 245 | + print(f"\nNote: if item ID {test_item_id} does not exist, change the test_item_id variable to a real item ID") | |
| 246 | + | |
| 247 | + item_data = test_query_item(es, test_item_id) | |
| 248 | + | |
| 249 | + # 5. Test KNN queries | |
| 250 | + if item_data: | |
| 251 | + test_knn_query(es, test_item_id) | |
| 252 | + | |
| 253 | + print("\n" + "="*80) | |
| 254 | + print("Tests complete!") | |
| 255 | + print("="*80) | |
| 256 | + print("\nIf all tests pass, you can run:") | |
| 257 | + print(" python scripts/i2i_content_similar.py") | |
| 258 | + print("\n") | |
| 259 | + | |
| 260 | + return 0 | |
| 261 | + | |
| 262 | + | |
| 263 | +if __name__ == '__main__': | |
| 264 | + import sys | |
| 265 | + sys.exit(main()) | |
| 266 | + | ... | ... |
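Both `test_query_item` and `test_knn_query` above repeat the same unwrapping of the nested `embedding_pic_h14` payload (a list of dicts with a `vector` key). That logic could be factored into one small pure helper; a sketch matching the structure the script assumes:

```python
# Sketch: factor out the nested picture-embedding unwrapping duplicated in the test script.
# embedding_pic_h14 is assumed to be a list of dicts like [{"vector": [...]}, ...].

def extract_pic_vector(pic_data):
    """Return the first picture vector, or None if the payload is absent or malformed."""
    if not pic_data or not isinstance(pic_data, list):
        return None
    first = pic_data[0]
    if isinstance(first, dict):
        return first.get("vector")
    return None
```

Usage: `pic_vector = extract_pic_vector(item_data.get('embedding_pic_h14'))`, replacing the inline `isinstance` chains in both test functions.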