data_crawling/README.md

# E-commerce Data Crawler Toolkit
A set of tools for crawling product data from e-commerce platforms (Shopee, Amazon) through the OneBound (万邦) API.
## Contents
- [Shopee Crawler](#shopee-crawler)
- [Amazon Crawler](#amazon-crawler)
- [General Notes](#general-notes)
---
# Shopee Crawler
A simplified script for crawling Shopee product data through the OneBound (万邦) API.
## Files
```
data_crawling/
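├── shopee_crawler.py       # Main crawler script
├── test_crawler.py         # Test script (crawls the first 5 queries)
├── queries.txt             # Query list (5024 queries)
├── 万邦API_shopee.md       # API documentation
├── shopee_results/         # Crawl results directory
└── test_results/           # Test results directory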
```
## Quick Start
### 1. Install dependencies
```bash
pip install requests
```
### 2. Test run (recommended first)
```bash
cd /home/tw/SearchEngine/data_crawling
python test_crawler.py
```
This crawls the first 5 queries and saves the results to the `test_results/` directory.
### 3. Full run
```bash
python shopee_crawler.py
```
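This crawls all 5024 queries and saves the results to the `shopee_results/` directory.
## Configuration
The following parameters can be changed at the top of the script: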
```python
COUNTRY = '.com.my'  # Site: .vn, .co.th, .tw, .co.id, .sg, .com.my
PAGE = 1             # Page number
DELAY = 2            # Delay between requests (seconds)
MAX_RETRIES = 3      # Maximum number of retries
```
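As a rough illustration of how `DELAY` and `MAX_RETRIES` could interact, the sketch below retries a failed request and pauses between attempts. It is not taken from `shopee_crawler.py`: the helper name `fetch_with_retries` and the injected `fetch` and `sleep` callables are hypothetical, and it assumes `MAX_RETRIES` counts total attempts per query.

```python
import time

DELAY = 2        # seconds between attempts (same value as in the config above)
MAX_RETRIES = 3  # assumed here to mean total attempts per query


def fetch_with_retries(fetch, query, max_retries=MAX_RETRIES, delay=DELAY, sleep=time.sleep):
    """Call fetch(query) until it succeeds or max_retries attempts are used.

    `fetch` is a hypothetical callable that returns parsed JSON or raises.
    Returns the result, or None if every attempt failed.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(query)
        except Exception as exc:  # a real script would catch narrower errors
            print(f"attempt {attempt}/{max_retries} failed for {query!r}: {exc}")
            if attempt < max_retries:
                sleep(delay)  # wait before the next attempt
    return None
```

Passing `sleep` as a parameter keeps the sketch easy to test without real waiting.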
## Output
### File naming
```
0001_Bohemian_Maxi_Dress_20231204_143025.json
0002_Vintage_Denim_Jacket_20231204_143028.json
...
```
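The naming pattern above looks like a zero-padded index, the query with spaces replaced by underscores, and a timestamp. A minimal sketch that reproduces it is shown here; `result_filename` is a hypothetical helper, and the exact sanitizing rule used by the real script is an assumption.

```python
import re
from datetime import datetime


def result_filename(index, query, when):
    """Build a name like 0001_Bohemian_Maxi_Dress_20231204_143025.json."""
    # Replace each run of characters that is not a letter, digit, or underscore with "_".
    safe_query = re.sub(r"\W+", "_", query.strip()).strip("_")
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{index:04d}_{safe_query}_{timestamp}.json"


print(result_filename(1, "Bohemian Maxi Dress", datetime(2023, 12, 4, 14, 30, 25)))
# prints 0001_Bohemian_Maxi_Dress_20231204_143025.json
```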
### JSON format
```json
{
  "items": {
    "keyword": "dress",
    "total_results": 5000,
    "item": [
      {
        "title": "Product title",
        "pic_url": "Image URL",
        "price": 38.9,
        "sales": 293,
        "shop_id": "277113808",
        "detail_url": "Product URL"
      }
      }
    ]
  },
  "error_code": "0000",
  "reason": "ok"
}
```
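To show how a saved response of this shape might be consumed, the sketch below pulls the product list out of the payload. It assumes that `error_code` `"0000"` signals success, which is inferred from the sample above rather than confirmed by the API documentation; `extract_items` is a hypothetical helper.

```python
import json


def extract_items(payload):
    """Return the product list from a response, or [] if the call failed."""
    if payload.get("error_code") != "0000":  # assumed success code
        return []
    return payload.get("items", {}).get("item", [])


sample = json.loads("""
{
  "items": {"keyword": "dress", "total_results": 5000,
            "item": [{"title": "Sample", "price": 38.9, "sales": 293}]},
  "error_code": "0000",
  "reason": "ok"
}
""")
for product in extract_items(sample):
    print(product["title"], product["price"], product["sales"])
```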
### Summary file
`summary.json` contains the crawl statistics:
```json
{
  "crawl_time": "2023-12-04T14:30:25",
  "total": 5024,
  "success": 5000,
  "fail": 24,
  "elapsed_seconds": 10248,
  "failed_queries": ["query1", "query2"]
}
```
## Estimated time
- Per query: ~4 seconds (including the 2-second delay)
- All 5024 queries: ~5.6 hours
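The total follows directly from the per-query figure, as this small calculation shows. It assumes the crawl runs sequentially at a steady ~4 seconds per query with no retries.

```python
def estimated_hours(num_queries, seconds_per_query):
    """Total wall-clock time in hours for a sequential crawl."""
    return num_queries * seconds_per_query / 3600


print(f"{estimated_hours(5024, 4):.1f} hours")  # prints 5.6 hours
```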
## API information
- **URL**: `https://api-gw.onebound.cn/shopee/item_search`
- **Key**: `t8618339029`
- **Secret**: `9029f568`
- **Daily quota**: 10,000 calls
## Common commands
```bash
# Test run (first 5 queries)
python test_crawler.py
# Full crawl
python shopee_crawler.py
# Count result files (the count includes summary.json)
ls shopee_results/*.json | wc -l
# List failed queries
cat shopee_results/failed_queries.txt
# Show the summary
cat shopee_results/summary.json
```
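## Notes
1. **API quota**: at most 10,000 calls per day
2. **Request interval**: keep it at 2 seconds or more
3. **Network stability**: a stable network connection is required
4. **Storage space**: make sure enough disk space is available (about 2-3 GB)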
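A quick budget check can show whether a full run fits in the daily quota. The sketch below assumes that every attempt, including a failed retry, counts against the quota and that `MAX_RETRIES = 3` means up to three attempts per query. Neither assumption is confirmed by the API documentation.

```python
DAILY_QUOTA = 10_000  # calls per day, from the note above


def worst_case_calls(num_queries, max_attempts):
    """Upper bound on API calls if every query uses all of its attempts."""
    return num_queries * max_attempts


best = worst_case_calls(5024, 1)   # every query succeeds on the first try
worst = worst_case_calls(5024, 3)  # every query needs all three attempts
print(best, best <= DAILY_QUOTA)    # prints 5024 True
print(worst, worst <= DAILY_QUOTA)  # prints 15072 False
```

Under these assumptions a clean run fits comfortably, while a run with many retries might not.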
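## Troubleshooting
**Problem**: The API returns an error
- Check the network connection
- Confirm that the API quota has not been used up
- Check `failed_queries.txt`

**Problem**: File encoding errors
- The script already uses UTF-8 encoding; no extra setup is needed

**Problem**: Resuming after an interruption
- Remove from `queries.txt` the queries that already have a crawled JSON file, then rerun the script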
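The manual resume step could be automated along these lines. This is a sketch only: it assumes the four-digit prefix of each result file is the 1-based position of the query in `queries.txt`, and both helper names are hypothetical.

```python
import re
from pathlib import Path


def crawled_indices(results_dir):
    """Collect the 1-based indices found in names like 0001_<query>_<timestamp>.json."""
    indices = set()
    for path in Path(results_dir).glob("*.json"):
        match = re.match(r"(\d{4})_", path.name)
        if match:  # files such as summary.json have no numeric prefix and are skipped
            indices.add(int(match.group(1)))
    return indices


def remaining_queries(queries, done):
    """Keep the queries whose 1-based position is not in `done`."""
    return [q for i, q in enumerate(queries, start=1) if i not in done]
```

The remaining queries could then be written to the query file before rerunning. Trimming the list would presumably restart the numbering of new result files, so keeping a backup of the original `queries.txt` seems prudent.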
---
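# Amazon Crawler
A script for crawling Amazon product data through the OneBound (万邦) API.
## Quick Start
### 1. Configure the API key
#### Option 1: Configuration file (recommended)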
```bash
cd /home/tw/SearchEngine/data_crawling
cp config.example.py config.py
# Edit config.py and fill in your API key
```
#### Option 2: Command-line arguments
```bash
python amazon_crawler_v2.py --key YOUR_KEY --secret YOUR_SECRET
```
#### Option 3: Environment variables
```bash
export ONEBOUND_API_KEY="your_key_here"
export ONEBOUND_API_SECRET="your_secret_here"
```
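With three ways to supply credentials, some precedence rule is needed. The README does not state the order `amazon_crawler_v2.py` uses, so the sketch below simply assumes command line first, then environment, then config file. `resolve_credentials` and the config names `API_KEY` / `API_SECRET` are hypothetical.

```python
import os


def resolve_credentials(cli_key=None, cli_secret=None, env=None, config=None):
    """Return (key, secret) from the first source that provides both.

    Assumed precedence: command line, then environment, then config file.
    """
    env = os.environ if env is None else env
    config = config or {}
    candidates = [
        (cli_key, cli_secret),
        (env.get("ONEBOUND_API_KEY"), env.get("ONEBOUND_API_SECRET")),
        (config.get("API_KEY"), config.get("API_SECRET")),  # hypothetical config names
    ]
    for key, secret in candidates:
        if key and secret:
            return key, secret
    raise SystemExit("API key not configured")
```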
### 2. Test the API connection
```bash
python test_api.py
```
This checks that the API connection works and displays quota information.
### 3. Start crawling
#### Test mode (first 10 queries)
```bash
python amazon_crawler_v2.py --max 10
```
#### Full crawl (all 5024 queries)
```bash
python amazon_crawler_v2.py
```
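## Scripts
### amazon_crawler.py
Basic crawler; the API key must be configured in the code.
### amazon_crawler_v2.py (recommended)
Enhanced crawler with support for:
- Configuration file, command-line arguments, or environment variables
- Detailed log output
- Progress display
- Statistics
- Resuming an interrupted crawl
### test_api.py
API connection test tool. It verifies that:
- The API key is correct
- The network connection works
- The API quota is sufficient
### analyze_results.py
Result analysis tool. It can:
- Summarize crawl results
- Analyze product data (price, rating, etc.)
- Export to CSV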
## Command-line options
```bash
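python amazon_crawler_v2.py [options]
Options:
  --key KEY          API Key
  --secret SECRET    API Secret
  --queries FILE     Query file path (default: queries.txt)
  --delay SECONDS    Request interval in seconds (default: 2.0)
  --start INDEX      Start index, used to resume a crawl (default: 0)
  --max NUM          Maximum number of queries to crawl (default: all)
  --output DIR       Results directory (default: amazon_results)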
```
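For readers who want to see how such an interface is typically declared, here is a sketch using the standard `argparse` module. The option names and defaults mirror the table above; the types are assumptions, and this is not the parser from `amazon_crawler_v2.py`.

```python
import argparse


def build_parser():
    """Declare options matching the table above (types are assumed)."""
    parser = argparse.ArgumentParser(description="Amazon crawler (sketch)")
    parser.add_argument("--key", help="API key")
    parser.add_argument("--secret", help="API secret")
    parser.add_argument("--queries", default="queries.txt", help="query file path")
    parser.add_argument("--delay", type=float, default=2.0, help="seconds between requests")
    parser.add_argument("--start", type=int, default=0, help="start index for resuming")
    parser.add_argument("--max", type=int, default=None, help="maximum number of queries")
    parser.add_argument("--output", default="amazon_results", help="results directory")
    return parser


args = build_parser().parse_args(["--start", "100", "--max", "10"])
print(args.start, args.max, args.delay, args.output)  # prints 100 10 2.0 amazon_results
```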
## Usage examples
### Test the first 10 queries
```bash
python amazon_crawler_v2.py --max 10
```
### Resume from query 100
```bash
python amazon_crawler_v2.py --start 100
```
### Use a custom delay
```bash
python amazon_crawler_v2.py --delay 1.5
```
### Analyze crawl results
```bash
# Show statistics
python analyze_results.py
# Export to CSV
python analyze_results.py --csv
# Specify the results directory
python analyze_results.py --dir amazon_results
```
## Output
### File naming
```
0001_Bohemian_Maxi_Dress.json
0002_Vintage_Denim_Jacket.json
0003_Minimalist_Linen_Trousers.json
...
```
### JSON format
```json
{
  "items": {
    "item": [
      {
        "detail_url": "https://www.amazon.com/...",
        "num_iid": "B07F8S18D5",
        "pic_url": "https://...",
        "price": "9.99",
        "reviews": "53812",
        "sales": 10000,
        "stars": "4.7",
        "title": "Product title"
      }
    ],
    "page": "1",
    "page_size": 100,
    "real_total_results": 700,
    "q": "Search query"
  },
  "error_code": "0000",
  "reason": "ok"
}
```
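In the sample above, `price`, `reviews`, and `stars` arrive as strings while `sales` is a number. Before any numeric analysis they would need converting; the sketch below shows one way, with `normalize_item` as a hypothetical helper and the choice to map unparseable values to `None` as an assumption.

```python
def normalize_item(item):
    """Return a copy with price/stars as float and reviews as int where parseable."""
    converters = {"price": float, "stars": float, "reviews": int}
    cleaned = dict(item)
    for field, convert in converters.items():
        try:
            cleaned[field] = convert(item[field])
        except (KeyError, TypeError, ValueError):
            cleaned[field] = None  # missing or unparseable value
    return cleaned


raw = {"num_iid": "B07F8S18D5", "price": "9.99", "reviews": "53812", "sales": 10000, "stars": "4.7"}
print(normalize_item(raw))
```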
### Log file
The run log is saved to `amazon_crawler.log`:
```
2025-01-07 10:00:00 - INFO - Amazon爬虫启动
2025-01-07 10:00:01 - INFO - [1/5024] (0.0%) - Bohemian Maxi Dress
2025-01-07 10:00:02 - INFO - ✓ 成功: Bohemian Maxi Dress - 获得 700 个结果
...
```
### Analysis report
Running the analysis tool generates `analysis_report.json`:
```json
{
  "total_files": 5024,
  "successful": 5000,
  "failed": 24,
  "success_rate": "99.5%",
  "total_items": 50000,
  "avg_items_per_query": 10.0
}
```
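The derived fields in the report can be reproduced from the raw counts. The sketch below matches the sample numbers above, but how `analyze_results.py` actually computes them is not documented here; in particular, averaging over successful queries only is an assumption, and `build_report` is a hypothetical helper.

```python
def build_report(successful, failed, total_items):
    """Assemble a report dict with the same fields as the sample above."""
    total_files = successful + failed
    return {
        "total_files": total_files,
        "successful": successful,
        "failed": failed,
        "success_rate": f"{successful / total_files:.1%}" if total_files else "0.0%",
        "total_items": total_items,
        # Assumption: the average is taken over successful queries only.
        "avg_items_per_query": round(total_items / successful, 1) if successful else 0.0,
    }


print(build_report(5000, 24, 50000))
```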
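## Estimated time
- Per query: ~4 seconds (including the 2-second delay)
- All 5024 queries: ~5.6 hours
## Notes
1. **API quota**: mind the API's daily/monthly quota limits
2. **Request interval**: keep the delay at 2 seconds or more
3. **Resuming**: use the `--start` option to continue from a given position
4. **Disk space**: make sure enough storage is available (about 2-3 GB)
## Detailed documentation
For more details, see:
- [Detailed Amazon crawler documentation](AMAZON_CRAWLER_README.md)
- [API documentation](万邦API_亚马逊.md)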
---
# General Notes
## File structure
```
data_crawling/
├── shopee_crawler.py          # Shopee crawler
├── test_crawler.py            # Shopee test script
├── amazon_crawler.py          # Amazon crawler (basic version)
├── amazon_crawler_v2.py       # Amazon crawler (enhanced version)
├── test_api.py                # API test tool
├── analyze_results.py         # Result analysis tool
├── config.example.py          # Example configuration file
├── queries.txt                # Query list (5024 queries)
├── 万邦API_shopee.md          # Shopee API documentation
├── 万邦API_亚马逊.md          # Amazon API documentation
├── AMAZON_CRAWLER_README.md   # Detailed Amazon crawler documentation
├── shopee_results/            # Shopee results directory
├── amazon_results/            # Amazon results directory
└── test_results/              # Test results directory
```
## Installing dependencies
```bash
pip install requests
```
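## FAQ
### 1. API key error
**Error**: `错误: 未配置API密钥！` ("Error: API key not configured!")
**Fix**:
- Check that the configuration file is correct
- Use `test_api.py` to test the connection
- Make sure the key has no stray spaces or quotes
### 2. Request timeout
**Error**: `请求失败: timeout` ("Request failed: timeout")
**Fix**:
- Check the network connection
- Try increasing the delay
- Check whether the API service is up
### 3. API quota exhausted
**Error**: `quota exceeded`
**Fix**:
- Check the `api_info` field in the log
- Wait for the quota to reset
- Use the `--max` option to limit the number of queries crawled
### 4. Resuming after an interruption
Use the `--start` option to specify the starting position: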
```bash
# Resume from query 1000
python amazon_crawler_v2.py --start 1000
```
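## License
This project is for learning and research use only. When using the API, comply with the terms of service of the respective platforms.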