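# E-commerce Data Crawler Toolkit

A set of tools for crawling product data from e-commerce platforms (Shopee, Amazon) through the Onebound (万邦) API.

## Contents

- [Shopee Crawler](#shopee-crawler)
- [Amazon Crawler](#amazon-crawler)
- [General Notes](#general-notes)

---

# Shopee Crawler

A simplified script for crawling Shopee product data through the Onebound API.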

## Files

```
data_crawling/
├── shopee_crawler.py       # Main crawler script
├── test_crawler.py         # Test script (crawls the first 5 queries)
├── queries.txt             # Query list (5,024 queries)
├── 万邦API_shopee.md       # API documentation
├── shopee_results/         # Crawl results directory
└── test_results/           # Test results directory
```

## Quick Start

### 1. Install dependencies

```bash
pip install requests
```

### 2. Test run (recommended before a full run)

```bash
cd /home/tw/SearchEngine/data_crawling
python test_crawler.py
```

This crawls the first 5 queries and saves the results in the `test_results/` directory.

### 3. Full run

```bash
python shopee_crawler.py
```

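This crawls all 5,024 queries and saves the results in the `shopee_results/` directory.

## Configuration

The following parameters can be changed at the top of the script: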

```python
COUNTRY = '.com.my'  # Site: .vn, .co.th, .tw, .co.id, .sg, .com.my
PAGE = 1             # Page number
DELAY = 2            # Delay between requests (seconds)
MAX_RETRIES = 3      # Maximum number of retries
```
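How `DELAY` and `MAX_RETRIES` interact is not spelled out here. The following is a minimal sketch, assuming each query gets at most `MAX_RETRIES` attempts with a `DELAY`-second pause between them. `fetch_with_retries` and the `fetch` callable are hypothetical names and are not taken from `shopee_crawler.py`:

```python
import time

DELAY = 2          # Delay between requests (seconds)
MAX_RETRIES = 3    # Maximum number of attempts per query (assumed meaning)


def fetch_with_retries(fetch, query, retries=MAX_RETRIES, delay=DELAY, sleep=time.sleep):
    """Call fetch(query) until it succeeds or `retries` attempts are used up.

    `fetch` is a hypothetical callable that returns the parsed JSON response
    or raises an exception on failure. Returns None if every attempt fails.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(query)
        except Exception as exc:  # a real crawler would catch narrower errors
            print(f"attempt {attempt}/{retries} failed for {query!r}: {exc}")
            if attempt < retries:
                sleep(delay)  # wait before the next attempt
    return None
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.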

## Output

### File naming

```
0001_Bohemian_Maxi_Dress_20231204_143025.json
0002_Vintage_Denim_Jacket_20231204_143028.json
...
```
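The names above appear to combine a four-digit sequence number, the query with spaces replaced by underscores, and a `YYYYMMDD_HHMMSS` timestamp. A minimal sketch of that pattern follows; `result_filename` is a hypothetical helper, and the handling of characters other than spaces is an assumption:

```python
import re
from datetime import datetime


def result_filename(index, query, when):
    """Build a name like 0001_Bohemian_Maxi_Dress_20231204_143025.json."""
    # Assumption: any run of characters that is not a letter, digit, or
    # underscore is collapsed to a single underscore.
    safe_query = re.sub(r"\W+", "_", query.strip()).strip("_")
    timestamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{index:04d}_{safe_query}_{timestamp}.json"


print(result_filename(1, "Bohemian Maxi Dress", datetime(2023, 12, 4, 14, 30, 25)))
# prints 0001_Bohemian_Maxi_Dress_20231204_143025.json
```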

### JSON format

```json
{
  "items": {
    "keyword": "dress",
    "total_results": 5000,
    "item": [
      {
        "title": "Product title",
        "pic_url": "Image URL",
        "price": 38.9,
        "sales": 293,
        "shop_id": "277113808",
        "detail_url": "Product URL"
      }
    ]
  },
  "error_code": "0000",
  "reason": "ok"
}
```
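To read products out of a saved result, something like the sketch below may be enough. It assumes that `error_code` `"0000"` signals success (the example pairs it with `"reason": "ok"`); `extract_items` is a hypothetical helper name:

```python
import json


def extract_items(response):
    """Return the product list from a response, or [] if the call failed."""
    if response.get("error_code") != "0000":  # assumed success code
        return []
    return response.get("items", {}).get("item", [])


sample = json.loads("""
{
  "items": {
    "keyword": "dress",
    "total_results": 5000,
    "item": [
      {"title": "Product title", "price": 38.9, "sales": 293, "shop_id": "277113808"}
    ]
  },
  "error_code": "0000",
  "reason": "ok"
}
""")

for item in extract_items(sample):
    print(item["title"], item["price"], item["sales"])
```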

### Summary file

`summary.json` contains the crawl statistics:

```json
{
  "crawl_time": "2023-12-04T14:30:25",
  "total": 5024,
  "success": 5000,
  "fail": 24,
  "elapsed_seconds": 10248,
  "failed_queries": ["query1", "query2"]
}
```

## Estimated time

- Per query: ~4 seconds (including the 2-second delay)
- All 5,024 queries: ~5.6 hours
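The total follows directly from the per-query figure, as this short calculation shows:

```python
QUERIES = 5024
SECONDS_PER_QUERY = 4  # about 2 s for the request plus the 2 s delay

total_seconds = QUERIES * SECONDS_PER_QUERY
total_hours = total_seconds / 3600
print(f"{total_seconds} s = {total_hours:.1f} h")
# prints 20096 s = 5.6 h
```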

## API information

- **Endpoint**: `https://api-gw.onebound.cn/shopee/item_search`
- **Key**: `t8618339029`
- **Secret**: `9029f568`
- **Daily quota**: 10,000 calls

## Common commands

```bash
# Test run (first 5 queries)
python test_crawler.py

# Full crawl
python shopee_crawler.py

# Count result files
ls shopee_results/*.json | wc -l

# Show failed queries
cat shopee_results/failed_queries.txt

# Show the summary
cat shopee_results/summary.json
```

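## Notes

1. **API quota**: at most 10,000 calls per day
2. **Request delay**: keep it at 2 seconds or more
3. **Network**: a stable network connection is required
4. **Storage**: make sure there is enough disk space (about 2-3 GB)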

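## Troubleshooting

**Problem**: The API returns an error

- Check the network connection
- Confirm the API quota has not been used up
- Check `failed_queries.txt`

**Problem**: File encoding errors

- The script already uses UTF-8 encoding; no extra setup is needed

**Problem**: Resuming after an interruption

- Remove the queries that already have JSON result files from the query list, then run the script again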
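The resume tip above can be automated instead of editing the query list by hand. This is a rough sketch, assuming the four-digit prefix of each result file is the 1-based line number of its query in `queries.txt`; `remaining_queries` is a hypothetical helper and not part of the crawler:

```python
import re
from pathlib import Path


def remaining_queries(queries, results_dir):
    """Return the queries that have no result file in results_dir yet."""
    done = set()
    for path in Path(results_dir).glob("*.json"):
        match = re.match(r"(\d{4})_", path.name)
        if match:  # skips summary.json and other unnumbered files
            done.add(int(match.group(1)))
    # Assumption: prefix 0001 corresponds to the first line of queries.txt.
    return [q for i, q in enumerate(queries, start=1) if i not in done]
```

Calling `remaining_queries(queries, "shopee_results")` with `queries` read from `queries.txt` gives the list to crawl on the next run. Queries that failed without producing a file are treated as remaining.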

---

# Amazon Crawler

Scripts for crawling Amazon product data through the Onebound API.

## Quick Start

### 1. Configure the API key

#### Method 1: configuration file (recommended)

```bash
cd /home/tw/SearchEngine/data_crawling
cp config.example.py config.py
# Edit config.py and fill in your API key
```

#### Method 2: command-line arguments

```bash
python amazon_crawler_v2.py --key YOUR_KEY --secret YOUR_SECRET
```

#### Method 3: environment variables

```bash
export ONEBOUND_API_KEY="your_key_here"
export ONEBOUND_API_SECRET="your_secret_here"
```
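On the Python side, these variables can be read with `os.environ`. The sketch below is illustrative only: `load_credentials` is a hypothetical helper, and how `amazon_crawler_v2.py` actually prioritizes the three methods is not documented here:

```python
import os


def load_credentials(env=os.environ):
    """Read the API key and secret from ONEBOUND_API_KEY / ONEBOUND_API_SECRET."""
    key = env.get("ONEBOUND_API_KEY", "").strip()
    secret = env.get("ONEBOUND_API_SECRET", "").strip()
    if not key or not secret:
        raise RuntimeError("ONEBOUND_API_KEY / ONEBOUND_API_SECRET not set")
    return key, secret
```

Stripping whitespace guards against stray spaces picked up when the values are pasted into a shell profile.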

### 2. Test the API connection

```bash
python test_api.py
```

This checks that the API connection works and shows quota information.

### 3. Start crawling

#### Test mode (first 10 queries)

```bash
python amazon_crawler_v2.py --max 10
```

#### Full crawl (all 5,024 queries)

```bash
python amazon_crawler_v2.py
```

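## Scripts

### amazon_crawler.py
Basic crawler; the API key must be set in the code.

### amazon_crawler_v2.py (recommended)
Enhanced crawler with support for:
- Configuration file, command-line arguments, or environment variables
- Detailed log output
- Progress display
- Statistics
- Resuming an interrupted crawl

### test_api.py
API connection test tool that verifies:
- The API key is correct
- The network connection works
- The API quota is sufficient

### analyze_results.py
Result analysis tool for:
- Summarizing crawl results
- Analyzing product data (price, rating, etc.)
- Exporting to CSV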

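## Command-line options

```bash
python amazon_crawler_v2.py [options]

Options:
  --key KEY          API Key
  --secret SECRET    API Secret
  --queries FILE     Path to the query file (default: queries.txt)
  --delay SECONDS    Delay between requests (default: 2.0 seconds)
  --start INDEX      Start index, used to resume a crawl (default: 0)
  --max NUM          Maximum number of queries to crawl (default: all)
  --output DIR       Directory for results (default: amazon_results)
```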
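For readers who want to see how such options map onto Python's `argparse`, here is a sketch that mirrors the documented flags and defaults. It is not the script's actual parser; `build_parser` and the `max_queries` attribute name are hypothetical:

```python
import argparse


def build_parser():
    """Parser mirroring the documented options (a sketch, not the script's code)."""
    parser = argparse.ArgumentParser(prog="amazon_crawler_v2.py")
    parser.add_argument("--key", help="API Key")
    parser.add_argument("--secret", help="API Secret")
    parser.add_argument("--queries", default="queries.txt", metavar="FILE",
                        help="path to the query file")
    parser.add_argument("--delay", type=float, default=2.0, metavar="SECONDS",
                        help="delay between requests")
    parser.add_argument("--start", type=int, default=0, metavar="INDEX",
                        help="start index, used to resume a crawl")
    parser.add_argument("--max", type=int, default=None, metavar="NUM",
                        dest="max_queries", help="maximum number of queries")
    parser.add_argument("--output", default="amazon_results", metavar="DIR",
                        help="directory for results")
    return parser


args = build_parser().parse_args(["--start", "100", "--max", "10"])
print(args.start, args.max_queries, args.delay, args.output)
# prints 100 10 2.0 amazon_results
```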

## Usage examples

### Test the first 10 queries

```bash
python amazon_crawler_v2.py --max 10
```

### Resume from query 100

```bash
python amazon_crawler_v2.py --start 100
```

### Use a custom delay

```bash
python amazon_crawler_v2.py --delay 1.5
```

### Analyze crawl results

```bash
# Show statistics
python analyze_results.py

# Export to CSV
python analyze_results.py --csv

# Specify the results directory
python analyze_results.py --dir amazon_results
```

## Output

### File naming

```
0001_Bohemian_Maxi_Dress.json
0002_Vintage_Denim_Jacket.json
0003_Minimalist_Linen_Trousers.json
...
```

### JSON format

```json
{
  "items": {
    "item": [
      {
        "detail_url": "https://www.amazon.com/...",
        "num_iid": "B07F8S18D5",
        "pic_url": "https://...",
        "price": "9.99",
        "reviews": "53812",
        "sales": 10000,
        "stars": "4.7",
        "title": "Product title"
      }
    ],
    "page": "1",
    "page_size": 100,
    "real_total_results": 700,
    "q": "Search query"
  },
  "error_code": "0000",
  "reason": "ok"
}
```
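In this example `price`, `reviews`, and `stars` arrive as strings while `sales` is a number, so analysis code usually needs a conversion step first. A minimal sketch, assuming only the field types shown above; `normalize_item` is a hypothetical helper:

```python
def normalize_item(item):
    """Convert the string-typed numeric fields of one Amazon item to numbers."""
    def to_number(value, cast):
        try:
            return cast(value)
        except (TypeError, ValueError):
            return None  # missing or non-numeric value

    return {
        "num_iid": item.get("num_iid"),
        "title": item.get("title"),
        "price": to_number(item.get("price"), float),
        "stars": to_number(item.get("stars"), float),
        "reviews": to_number(item.get("reviews"), int),
        "sales": to_number(item.get("sales"), int),
    }


raw = {"num_iid": "B07F8S18D5", "price": "9.99", "reviews": "53812",
       "sales": 10000, "stars": "4.7", "title": "Product title"}
print(normalize_item(raw))
```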

### Log file

The run log is saved to `amazon_crawler.log`:

```
2025-01-07 10:00:00 - INFO - Amazon爬虫启动
2025-01-07 10:00:01 - INFO - [1/5024] (0.0%) - Bohemian Maxi Dress
2025-01-07 10:00:02 - INFO - ✓ 成功: Bohemian Maxi Dress - 获得 700 个结果
...
```

### Analysis report

Running the analysis tool generates `analysis_report.json`:

```json
{
  "total_files": 5024,
  "successful": 5000,
  "failed": 24,
  "success_rate": "99.5%",
  "total_items": 50000,
  "avg_items_per_query": 10.0
}
```
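The derived fields in this report can be reproduced from per-query item counts. The sketch below assumes that the average is taken over successful queries only, which is a guess since either reading rounds to the same value in the example; `build_report` is a hypothetical function and not the code in `analyze_results.py`:

```python
def build_report(item_counts, failed):
    """Build report fields from per-query item counts and a failure count.

    item_counts: number of items in each successful result file.
    failed: number of queries that produced no usable result.
    """
    successful = len(item_counts)
    total_files = successful + failed
    total_items = sum(item_counts)
    return {
        "total_files": total_files,
        "successful": successful,
        "failed": failed,
        "success_rate": f"{successful / total_files:.1%}" if total_files else "0.0%",
        "total_items": total_items,
        # Assumption: averaged over successful queries only.
        "avg_items_per_query": round(total_items / successful, 1) if successful else 0.0,
    }


report = build_report([10] * 5000, failed=24)
print(report["success_rate"], report["avg_items_per_query"])
# prints 99.5% 10.0
```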

## Estimated time

- Per query: ~4 seconds (including the 2-second delay)
- All 5,024 queries: ~5.6 hours

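## Notes

1. **API quota**: mind the API's daily and monthly quota limits
2. **Request delay**: keep the delay at 2 seconds or more
3. **Resuming**: use the `--start` option to continue from a given position
4. **Disk space**: make sure there is enough storage (about 2-3 GB)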

## Detailed documentation

For more details, see:
- [Amazon crawler detailed documentation](AMAZON_CRAWLER_README.md)
- [API documentation](万邦API_亚马逊.md)

---

# General Notes

## File structure

```
data_crawling/
├── shopee_crawler.py          # Shopee crawler
├── test_crawler.py            # Shopee test script
├── amazon_crawler.py          # Amazon crawler (basic version)
├── amazon_crawler_v2.py       # Amazon crawler (enhanced version)
├── test_api.py                # API test tool
├── analyze_results.py         # Result analysis tool
├── config.example.py          # Example configuration file
├── queries.txt                # Query list (5,024 queries)
├── 万邦API_shopee.md          # Shopee API documentation
├── 万邦API_亚马逊.md          # Amazon API documentation
├── AMAZON_CRAWLER_README.md   # Amazon crawler detailed documentation
├── shopee_results/            # Shopee results directory
├── amazon_results/            # Amazon results directory
└── test_results/              # Test results directory
```

## Installing dependencies

```bash
pip install requests
```

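## FAQ

### 1. API key error

**Error**: `错误: 未配置API密钥！` ("Error: API key not configured!")

**Fix**:
- Check that the configuration file is correct
- Test the connection with `test_api.py`
- Make sure the key has no stray spaces or quotes

### 2. Request timeout

**Error**: `请求失败: timeout` ("Request failed: timeout")

**Fix**:
- Check the network connection
- Try increasing the delay
- Check whether the API service is up

### 3. API quota exhausted

**Error**: `quota exceeded`

**Fix**:
- Check the `api_info` field in the log
- Wait for the quota to reset
- Use the `--max` option to limit the number of queries crawled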
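A safe `--max` value can be estimated from the remaining quota. This is a rough sketch, assuming every API call counts against the quota, including any retries; `max_queries_for_quota` and its `calls_per_query` parameter are hypothetical:

```python
def max_queries_for_quota(remaining_calls, calls_per_query=1):
    """Largest --max value that cannot exceed the remaining quota.

    calls_per_query is the worst-case number of API calls one query may use
    (for example, 3 if a failed request is attempted up to three times).
    """
    if calls_per_query < 1:
        raise ValueError("calls_per_query must be at least 1")
    return remaining_calls // calls_per_query


print(max_queries_for_quota(10_000, calls_per_query=3))
# prints 3333
```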

### 4. Resuming after an interruption

Use the `--start` option to set the starting position:

```bash
# Resume from query 1000
python amazon_crawler_v2.py --start 1000
```

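## License

This project is for learning and research use only. When using the API, comply with the terms of service of the relevant platforms.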