

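Reindex from Remote: Key Considerations (Official Documentation Highlights)
Key points on reindex from remote, compiled from the official Elasticsearch documentation, for migrating data from a remote ES cluster (e.g. 8.x) to the local cluster (e.g. 9.x).
Official documentation entry points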

Reindex API (current)
Reindex from remote (upgrade)
REST API: Reindex


Essential considerations
1. Whitelist on the destination cluster (required)

If reindexing from a remote cluster into a cluster using Elastic Stack, you must explicitly allow the remote host using the reindex.remote.whitelist node setting on the destination cluster.


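Configure this in elasticsearch.yml on the destination cluster (the local ES), not on the source cluster.
Format: host:port only, with multiple entries separated by commas; do not include the protocol. For example:
reindex.remote.whitelist: "120.76.41.98:9200"

This is a static node setting, so the nodes that run the reindex (usually the coordinating node that receives the request) must be restarted before a change takes effect.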
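As a minimal sketch of the host:port rule above, the following Python helper derives a whitelist entry from the URL that will later be used as source.remote.host. The function names whitelist_entry and whitelist_setting are hypothetical (they belong neither to Elasticsearch nor to this project), and the sketch assumes the URL carries an explicit port.

```python
from urllib.parse import urlsplit


def whitelist_entry(remote_host_url: str) -> str:
    """Return the host:port form expected by reindex.remote.whitelist."""
    parts = urlsplit(remote_host_url)
    if parts.hostname is None or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {remote_host_url!r}")
    # Drop the scheme: the whitelist takes host:port only.
    return f"{parts.hostname}:{parts.port}"


def whitelist_setting(remote_host_urls: list[str]) -> str:
    """Render the elasticsearch.yml line for one or more remote hosts."""
    entries = ",".join(whitelist_entry(u) for u in remote_host_urls)
    return f'reindex.remote.whitelist: "{entries}"'


print(whitelist_setting(["http://120.76.41.98:9200"]))
```

Deriving the entry from the same URL that goes into the reindex request keeps the two values from drifting apart.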

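2. The destination index must be created in advance

Reindex does not copy the source index's settings or mappings.
The destination index's mapping, shard count, replica count, and so on must be set up locally before calling _reindex.
This project's mappings/search_products.json can be used to create an index of the same name locally.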
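Creating the index ahead of time could look roughly like the sketch below, which uses only the Python standard library. The helper names are hypothetical and are not the project's own functions; the sketch assumes a cluster without authentication and a mapping file that already contains the full create-index body (settings and mappings).

```python
import json
import urllib.request


def build_create_index_request(es_url: str, index: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT /<index> request carrying settings and mappings."""
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_index(es_url: str, index: str, mapping_path: str) -> dict:
    """Send the request; urlopen raises HTTPError if Elasticsearch rejects it,
    for example because the index already exists."""
    with open(mapping_path, encoding="utf-8") as f:
        body = json.load(f)
    req = build_create_index_request(es_url, index, body)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A call such as create_index("http://localhost:9200", "search_products_tenant_0", "mappings/search_products.json") would then have to run before _reindex.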

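3. Required privileges

On the source (remote) cluster, the user used for authentication (e.g. source.remote.username) needs:


Cluster privilege: monitor
Source index privilege: read

On the destination (local) cluster, the user running the reindex needs:


Destination index: write
To have the destination index created automatically: auto_configure, create_index, or manage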
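The privilege lists above map onto role definitions fairly directly. The sketch below builds request bodies in the shape accepted by the Elasticsearch create-role API (PUT /_security/role/<name>); the function names are hypothetical, and picking create_index for the auto-create case is just one of the three options listed.

```python
def source_role(source_index: str) -> dict:
    """Role body for the remote user: monitor on the cluster, read on the source index."""
    return {
        "cluster": ["monitor"],
        "indices": [{"names": [source_index], "privileges": ["read"]}],
    }


def dest_role(dest_index: str, auto_create: bool = False) -> dict:
    """Role body for the local user: write, plus create_index if the index may be auto-created."""
    privileges = ["write"]
    if auto_create:
        privileges.append("create_index")
    return {"indices": [{"names": [dest_index], "privileges": privileges}]}


print(source_role("search_products_tenant_170"))
print(dest_role("search_products_tenant_0"))
```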


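4. _source must be enabled on the source documents

Reindex relies on each document's _source field; if _source is disabled on the source index, that index cannot be reindexed.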
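One way to check this precondition up front is to inspect the response of GET /<index>/_mapping on the source cluster. The helper below is a hypothetical sketch that works on the already-parsed JSON response; it treats a missing _source entry as enabled, which is the Elasticsearch default.

```python
def source_enabled(mapping_response: dict, index: str) -> bool:
    """Return False only if the mapping explicitly sets _source.enabled to false."""
    mappings = mapping_response[index]["mappings"]
    return mappings.get("_source", {}).get("enabled", True)


# Sample responses in the shape returned by GET /<index>/_mapping.
ok = {"search_products_tenant_170": {"mappings": {"properties": {"title": {"type": "text"}}}}}
bad = {"search_products_tenant_170": {"mappings": {"_source": {"enabled": False}}}}
print(source_enabled(ok, "search_products_tenant_170"))
print(source_enabled(bad, "search_products_tenant_170"))
```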

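5. Reindex from remote does not support slicing

The documentation states explicitly: Reindexing from remote clusters does not support manual or automatic slicing.
Neither slices nor slice can be used to parallelize the copy; the data is pulled by a single task.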

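6. Buffer and batch size when pulling from remote

When reindexing from remote, the destination cluster uses an on-heap buffer whose maximum size defaults to 100mb.
If individual documents are large, lower size (documents per batch) in source, e.g. "size": 500 or 200, so that each batch fits in the buffer and heap pressure stays low.
The default size is 1000.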
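The trade-off above can be turned into a rough rule of thumb. The sketch below is a heuristic of my own, not a formula from the Elasticsearch documentation: it picks the largest batch whose worst case (every document at the maximum observed size) stays within a fraction of the 100mb buffer. The function name and the 0.5 headroom factor are assumptions.

```python
DEFAULT_BATCH_SIZE = 1000          # Elasticsearch default for source.size
BUFFER_BYTES = 100 * 1024 * 1024   # default maximum of the on-heap buffer (100mb)


def suggest_batch_size(max_doc_bytes: int, headroom: float = 0.5) -> int:
    """Largest batch, capped at the default, whose worst case fits in headroom * buffer."""
    if max_doc_bytes <= 0:
        raise ValueError("max_doc_bytes must be positive")
    budget = int(BUFFER_BYTES * headroom)
    return max(1, min(DEFAULT_BATCH_SIZE, budget // max_doc_bytes))


print(suggest_batch_size(100 * 1024))  # documents of up to 100 KiB
print(suggest_batch_size(10 * 1024))   # documents of up to 10 KiB
```

With documents of up to 100 KiB the sketch suggests 512 per batch, which is in the same range as the 500 used in this note; small documents keep the default of 1000.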

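7. max_docs and conflicts

max_docs limits the migration to N documents (note: which documents are picked is not guaranteed to follow a strict order, but the count is exact).
With conflicts: "proceed", the reindex continues past version conflicts, but it may read more than max_docs documents from the source until max_docs documents have been written successfully.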
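To show where these two options sit in the request, here is a small sketch that assembles a _reindex body. The function reindex_body and its parameter names are hypothetical; the keys it emits (max_docs, conflicts, source.remote, source.index, source.size, dest.index) are the ones discussed in this note.

```python
from typing import Optional


def reindex_body(source_index: str, dest_index: str, remote: dict,
                 max_docs: Optional[int] = None, size: int = 1000,
                 proceed_on_conflict: bool = False) -> dict:
    """Assemble a reindex-from-remote request body."""
    body = {
        "source": {"remote": remote, "index": source_index, "size": size},
        "dest": {"index": dest_index},
    }
    if max_docs is not None:
        body["max_docs"] = max_docs        # top-level, not inside "source"
    if proceed_on_conflict:
        body["conflicts"] = "proceed"      # keep going on version conflicts
    return body


body = reindex_body(
    "search_products_tenant_170", "search_products_tenant_0",
    remote={"host": "http://120.76.41.98:9200"},
    max_docs=10000, size=500, proceed_on_conflict=True,
)
print(body)
```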

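8. Prefer running while the source index is green

The official recommendation is to reindex while the source index status is green; otherwise a node shutdown or crash can cause the reindex to fail.
With wait_for_completion=false, progress can be checked through the Task API; before a retry you may need to delete part of the data in the destination index or set conflicts=proceed.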
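Progress from the Task API can be estimated by comparing the created, updated, and deleted counters in the task status with its total. The helper below is a hypothetical sketch operating on the parsed JSON of GET _tasks/<task_id>; the sample response is abbreviated and made up for illustration.

```python
def reindex_progress(task_response: dict) -> float:
    """Estimate progress as (created + updated + deleted) / total, in the range 0..1."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    if total == 0:
        # No work counted yet (or nothing to do): fall back to the completed flag.
        return 1.0 if task_response.get("completed") else 0.0
    done = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
    return done / total


sample = {
    "completed": False,
    "task": {"status": {"total": 10000, "created": 2500, "updated": 0, "deleted": 0}},
}
print(f"{reindex_progress(sample):.0%}")
```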

9. Timeouts (optional)

source.remote.socket_timeout and source.remote.connect_timeout can be raised; both default to 30s. Increase them for large batches or slow networks.
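The two timeouts live inside the source.remote object, next to the host and credentials. The sketch below builds that object; the function name is hypothetical, and the "2m" value in the usage line is only an example, not a recommendation from the documentation.

```python
def remote_block(host: str, username: str, password: str,
                 socket_timeout: str = "30s", connect_timeout: str = "30s") -> dict:
    """Build the source.remote object, with timeouts given as Elasticsearch time values."""
    return {
        "host": host,
        "username": username,
        "password": password,
        "socket_timeout": socket_timeout,
        "connect_timeout": connect_timeout,
    }


# Example: allow slower reads on a congested link while keeping the default connect timeout.
print(remote_block("http://120.76.41.98:9200", "user", "secret", socket_timeout="2m"))
```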


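Example: sync 10000 documents from remote tenant_170 to local tenant_0

Source: remote search_products_tenant_170 (about 39731 documents)
Destination: local index search_products_tenant_0, syncing only 10000 documents

Step 1: Configure the whitelist on the local ES
In the local ES's elasticsearch.yml, add (or merge into an existing reindex.remote.whitelist):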
reindex.remote.whitelist: "120.76.41.98:9200"


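After saving, restart the local ES (or at least the nodes that will run the reindex).
Step 2: Create the destination index locally
Create the index search_products_tenant_0 with this project's mapping (skip this if it already exists with a matching structure). For example, via the API:
# Local ES (add -u user:pass if needed)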
curl -X PUT 'http://localhost:9200/search_products_tenant_0?pretty' \
  -H 'Content-Type: application/json' \
  -d @mappings/search_products.json


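Or via project code: create_index_if_not_exists(es_client, "search_products_tenant_0", load_mapping()).
Step 3: Run the reindex locally (the request goes to the local ES)
The request below is sent to the local ES (e.g. http://localhost:9200), which then pulls the data from the remote cluster.
# Request goes to the local ES (in ES 9.x, wait_for_completion is a query parameter)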
curl -X POST 'http://localhost:9200/_reindex?wait_for_completion=true&pretty' \
  -H 'Content-Type: application/json' \
  -d '{
  "max_docs": 10000,
  "source": {
    "remote": {
      "host": "http://120.76.41.98:9200",
      "username": "essa",
      "password": "4hOaLaf41y2VuI8y"
    },
    "index": "search_products_tenant_170",
    "size": 500
  },
  "dest": {
    "index": "search_products_tenant_0"
  }
}'


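Notes:


max_docs: 10000: write at most 10000 documents to the destination.
source.remote: address and credentials of the remote ES. The local cluster uses them to authenticate against the remote; over plain http they travel unencrypted, so prefer https where available.
source.index: name of the remote index.
source.size: number of documents pulled from the remote per batch; 500 reduces local memory pressure when documents are large.
dest.index: name of the local destination index.
wait_for_completion=true: wait synchronously for completion; for large data volumes switch to false and check progress with the returned task ID: GET _tasks/<task_id>.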
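For the asynchronous variant mentioned in the last note, a Python counterpart to the curl call might look like the sketch below. It only builds the request and the follow-up Task API URL, without sending anything; the helper names are hypothetical, and the task ID in the usage lines is a made-up sample in the nodeId:number form.

```python
import json
import urllib.request


def build_reindex_request(es_url: str, body: dict,
                          wait_for_completion: bool = True) -> urllib.request.Request:
    """Build POST /_reindex with wait_for_completion passed as a query parameter."""
    flag = "true" if wait_for_completion else "false"
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/_reindex?wait_for_completion={flag}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def task_status_url(es_url: str, reindex_response: dict) -> str:
    """URL of GET _tasks/<task_id> for the response of an asynchronous reindex."""
    return f"{es_url.rstrip('/')}/_tasks/{reindex_response['task']}"


req = build_reindex_request("http://localhost:9200", {"source": {}, "dest": {}}, False)
print(req.get_method(), req.full_url)
print(task_status_url("http://localhost:9200", {"task": "node-1:42"}))
```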

Step 4: Verify the document count
curl -X GET 'http://localhost:9200/search_products_tenant_0/_count?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"query":{"match_all":{}}}'


Expect about 10000 documents (without max_docs the count would match the source index).
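The expectation can be written down as a tiny check against the parsed _count response. Both helpers below are hypothetical, and the sketch assumes the destination index was empty before the reindex started.

```python
from typing import Optional


def expected_dest_count(source_count: int, max_docs: Optional[int]) -> int:
    """Documents expected in an initially empty destination after the reindex."""
    return source_count if max_docs is None else min(source_count, max_docs)


def count_matches(count_response: dict, expected: int) -> bool:
    """Compare the "count" field of a _count response with the expectation."""
    return count_response["count"] == expected


expected = expected_dest_count(source_count=39731, max_docs=10000)
print(expected, count_matches({"count": 10000}, expected))
```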
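One-step script (optional)
The project includes a script that creates the destination index and runs the reindex above automatically (default: 10000 documents, destination search_products_tenant_0):
# Make sure ES_HOST in the local .env points to the local ES (e.g. http://localhost:9200)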
chmod +x scripts/reindex_from_remote_tenant_170_to_0.sh
./scripts/reindex_from_remote_tenant_170_to_0.sh


These can be overridden with environment variables: REMOTE_ES_HOST, REMOTE_ES_USER, REMOTE_ES_PASS, MAX_DOCS, LOCAL_ES_HOST. See the comments in the script for details.
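For readers who want the same override pattern in Python, the sketch below reads those five variables from a mapping. It is not the project's script: load_config is a hypothetical name, the MAX_DOCS default of 10000 follows this note, and the LOCAL_ES_HOST default of http://localhost:9200 is an assumption; the script's own comments remain the authority on its behaviour.

```python
import os
from typing import Mapping


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect reindex settings from environment variables, applying defaults."""
    return {
        "remote_host": env.get("REMOTE_ES_HOST"),   # no default: must be supplied
        "remote_user": env.get("REMOTE_ES_USER"),
        "remote_pass": env.get("REMOTE_ES_PASS"),
        "max_docs": int(env.get("MAX_DOCS", "10000")),
        "local_host": env.get("LOCAL_ES_HOST", "http://localhost:9200"),  # assumed default
    }


cfg = load_config({"REMOTE_ES_HOST": "http://120.76.41.98:9200", "MAX_DOCS": "500"})
print(cfg["max_docs"], cfg["local_host"])
```

Taking the environment as a parameter keeps the function easy to exercise with a plain dict instead of the real process environment.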


Summary


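| Item | Description |
| --- | --- |
| Whitelist | Configure reindex.remote.whitelist in elasticsearch.yml on the destination (local) ES |
| Destination index | Create it locally in advance with the mapping/settings |
| Remote privileges | The source-cluster user needs monitor plus read on the source index |
| Limiting the count | Use max_docs, e.g. 10000 |
| Large batches / large documents | Lower source.size as needed (e.g. 500) |
| Parallelism | Reindex from remote does not support slices |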