How to export the whole elasticsearch index data into a csv file
I'm trying to fetch all the rows in an Elasticsearch index and store them as a CSV file. However, most of the approaches I've tried end in a size-limit error.
curl -k -u username:password -XGET "https://xx.xx.xx.xx:xxxx/foo-index/_search?scroll=10m" \
  -H 'Content-Type: application/json' \
  -d'{ "from": 0, "size": 933963, "query" : { "match_all" : {} }, "track_total_hits": true, "_source": ["foo_id"]}'
The error shown is:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Batch size is too large, size must be less than or equal to: [10000] but was [933963]. Scroll batch sizes cost as much memory as result windows so they are controlled by the [index.max_result_window] index level setting."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"foo-index","node":"k0OUtLDFRye4gIXGKCKLmQ","reason":{"type":"illegal_argument_exception","reason":"Batch size is too large, size must be less than or equal to: [10000] but was [933963]. Scroll batch sizes cost as much memory as result windows so they are controlled by the [index.max_result_window] index level setting."}}]
The problem is that I can't just reduce the size, because I need to fetch everything in the index.
You are getting the exception because Elasticsearch caps the result window at 10k documents (the `index.max_result_window` index-level setting, 10,000 by default).
You can use the search_after API to fetch all the documents: set `size: 10000` and call it repeatedly, each call returning up to 10k documents, until you have pulled all the data out of the index.
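As a sketch of that loop (the names here are illustrative; `search_page` stands in for whatever function actually issues the HTTP request), each call carries the `sort` values of the previous page's last hit as `search_after`:

```python
def iterate_all_hits(search_page, page_size=10000):
    """Yield every hit by repeatedly calling `search_page`, a callable
    that takes the previous page's `search_after` value (None for the
    first page) and returns an Elasticsearch response dict."""
    search_after = None
    while True:
        hits = search_page(search_after)["hits"]["hits"]
        if not hits:
            return  # an empty page means the index is exhausted
        yield from hits
        # the `sort` values of the last hit become the next cursor
        search_after = hits[-1]["sort"]
```

The key point is that the cursor is the last hit's `sort` array, not an offset, so the loop never trips over `index.max_result_window`.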
To use this with a point in time (PIT), first generate a PIT ID with:
POST /my-index-000001/_pit?keep_alive=1m
The API returns the PIT ID:
{
"id": "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA=="
}
You can then use the PIT ID in your query, as shown below:
GET /_search
{
"size": 10000,
"query": {
"match" : {
"user.id" : "elkbee"
}
},
"pit": {
"id": "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
"keep_alive": "1m"
},
"sort": [
{"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type" : "date_nanos" }}
]
}
The query above will return your first 10k documents. The response also carries a new PIT ID; pass it back into the same query, together with the `sort` values of the last hit as `search_after`, to fetch the next batch of 10k documents.
{
"pit_id" : "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
"took" : 17,
"timed_out" : false,
"_shards" : ...,
"hits" : {
"total" : ...,
"max_score" : null,
"hits" : [
...
{
"_index" : "my-index-000001",
"_id" : "FaslK3QBySSL_rrj9zM5",
"_score" : null,
"_source" : ...,
"sort" : [
"2021-05-20T05:30:04.832Z",
4294967298
]
}
]
}
}
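The follow-up request would look like the sketch below: same query and sort, the `pit_id` from the previous response swapped in, and the last hit's `sort` values (taken from the sample response above) added as `search_after`. The placeholder stands for the real ID:

```json
GET /_search
{
  "size": 10000,
  "query": { "match": { "user.id": "elkbee" } },
  "pit": {
    "id": "<pit_id from the previous response>",
    "keep_alive": "1m"
  },
  "sort": [
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type": "date_nanos"}}
  ],
  "search_after": ["2021-05-20T05:30:04.832Z", 4294967298]
}
```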
PS: there is no solution that gives you more than 10k documents in a single API call. You need to use the search_after or scroll API.