Split JSON array into separate files/objects
My JSON is exported from Cassandra in this format:
[
  {
    "correlationId": "2232845a8556cd3219e46ab8",
    "leg": 0,
    "tag": "received",
    "offset": 263128,
    "len": 30,
    "prev": {
      "page": {
        "file": 0,
        "page": 0
      },
      "record": 0
    },
    "data": "HEAD /healthcheck HTTP/1.1\r\n\r\n"
  },
  {
    "correlationId": "2232845a8556cd3219e46ab8",
    "leg": 0,
    "tag": "sent",
    "offset": 262971,
    "len": 157,
    "prev": {
      "page": {
        "file": 10330,
        "page": 6
      },
      "record": 1271
    },
    "data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n"
  }
]
I would like to split it into separate documents:
{
  "correlationId": "2232845a8556cd3219e46ab8",
  "leg": 0,
  "tag": "received",
  "offset": 263128,
  "len": 30,
  "prev": {
    "page": {
      "file": 0,
      "page": 0
    },
    "record": 0
  },
  "data": "HEAD /healthcheck HTTP/1.1\r\n\r\n"
}
and
{
  "correlationId": "2232845a8556cd3219e46ab8",
  "leg": 0,
  "tag": "sent",
  "offset": 262971,
  "len": 157,
  "prev": {
    "page": {
      "file": 10330,
      "page": 6
    },
    "record": 1271
  },
  "data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n"
}
I would like to use jq but haven't found a way to do this.
How can I split it with a document separator?
Thanks, Reddy
If you have an array of 2 objects:
jq '.[0]' input.json > doc1.json && jq '.[1]' input.json > doc2.json
Result:
$ head -n100 doc[12].json
==> doc1.json <==
{
  "correlationId": "2232845a8556cd3219e46ab8",
  "leg": 0,
  "tag": "received",
  "offset": 263128,
  "len": 30,
  "prev": {
    "page": {
      "file": 0,
      "page": 0
    },
    "record": 0
  },
  "data": "HEAD /healthcheck HTTP/1.1\r\n\r\n"
}

==> doc2.json <==
{
  "correlationId": "2232845a8556cd3219e46ab8",
  "leg": 0,
  "tag": "sent",
  "offset": 262971,
  "len": 157,
  "prev": {
    "page": {
      "file": 10330,
      "page": 6
    },
    "record": 1271
  },
  "data": "HTTP/1.1 200 OK\r\nDate: Wed, 14 Feb 2018 12:57:06 GMT\r\nServer: \r\nConnection: close\r\nX-CorrelationID: Id-2232845a8556cd3219e46ab8 0\r\nContent-Type: text/xml\r\n\r\n"
}
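For an array with more than two objects, hard-coding one jq call per index does not scale; a small shell loop along the same lines could generalize it (a sketch, assuming bash and jq, with the doc1.json, doc2.json, ... names chosen for illustration):
# Count the array elements, then extract each one with a separate jq call.
# Note that this re-reads input.json once per document.
n=$(jq 'length' input.json)
for i in $(seq 0 $((n - 1))); do
  jq ".[$i]" input.json > "doc$((i + 1)).json"
done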
You could do this more efficiently with Python (since you can read the whole input once, rather than once per document):
import json

# Read the whole array once, then write each element to its own file.
docs = json.load(open('in.json'))
for ii, doc in enumerate(docs):
    with open('doc{}.json'.format(ii), 'w') as out:
        json.dump(doc, out, indent=2)
With jq, you can split an array into its components using the filter:
.[]
The question then becomes what to do with each component. If you want to direct each component to a separate file, you could (for example) invoke jq with the -c option and pipe the result into awk, which can then distribute the components across different files. See e.g. "Split JSON File Objects Into Multiple Files".
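For example, a minimal sketch of that pipeline (the output names doc1.json, doc2.json, ... are illustrative):
jq -c '.[]' input.json | awk '{ print > ("doc" NR ".json") }'
Here -c makes jq print one object per line, and awk's NR (the current record number) sends each object to its own file.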
Performance considerations
One might expect the overhead of invoking jq+awk to be higher than that of invoking Python, but both jq and awk are lightweight compared to python+json, as these timings suggest (using Python 2.7.10):
time (jq -c .[] input.json | awk '{print > "doc00" NR ".json";}')
user 0m0.005s
sys 0m0.008s
time python split.py
user 0m0.016s
sys 0m0.046s
To split a JSON file containing many records into chunks of a desired size, I simply use:
jq -c '.[0:1000]' mybig.json
This is similar to Python slicing.
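Note that .[0:1000] on its own only extracts the first slice; to write every 1000-element chunk to its own file, you could loop over the offsets (a sketch, assuming bash and that the top-level value is an array; the chunkN.json names are illustrative):
total=$(jq 'length' mybig.json)
step=1000
for ((i = 0; i < total; i += step)); do
  # Each iteration emits one compact slice: .[0:1000], .[1000:2000], ...
  jq -c ".[$i:$((i + step))]" mybig.json > "chunk$((i / step)).json"
done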
See the documentation here: https://stedolan.github.io/jq/manual/
Array/String Slice: .[10:15]
The .[10:15] syntax can be used to return a subarray of an array or substring of a string. The array returned by .[10:15] will be of length 5, containing the elements from index 10 (inclusive) to index 15 (exclusive). Either index may be negative (in which case it counts backwards from the end of the array), or omitted (in which case it refers to the start or end of the array).
One approach is to use jq's streaming option and pipe the result to the split command:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' bigfile.json | split -l $num_of_elements_in_a_file - big_part
The number of lines per output file varies according to the value you supply for num_of_elements_in_a_file.
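For example, with 100 top-level elements (one per line) in each output file:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))' bigfile.json | split -l 100 - big_part
split then writes the pieces to files named big_partaa, big_partab, and so on.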
You can have a look at this answer.
Refer to this page for a discussion of how to use the streaming parser:
https://github.com/stedolan/jq/wiki/FAQ#streaming-json-parser