Is there a way to use jq to split a JSON file by its common keys?
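I have pricing data for a large set of stocks (about 1.1 million rows).
I'm having trouble parsing all of this data in memory, so I'd like to split it into separate files by stock symbol and import the data only when I need it.
From: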
stockprices.json
To:
AAPL.json
ACN.json
...
etc.
stockprices.json currently has this structure:
[{
"date": "2016-03-22 00:00:00",
"symbol": "ACN",
"open": "121.029999",
"close": "121.470001",
"low": "120.720001",
"high": "122.910004",
"volume": "711400.0"
},
{
"date": "2016-03-23 00:00:00",
"symbol": "AAPL",
"open": "121.470001",
"close": "119.379997",
"low": "119.099998",
"high": "121.470001",
"volume": "444200.0"
},
{
"date": "2016-03-24 00:00:00",
"symbol": "AAPL",
"open": "118.889999",
"close": "119.410004",
"low": "117.639999",
"high": "119.440002",
"volume": "534100.0"
},
...{}....]
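I believe jq is the right tool for this job, but I can't wrap my head around it.
How can I take the data above and use jq to split it by the symbol field?
For example, I'd like to end up with: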
AAPL.json:
[{
"date": "2016-03-23 00:00:00",
"symbol": "AAPL",
"open": "121.470001",
"close": "119.379997",
"low": "119.099998",
"high": "121.470001",
"volume": "444200.0"
},
{
"date": "2016-03-24 00:00:00",
"symbol": "AAPL",
"open": "118.889999",
"close": "119.410004",
"low": "117.639999",
"high": "119.440002",
"volume": "534100.0"
}]
and ACN.json:
[{
"date": "2016-03-22 00:00:00",
"symbol": "ACN",
"open": "121.029999",
"close": "121.470001",
"low": "120.720001",
"high": "122.910004",
"volume": "711400.0"
},
{
"date": "2016-03-22 00:00:00",
"symbol": "ACN",
"open": "121.029999",
"close": "121.470001",
"low": "120.720001",
"high": "122.910004",
"volume": "711400.0"
}
]
You can use a small shell loop:
#!/bin/bash
# List each distinct symbol once, then filter the full file for that symbol.
jq -r '[.[].symbol] | unique[]' stockprices.json | while read -r symbol ; do
    jq --arg s "${symbol}" \
        'map(select(.symbol == $s))' \
        stockprices.json > "${symbol}.json"
done
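Here is a single-pass solution that assumes your RAM is large enough. It avoids group_by, because that requires a sort, which is unnecessary here and can be costly in both time and memory.
awk is used to create the output files for efficiency, but it is not essential to the approach.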
split.jq
def aggregate_by(s; f; g):
reduce s as $x (null; .[$x|f] += [$x|g]);
aggregate_by(.[]; .symbol; .)
| keys_unsorted[] as $k
| $k, .[$k]
Invocation with awk:
jq -f split.jq stockprices.json | awk '
  substr($0,1,1) == "\"" {
    if (fn) {close(fn)};
    gsub(/^"|"$/,"",$0); fn=$0 ".json"; next;
  }
  {print >> fn}'
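Since awk is not essential to the approach, the same one-pass aggregation can also be paired with a plain shell loop. The following is only a minimal sketch, not the original answer's method: the function name split_by_symbol and its optional output-directory argument are hypothetical, and it assumes that symbols contain no newline or slash characters.

```shell
#!/bin/bash
# split_by_symbol is a hypothetical helper name, not part of jq.
# Usage: split_by_symbol INPUT.json [OUTPUT_DIR]
split_by_symbol() {
  local input=$1 outdir=${2:-.}
  # -r prints each key without quotes, and tojson keeps each array on a
  # single line, so the stream alternates: symbol line, then its records.
  jq -r '
    def aggregate_by(s; f; g):
      reduce s as $x (null; .[$x|f] += [$x|g]);
    aggregate_by(.[]; .symbol; .)
    | keys_unsorted[] as $k
    | $k, (.[$k] | tojson)
  ' "$input" |
  while IFS= read -r symbol && IFS= read -r content; do
    # Overwrite (not append), so re-running does not duplicate records.
    printf '%s\n' "$content" > "$outdir/$symbol.json"
  done
}
```

Calling `split_by_symbol stockprices.json` would write one compact JSON array per symbol into the current directory. Unlike the awk version, each file is truncated on every run rather than appended to.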
You still need a loop, but it can be done with a single call to jq:
jq -rc 'group_by(.symbol)[] | "\(.[0].symbol)\t\(.)"' stockprices.json |
while IFS=$'\t' read -r symbol content; do
echo "${content}" > "${symbol}.json"
done
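One observable difference between the group_by approach and the reduce-based one is the order of the groups: group_by sorts them by key, while an object built with reduce keeps the order in which symbols first appear. The sketch below illustrates this on a three-row sample; the helper names and the sample data are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; both read a JSON array on stdin.

# Symbols in the order group_by emits its groups (sorted by key).
symbols_via_group_by() {
  jq -c '[group_by(.symbol)[] | .[0].symbol]'
}

# Symbols in the order a reduce-built object first saw them (no sort).
symbols_via_reduce() {
  jq -c 'reduce .[] as $x (null; .[$x.symbol] += [$x]) | keys_unsorted'
}

# Made-up sample: ACN appears first in the input, AAPL second.
sample='[{"symbol":"ACN"},{"symbol":"AAPL"},{"symbol":"AAPL"}]'
printf '%s\n' "$sample" | symbols_via_group_by
printf '%s\n' "$sample" | symbols_via_reduce
```

On this sample the first call should list AAPL before ACN (sorted), and the second should list ACN before AAPL (input order).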