Extract top-level key and contents from large JSON using stream

Question

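One of the processes in our system "extracts" a key and its (object) value into a dedicated file, so that an (unrelated) script can process it later.

A representative subset of the original JSON file looks like this: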

{
  "version" : null,
  "produced" : "2021-01-01T00:00:00+0000",
  "other": "content here",
  "items" : [
    {
      "code" : "AA",
      "name" : "Example 1",
      "prices" : [ "other", "content", "here" ]
    }, 
    {
      "code" : "BB",
      "name" : "Example 2",
      "prices" : [ "other", "content", "here" ]
    }
  ]
}

The current output, given that subset as input, is simply:

[
    {
      "code" : "AA",
      "name" : "Example 1",
      "prices" : [ "other", "content", "here" ]
    }, 
    {
      "code" : "BB",
      "name" : "Example 2",
      "prices" : [ "other", "content", "here" ]
    }, 
    ...
]

Previously, we used jq with a very simple command (which worked well) to extract the entire "items" section:

cat file.json | jq '.items' > file.items.json

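Recently, however, the original JSON file has grown dramatically, and the script now fails with an out-of-memory error. An obvious solution is jq's --stream option, but I am unsure how to translate the command above into a valid filter in jq's streaming syntax.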

cat file.json | jq --stream '...' > file.items.json

Any suggestions on what to use as the filter for this command would be greatly appreciated. Thanks in advance!

Answer 1

You should use the --stream flag together with the fromstream builtin:

jq --stream --null-input '
  fromstream(inputs | select(.[0][0] == "items"))[]
' file.json

[
  {
    "code": "AA",
    "name": "Example 1",
    "prices": [
      "other",
      "content",
      "here"
    ]
  },
  {
    "code": "BB",
    "name": "Example 2",
    "prices": [
      "other",
      "content",
      "here"
    ]
  }
]
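To see why the test `.[0][0] == "items"` picks out the right events, it can help to look at what --stream actually emits. The following is only a sketch: it assumes jq 1.5 or later is installed, and the trimmed-down `sample.json` is a hypothetical stand-in for the real file.

```shell
# Hypothetical trimmed-down sample, written to a temporary directory.
dir=$(mktemp -d)
cat > "$dir/sample.json" <<'EOF'
{ "version": null, "items": [ { "code": "AA" }, { "code": "BB" } ] }
EOF

# --stream turns the document into events of the form [path, leaf];
# a one-element event [path] marks the end of an array or object.
events=$(jq -c --stream '.' "$dir/sample.json")
printf '%s\n' "$events"

rm -rf "$dir"
```

Every event that belongs to the "items" subtree has "items" as the first path component, including the final `[["items"]]` event that closes the top-level object. That closing event is what lets fromstream emit `{"items": [...]}`, from which the trailing `[]` takes the array.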

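The demo is meant to show the syntax, not efficiency or memory consumption (I had to use tostream to stream your original input, because jqplay.org lacks the --stream option).

Note: although it works on the sample data, do not try the shortcut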

jq --stream --null-input 'fromstream(inputs).items' file.json

directly on your large JSON file, because it just "reconstructs the entire input JSON entity, thus defeating the purpose of using --stream".

(as clarified by another contributor)
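On a small sample the selective filter, the shortcut, and the original non-streaming command all print the same thing, so a quick comparison like the one below only confirms the syntax and says nothing about memory use. This is a sketch assuming jq 1.5 or later; `sample.json` is a hypothetical stand-in for the real file.

```shell
# Hypothetical trimmed-down sample, written to a temporary directory.
dir=$(mktemp -d)
cat > "$dir/sample.json" <<'EOF'
{ "other": "content here", "items": [ { "code": "AA" }, { "code": "BB" } ] }
EOF

# Selective: only events under "items" ever reach fromstream.
selective=$(jq -c --stream --null-input \
  'fromstream(inputs | select(.[0][0] == "items"))[]' "$dir/sample.json")

# Shortcut: fromstream rebuilds the whole document before .items is applied.
shortcut=$(jq -c --stream --null-input \
  'fromstream(inputs).items' "$dir/sample.json")

# Original non-streaming command, for reference.
plain=$(jq -c '.items' "$dir/sample.json")

printf '%s\n%s\n%s\n' "$selective" "$shortcut" "$plain"

rm -rf "$dir"
```

The difference lies in what is held in memory: the selective form discards the events for "version", "produced", "other" and so on before reconstruction, while the shortcut keeps them all.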

Answer 2

If a stream of {code, name, prices} objects is acceptable, then you could go with:

< input.json jq --stream -n '
   fromstream( 2 | truncate_stream(inputs | select(.[0][0] == "items")) )'

This will have minimal memory requirements, which may or may not be significant, depending on the value of .items|length.
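As a rough illustration of what this variant prints, the sketch below runs it on a hypothetical two-item `sample.json`. It assumes jq 1.6 or later, since truncate_stream in jq 1.5 is known to mishandle depths greater than 1.

```shell
# Hypothetical trimmed-down sample, written to a temporary directory.
dir=$(mktemp -d)
cat > "$dir/sample.json" <<'EOF'
{ "version": null,
  "items": [ { "code": "AA", "prices": ["x"] },
             { "code": "BB", "prices": ["y"] } ] }
EOF

# 2 | truncate_stream(...) strips the leading ["items", <index>] from each path
# and drops shorter events, so fromstream emits each item as soon as it is complete.
objects=$(jq -c --stream -n \
  'fromstream(2 | truncate_stream(inputs | select(.[0][0] == "items")))' \
  "$dir/sample.json")
printf '%s\n' "$objects"

rm -rf "$dir"
```

The result is one object per line rather than a single enclosing array, so the downstream script would need to accept a sequence of JSON values instead of one JSON document.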
