jq streaming of large JSON files to get only objects whose properties have a specific value

Question

I have some fairly large JSON files (~500 MB to 4 GB compressed) that I can't load into memory for manipulation, so I'm using the --stream option in jq.

For example, my JSON might look like this, only much bigger:

[{
  "id": "0001",
  "type": "donut",
  "name": "Cake",
  "ppu": 0.55,
  "batters": {
    "batter": [{
      "id": "1001",
      "type": "Regular"
    }, {
      "id": "1002",
      "type": "Chocolate"
    }, {
      "id": "1003",
      "type": "Blueberry"
    }, {
      "id": "1004",
      "type": "Devil's Food"
    }]
  },
  "topping": [{
    "id": "5001",
    "type": "None"
  }, {
    "id": "5002",
    "type": "Glazed"
  }, {
    "id": "5005",
    "type": "Sugar"
  }, {
    "id": "5007",
    "type": "Powdered Sugar"
  }, {
    "id": "5006",
    "type": "Chocolate with Sprinkles"
  }, {
    "id": "5003",
    "type": "Chocolate"
  }, {
    "id": "5004",
    "type": "Maple"
  }]
}, {
  "id": "0002",
  "type": "donut",
  "name": "Raised",
  "ppu": 0.55,
  "batters": {
    "batter": [{
      "id": "1001",
      "type": "Regular"
    }]
  },
  "topping": [{
    "id": "5001",
    "type": "None"
  }, {
    "id": "5002",
    "type": "Glazed"
  }, {
    "id": "5005",
    "type": "Sugar"
  }, {
    "id": "5003",
    "type": "Chocolate"
  }, {
    "id": "5004",
    "type": "Maple"
  }]
}, {
  "id": "0003",
  "type": "donut",
  "name": "Old Fashioned",
  "ppu": 0.55,
  "batters": {
    "batter": [{
      "id": "1001",
      "type": "Regular"
    }, {
      "id": "1002",
      "type": "Chocolate"
    }]
  },
  "topping": [{
    "id": "5001",
    "type": "None"
  }, {
    "id": "5002",
    "type": "Glazed"
  }, {
    "id": "5003",
    "type": "Chocolate"
  }, {
    "id": "5004",
    "type": "Maple"
  }]
}]

If this were a file I could hold in memory, and I wanted to select only the objects that have a batter of type "Chocolate", I could use:

cat sample.json | jq '.[] | select(.batters.batter[].type == "Chocolate")'

And I would get back only the full objects with IDs "0001" and "0003".

But I know streaming is different.

I've been reading the jq documentation on streaming here and here, but I'm still confused, because the examples don't really show real-world JSON problems.

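Namely, is it possible to select whole objects after streaming through their paths and identifying a notable event, or in this case a property value that matches a certain string?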

I know I can use:

cat sample.json | jq --stream 'select(.[0][1] == "batters" and .[0][2] == "batter" and .[0][4] == "type") | .[1]'

to give me all the batter types. But is there a way to say: "If it's Chocolate, grab the object this leaf is a part of"?

Answer 1

Command:

$ jq -cn --stream 'fromstream(1|truncate_stream(inputs))' array_of_objects.json | 
  jq 'select(.batters.batter[].type == "Chocolate") | .id'

Output:

"0001"
"0003"

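The first invocation of jq converts the array of objects into a stream of objects. The second is based on your invocation and can be tailored further to your needs.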
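To make that first step concrete, here is a minimal sketch of what the --stream events look like and how `1 | truncate_stream(inputs)` followed by `fromstream` turns them back into one object per array element. The two-element input and the shell variable names are made up for illustration.

```shell
# Made-up two-element stand-in for the big array.
tiny='[{"id":"0001"},{"id":"0002"}]'

# Raw events: [path, leaf] pairs, plus closing events that carry only a path.
events=$(printf '%s' "$tiny" | jq -c --stream '.')

# Drop the leading array index from every path; the event that closes
# the outer array has nothing left after truncation and is discarded.
truncated=$(printf '%s' "$tiny" | jq -cn --stream '1 | truncate_stream(inputs)')

# Reassemble each element as soon as its closing event arrives.
objects=$(printf '%s' "$tiny" | jq -cn --stream 'fromstream(1 | truncate_stream(inputs))')

printf '%s\n\n%s\n\n%s\n' "$events" "$truncated" "$objects"
```

The last block printed should be the two objects, one per line, which is the form the second jq invocation consumes.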

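Of course the two invocations can (and perhaps should) be combined into one, but you might want to use the first invocation to save the big file as a file containing the stream of objects.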
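A combined form might look like the sketch below. It assumes the same top-level array layout as in the question; `sample_small.json` is a hypothetical, trimmed-down stand-in for the real file, and the `any(...)` form of the test is used inside `select`.

```shell
# Hypothetical small stand-in for the large file.
cat > sample_small.json <<'EOF'
[
  {"id": "0001", "batters": {"batter": [{"id": "1001", "type": "Regular"}, {"id": "1002", "type": "Chocolate"}]}},
  {"id": "0002", "batters": {"batter": [{"id": "1001", "type": "Regular"}]}},
  {"id": "0003", "batters": {"batter": [{"id": "1001", "type": "Regular"}, {"id": "1002", "type": "Chocolate"}]}}
]
EOF

# One jq process: rebuild each top-level array element from the event
# stream, then filter it before the next element is assembled.
ids=$(jq -rn --stream '
  fromstream(1 | truncate_stream(inputs))
  | select(any(.batters.batter[]; .type == "Chocolate"))
  | .id
' sample_small.json)

printf '%s\n' "$ids"
```

With this layout only one reconstructed object needs to be held at a time, so memory use should track the largest single element rather than the whole array.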

By the way, it would probably be better to use the following select:

select( any(.batters.batter[]; .type == "Chocolate") )
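One reason the `any` form may be preferable: `.batters.batter[].type == "Chocolate"` is a generator, so `select` emits the object once for every matching batter, whereas `any` yields a single boolean. The sketch below shows the difference on a made-up object with two Chocolate batters (the ID "0009" is hypothetical).

```shell
# Made-up object with two matching batters.
doc='{"id":"0009","batters":{"batter":[{"type":"Chocolate"},{"type":"Chocolate"}]}}'

# Generator form: one comparison result per batter, so the object is
# emitted once for each batter that matches.
dup=$(printf '%s' "$doc" | jq -r 'select(.batters.batter[].type == "Chocolate") | .id')

# any/2 form: a single true/false per object, so at most one output.
once=$(printf '%s' "$doc" | jq -r 'select(any(.batters.batter[]; .type == "Chocolate")) | .id')

printf 'generator form:\n%s\nany form:\n%s\n' "$dup" "$once"
```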

Answer 2

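Here is another approach. Start with a streaming filter, filter1.jq, that extracts the record number and the minimum set of attributes you need to process. E.g.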

  select(length==2)
| . as [$p, $v]
| {r:$p[0]}
| if   $p[1] == "id"                           then .id   = $v
  elif $p[1] == "batters" and $p[-1] == "type" then .type = $v
  else  empty
  end

Running this with

jq -M -c --stream -f filter1.jq bigdata.json

produces values like

{"r":0,"id":"0001"}
{"r":0,"type":"Regular"}
{"r":0,"type":"Chocolate"}
{"r":0,"type":"Blueberry"}
{"r":0,"type":"Devil's Food"}
{"r":1,"id":"0002"}
{"r":1,"type":"Regular"}
{"r":2,"id":"0003"}
{"r":2,"type":"Regular"}
{"r":2,"type":"Chocolate"}

Now pipe this into a second filter, filter2.jq, which does the processing you want on those attributes for each record

foreach .[] as $i (
     {c: null, r:null, id:null, type:null}

   ; .c = $i
   | if .r != .c.r then .id=null | .type=null | .r=.c.r else . end   # control break
   | .id   = if .c.id == null   then .id   else .c.id   end
   | .type = if .c.type == null then .type else .c.type end

   ; if ([.id, .type] | contains([null])) then empty else . end
)
| select(.type == "Chocolate").id

Use it with a command like

jq -M -c --stream -f filter1.jq bigdata.json | jq -M -s -r -f filter2.jq

which produces

0001
0003

filter1.jq and filter2.jq do a little more than you need for this specific problem, but they can be generalized easily.
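As one possible variation (a sketch, not part of the answer above), the second stage can read its records with `-n` and `foreach inputs` instead of slurping them with `-s`, so the reduced records are never collected into a single array. It assumes, as in the sample output, that all records for one `r` arrive together and that `id` comes before the batter types; the `records` variable is a hand-written sample of the first stage's output.

```shell
# Hand-written sample of the reduced records from the first stage.
records='{"r":0,"id":"0001"}
{"r":0,"type":"Regular"}
{"r":0,"type":"Chocolate"}
{"r":1,"id":"0002"}
{"r":1,"type":"Regular"}
{"r":2,"id":"0003"}
{"r":2,"type":"Chocolate"}'

ids=$(printf '%s\n' "$records" | jq -rn '
  foreach inputs as $i (
    {r: null, id: null, type: null};
    # control break: forget the previous record when r changes
    (if .r != $i.r then {r: $i.r, id: null, type: null} else . end)
    | .id   = ($i.id   // .id)
    | .type = ($i.type // .type);
    # emit the id whenever the current batter type matches
    select(.id != null and .type == "Chocolate") | .id
  )
')

printf '%s\n' "$ids"
```

Like the original filter2.jq, this emits an ID once per matching batter, so an object with several Chocolate batters would appear more than once.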
