Join and filter JSON files using jq

Question

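I am working with the Yelp JSON corpus using jq and am desperately trying to get some join and filter tasks done. business.json contains categories and business_id, from which I can get the IDs of all restaurants; I want to use those IDs to filter review.json and extract all reviews of restaurants.

This sounds trivial in an RDBMS, but I would like to learn the jq way.

Can anyone help?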

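Things I have tried:

Extracted the business IDs and saved them in id.txt. But in jq.
Looped over all the IDs in a script and ran jq --arg id $line '. | select( .business_id | contains($id))' reviews.json
Joining the two JSON files may be possible, but I am reluctant to do that because of the size of the files (~1G).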

Edit based on comments:

Simplified sample input: business.json

{
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
  "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018",
  "categories": ["Restaurant"]
}

reviews.json

{
  "date": "2012-05-15",
  "text": "Got a letter in the mail last week that said Dr. Goldberg is moving to Arizona to take a new position there in June. He will be missed very much. \n\nI think finding a new doctor in NYC that you actually like might almost be as awful as trying to find a date!",
  "type": "review",
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}

Best attempt so far: I was able to match documents using multiple IDs, e.g.

jq '. | select( .business_id | contains("LRKJF43s9-3jG9Lgx4zODg", "uGykseHzyS5xAMWoN6YUqA"))' reviews.json

But I was not able to replace the query string with a variable:

jq --arg t vcNAWiLM4dR7D2nwwJ7nCA '. | select( .business_id | contains(env.t))' reviews.json

does not work.

Answer 1

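From your description, it is not clear to me whether each business and each review is a top-level object. However, it appears that you can arrange for both the businesses and the reviews to be presented as streams, so in the following I will assume: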

(a) both reviews.json and businesses.json are files of JSON objects;
(b) it is acceptable to read all the reviews into memory.

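(If instead it is only acceptable to read the businesses into memory, the following can easily be modified.)

The logic is: read in all the reviews, and then for each restaurant, extract the reviews for that restaurant.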

select(.categories | index("Restaurant"))
| .business_id as $business_id
| $reviews[]
| select( .type == "review" and .business_id == $business_id)

Invocation (with the filter above saved as yelp.jq):

$ jq --slurpfile reviews reviews.json -f yelp.jq businesses.json

Note that the --slurpfile option is not available in jq 1.4.

(If reviews.json is already an array of JSON objects, then you can use --argfile reviews reviews.json instead, in which case jq 1.5 is not needed.)
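
For the opposite case mentioned earlier, where only the businesses fit in memory, one possible modification is sketched below. It is a sketch under assumptions, not a tested solution for the full corpus: the two sample files are made up for illustration (the second business and its "Doctors" category are hypothetical), the name $is_restaurant is my own, and both files are assumed to be streams of top-level JSON objects. The idea is to build a lookup of restaurant IDs once and then stream the reviews with -n and inputs.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory with tiny sample files modelled on the question.
workdir=$(mktemp -d)
cat > "$workdir/businesses.json" <<'EOF'
{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "categories": ["Restaurant"]}
{"business_id": "LRKJF43s9-3jG9Lgx4zODg", "categories": ["Doctors"]}
EOF
cat > "$workdir/reviews.json" <<'EOF'
{"date": "2012-05-15", "type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"}
{"date": "2012-06-01", "type": "review", "business_id": "LRKJF43s9-3jG9Lgx4zODg"}
EOF

# -n stops jq from reading the reviews up front; "inputs" then yields them one by one.
result=$(jq -n -c --slurpfile businesses "$workdir/businesses.json" '
  # Build a set-like object {"<business_id>": true, ...} once, restaurants only.
  ($businesses
   | map(select(.categories | index("Restaurant"))
         | {key: .business_id, value: true})
   | from_entries) as $is_restaurant
  # Keep each review whose business_id is in the lookup.
  | inputs
  | select(.type == "review" and $is_restaurant[.business_id])
' "$workdir/reviews.json")

printf '%s\n' "$result"
rm -r "$workdir"
```

With this arrangement memory use should be driven by the number of businesses rather than by the reviews. Like --slurpfile, inputs requires jq 1.5 or later.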
