How do I use python or jq to merge multiple JSON files with uniform columns into one CSV?

Question

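I am working to make primary-source American law more compliant with the Americans with Disabilities Act (ADA), and more usable for my research.

The technical question is in the subject line.

The source file I am working with can be found here: https://www.courtlistener.com/api/bulk-data/opinions/wash.tar.gz

A sample JSON file from that archive (which contains 52,565 judicial opinions of the Washington State Supreme Court),

987095.json, is an example containing the "elements" listed below. I want to carry the element names over as column names, and to add a new ID that counts the records rather than reuse the source database's ID column.

I find it easiest to view the JSON data with Dadroit Viewer 1.5 Build 1935.appimage.

I extracted the following data. At the bottom, opinions_cited is a [] array that holds entries 0 - 28 in this file; I expect each JSON file to have a different number of opinions_cited, anywhere from 0 to 1000 or more. I do not know the largest number of cases cited in any one judicial opinion.

Extracted from the JSON: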

resource_url 
id 
absolute_url 
cluster 
author 
joined_by 
author_str
per_curiam 
date_created 
date_modified 
type 
sha1 
page_count
download_url 
local_path 
plain_text 
html 
html_lawbox 
html_columbia
html_with_citations 
extracted_by_ocr
   
opinions_cited  (sub nodes) [0] - [28]

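I need help merging the directory. I have tried various merge solutions found on Google and Stack Overflow; none of them proved workable within my CPU, memory, or time limits. They either ran for hours before failing or ended up producing nothing.

How do I create one large CSV for each folder of JSON files (each folder holds tens of thousands; 54k in this case), with one JSON file per CSV row and the JSON element names as the column names?

The Stack Overflow script I used (the code is shown here; I posted the error in a comment):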

import glob
import json

file_names = glob.glob('*.json')

json_list = []

for curr_f_name in file_names:
    with open(curr_f_name) as curr_f_obj:
        json_list.append(json.load(curr_f_obj))

with open('json_merge_out.json', 'w') as out_file:
    json.dump(json_list, out_file, indent=4)

The attempt above produced the following error:

/Wash/json# ./json.merge.python.1.py
./json.merge.python.1.py: line 4: syntax error near unexpected token `('
./json.merge.python.1.py: line 4: `file_names = glob.glob('*.json')'

Answer 1

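Here is a quick solution using jq. It requires hardly any memory, but there is one general caveat and one potential complication.

The caveat is the assumption that the "joined_by" array can be flattened into a single string, using ";" as the separator.

The complication is that some CSV parsers require a "rectangular" structure. In the first possible solution below, the generated header only takes into account the "opinions_cited" array of the first file presented. If later JSON files contain more "opinions_cited" entries, the rows will still be correct, but the header will not account for the extra columns.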
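Whether a given output actually is "rectangular" can be checked after the fact. The following is a minimal Python sketch, not part of the jq solution; the function names `field_count_histogram` and `is_rectangular` are made up for illustration. It raises the `csv` module's field size limit because the text and HTML fields of an opinion may well exceed the default limit.

```python
import csv
from collections import Counter

# Opinion text/HTML cells can be larger than the csv module's default
# 131072-character field limit, so raise it (value chosen to fit a C long).
csv.field_size_limit(2**31 - 1)


def field_count_histogram(csv_path):
    """Map 'number of fields in a row' to 'number of rows with that width'."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):  # streams row by row, low memory use
            counts[len(row)] += 1
    return counts


def is_rectangular(csv_path):
    """True when every row, header included, has the same number of fields."""
    return len(field_count_histogram(csv_path)) <= 1
```

If `is_rectangular("wash.csv")` returns False, the histogram shows how many distinct row widths there are, which helps in deciding whether the two-pass approach or a post-processing fix-up is worth the effort.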

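If the header is a problem, you may wish to consider a two-pass solution, which still avoids memory issues. As shown in Part 2 of this answer, the first pass determines the maximum array length, and the second pass uses that value to produce the appropriate headers.

You might also wish to consider post-processing "fix-ups".

For testing, I suggest using a truncated version of topKeys, such as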

def topKeys: {resource_url, opinions_cited};

Part 1 - one-pass approach

The following was tested by invoking jq in the directory containing all the .json files:

jq -Mnrf combine.jq *.json > wash.csv

combine.jq

def topKeys: {
  resource_url,
  id,
  absolute_url,
  cluster,
  author,
  joined_by,
  author_str,
  per_curiam,
  date_created,
  date_modified,
  type,
  sha1,
  page_count,
  download_url,
  local_path,
  plain_text,
  html,
  html_lawbox,
  html_columbia,
  html_with_citations,
  extracted_by_ocr
}
  | .joined_by |= (if type == "array" then join("; ") else . end)
;

def header:
 (topKeys|keys_unsorted) + [range(1; 1 + (.opinions_cited | length)) ];

def row:
  [topKeys[]] + .opinions_cited;

# s should be a stream of arrays
# For each item in the stream, a counter (starting at 1) is inserted before the other items
def insert_counter(s): foreach s as $x (0; .+1; [.] + $x);

# Read the first file for the headers
input
| (["record"] + header),
   insert_counter( (., inputs) | row )
| @csv
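Since the question also asks about Python, here is a rough Python counterpart of the one-pass idea, offered as a sketch rather than a tested equivalent of combine.jq. The names `TOP_KEYS`, `make_row` and `merge_one_pass` are illustrative. It assumes each file holds a single JSON object, and it differs from jq's @csv in small ways (for example, booleans are written as True/False). Like the jq version, it reads one file at a time and sizes the header from the first file only.

```python
import csv
import json

# Column names copied from the element list in the question.
TOP_KEYS = [
    "resource_url", "id", "absolute_url", "cluster", "author", "joined_by",
    "author_str", "per_curiam", "date_created", "date_modified", "type",
    "sha1", "page_count", "download_url", "local_path", "plain_text", "html",
    "html_lawbox", "html_columbia", "html_with_citations", "extracted_by_ocr",
]


def make_row(record):
    """Flatten one opinion: top-level values, then one cell per cited opinion."""
    cells = []
    for key in TOP_KEYS:
        value = record.get(key)  # missing keys become empty cells
        if isinstance(value, list):  # e.g. joined_by
            value = "; ".join(str(v) for v in value)
        cells.append(value)
    return cells + list(record.get("opinions_cited") or [])


def merge_one_pass(paths, out_path):
    """Write one CSV row per JSON file, prefixed with a counter starting at 1."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for counter, path in enumerate(paths, start=1):
            with open(path, encoding="utf-8") as f:
                record = json.load(f)  # only one file is in memory at a time
            if counter == 1:
                # Header is sized from the first file's opinions_cited only.
                n = len(record.get("opinions_cited") or [])
                writer.writerow(["record"] + TOP_KEYS + list(range(1, n + 1)))
            writer.writerow([counter] + make_row(record))
```

It could be called as `merge_one_pass(sorted(glob.glob("*.json")), "wash.csv")` after `import glob`. As an aside, the error quoted in the question has the form of a shell syntax error, which suggests the script was run by the shell rather than by Python; invoking it as `python3 script.py` should avoid that particular failure.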

Part 2 - two-pass solution

(a) Change the definition of header above to:

def header:
 (topKeys|keys_unsorted) + [range(1; 1 + $n) ];

(b) Use or adapt the following script:

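#!/bin/bash

# Pass 1: -n together with `inputs` reduces over all files, so one maximum is printed
n=$(jq -n 'def max(s): reduce s as $x (null; if . == null or $x > . then $x else . end); max(inputs | .opinions_cited | length)' *.json)

# Pass 2: hand the maximum to combine.jq as $n
jq -Mnr --argjson n "$n" -f combine.jq *.json > wash.csv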
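The same two-pass idea can be sketched in Python. This is an illustration under assumptions, not a port of the script above: `TOP_KEYS` is deliberately truncated (in the spirit of the truncated topKeys suggested for testing), the names `cited` and `merge_two_pass` are invented here, and, unlike the jq version, short rows are padded with empty cells so that the result is rectangular.

```python
import csv
import json

# Truncated column list for brevity; extend it with the remaining element names.
TOP_KEYS = ["resource_url", "id", "joined_by"]


def cited(path):
    """Load one JSON file and return (record, its opinions_cited as a list)."""
    with open(path, encoding="utf-8") as f:
        record = json.load(f)
    return record, list(record.get("opinions_cited") or [])


def merge_two_pass(paths, out_path):
    """Write a rectangular CSV and return the maximum opinions_cited length."""
    paths = list(paths)
    # Pass 1: only the running maximum is kept, never the records themselves.
    n = max((len(cited(p)[1]) for p in paths), default=0)

    # Pass 2: header sized by n, every row padded to the same width.
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["record"] + TOP_KEYS + list(range(1, n + 1)))
        for counter, path in enumerate(paths, start=1):
            record, citations = cited(path)
            cells = []
            for key in TOP_KEYS:
                value = record.get(key)
                if isinstance(value, list):  # e.g. joined_by
                    value = "; ".join(map(str, value))
                cells.append(value)
            padding = [""] * (n - len(citations))
            writer.writerow([counter] + cells + citations + padding)
    return n
```

Every file is parsed twice, which trades running time for a memory footprint of roughly one record at a time.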

For the record...

There are many post-processing options. One is to use csvq, for example:

csvq --allow-uneven-fields -f CSV 'select *' < wash.csv | sponge wash.csv
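If csvq or sponge is not available, a similar fix-up can be approximated with Python's standard library. The sketch below is an assumption-laden alternative, not what csvq does internally: the hypothetical `pad_csv_in_place` pads short rows (the header included, with empty names) to the widest row, writes to a temporary file in the same directory, and then replaces the original.

```python
import csv
import os
import tempfile

# Opinion text/HTML cells can exceed the csv module's default field limit.
csv.field_size_limit(2**31 - 1)


def pad_csv_in_place(csv_path):
    """Pad short rows with empty fields so every row has the same width."""
    # First read: find the widest row without holding the file in memory.
    with open(csv_path, newline="", encoding="utf-8") as f:
        width = max((len(row) for row in csv.reader(f)), default=0)

    # Second read: stream padded rows into a temp file next to the original.
    directory = os.path.dirname(os.path.abspath(csv_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as out, \
            open(csv_path, newline="", encoding="utf-8") as f:
        writer = csv.writer(out)
        for row in csv.reader(f):
            writer.writerow(row + [""] * (width - len(row)))

    os.replace(tmp_path, csv_path)  # swap in the padded file
    return width
```

Calling `pad_csv_in_place("wash.csv")` returns the final number of columns; the padded header cells are left blank, so they may still need proper names.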


python

csv

merge

json

jq
