How do I flatten a large, complex, deeply nested JSON file into multiple CSV files with a linking identifier

Question

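I have a complex JSON file (~8GB) containing publicly available data for businesses. We have decided to split the file up into multiple CSV files (or tabs in a .xlsx), so clients can easily consume the data. These files will be linked by the NZBN column/key.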

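I'm using R and jsonlite to read in a small sample (before scaling up to the full file). I'm guessing I need some way to specify which keys/columns go in each file (i.e., the first file will have headers: australianBusinessNumber, australianCompanyNumber, australianServiceAddress; the second file will have headers: annualReturnFilingMonth, annualReturnLastFiled, countryOfOrigin...)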

Here's a sample of two businesses/entities (I have also removed some of the data, so ignore the actual values): test file

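I've read almost every similar question on Stack Overflow, and none of them has given me any luck. I've tried variations of purrr, the *apply commands, custom flattening functions, and jqr (an R version of 'jq'; it looks promising, but I can't seem to get it to run).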

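Here's an attempt at creating my separate files, but I'm unsure how to include the linking identifier (NZBN), and I keep running into further nested lists (I'm unsure how many levels of nesting there are):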

bulk <- jsonlite::fromJSON("bd_test.json")

coreEntity <- data.frame(bulk$companies)
coreEntity <- coreEntity[, !sapply(coreEntity, is.list)]  # keep only the non-list (atomic) columns

company <- bulk$companies$entity$company
company <- purrr::reduce(company, dplyr::bind_rows)

shareholding <- company$shareholding
shareholding <- purrr::reduce(shareholding, dplyr::bind_rows)

shareAllocation <- shareholding$shareAllocation
shareAllocation <- purrr::reduce(shareAllocation, dplyr::bind_rows)

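I'm unsure whether it's easier to split the files during the flattening/wrangling process, or to completely flatten the whole file so that I have just one row per business/entity (and then gather the columns as required). My only concern is that I need to scale this to ~1.3 million nodes (an 8GB JSON file).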

Ideally, I would want the csv files split every time there is a new collection, with the values in the collection becoming the columns of the new csv/tab.

Any help or tips would be much appreciated.

-------- UPDATE ------

Updated, as my question was a little vague. I think all I need is some code to produce one of the csv's/tabs, and I can then replicate it for the other collections.

Say, for example, I wanted to create a csv of the following elements:

entityName (unique linking identifier)
nzbn (unique linking identifier)
emailAddress__uniqueIdentifier
emailAddress__emailAddress
emailAddress__emailPurpose
emailAddress__emailPurposeDescription
emailAddress__startDate

How would I go about this?

Answer 1

i'm unsure how many levels of nesting there are

This will provide the answer quite efficiently:

jq '
  def max(s): reduce s as $s (null; 
    if . == null then $s elif $s > . then $s else . end);
   max(paths|length)' input.json

(With the test file, the answer is 14.)
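For anyone who wants to cross-check that number outside jq, here is a minimal Python sketch of the same idea. The function name max_path_length is my own, and it assumes the document has already been parsed into dicts and lists (e.g. with json.load), which is only practical for a sample, not for the full 8GB file.

```python
def max_path_length(node):
    """Length of the longest path from the root to any nested value.

    Roughly mirrors jq's max(paths|length), except that a scalar or an
    empty container at the root gives 0 here (jq would give null).
    """
    if isinstance(node, dict):
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return 0  # a scalar has no children, so it adds no further path steps
    # Each child is one step further from the root than its parent.
    return max((1 + max_path_length(child) for child in children), default=0)


if __name__ == "__main__":
    # Hypothetical toy document, not the real data.
    sample = {"a": [{"b": 1}]}
    print(max_path_length(sample))
```

The recursion depth equals the nesting depth, so a document with around 14 levels is nowhere near Python's recursion limit.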

To get an overall view (schema) of the data, you could run:

 jq 'include "schema"; schema' input.json

where schema.jq is available at this gist. This will produce a structural schema.

"Say for example, I wanted to create a csv of the following elements:"

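Here is a jq solution, apart from the headers (run jq with the -r option so that @csv emits raw text rather than JSON strings):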

.companies.entity[]
| [.entityName, .nzbn]
  + (.emailAddress[] | [.uniqueIdentifier, .emailAddress, .emailPurpose, .emailPurposeDescription, .startDate])
| @csv
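The same table, this time with the header row from the question, could also be prototyped in Python. This is only a sketch: it assumes the layout implied by the filter above (.companies.entity is an array of entity objects), and the names EMAIL_FIELDS, email_rows and email_csv are mine, not part of any library.

```python
import csv
import io

# Field names taken from the question's desired column list.
EMAIL_FIELDS = ["uniqueIdentifier", "emailAddress", "emailPurpose",
                "emailPurposeDescription", "startDate"]


def email_rows(data):
    """Yield one row per email address, prefixed with the linking identifiers."""
    # Assumption: data["companies"]["entity"] is a list of entity objects.
    for entity in data["companies"]["entity"]:
        ids = [entity.get("entityName"), entity.get("nzbn")]
        # Entities without an emailAddress collection simply produce no rows.
        for email in entity.get("emailAddress") or []:
            yield ids + [email.get(field) for field in EMAIL_FIELDS]


def email_csv(data):
    """Return the email table as CSV text, header included."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["entityName", "nzbn"]
                    + ["emailAddress__" + field for field in EMAIL_FIELDS])
    writer.writerows(email_rows(data))
    return buffer.getvalue()
```

Missing fields come out as empty cells, because csv.writer writes None as an empty string. Repeating entityName and nzbn on every row is what lets the separate files be joined back together later.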

Shareholdings

The shareholding data is complex, so in the following I have used the to_table function defined elsewhere on this page.

The sample data does not include a "company name" field, so in the following I have added a 0-based "company index" field:

  .companies.entity[]
  | [.entityName, .nzbn] as $ix
  | .company
  | range(0;length) as $cix
  | .[$cix]
  | $ix + [$cix] + (.shareholding[] | to_table(false))

jqr

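The above solutions use the standalone jq executable, but all going well, it should be trivial to use the same filters with jqr. To use jq's include, it might be simplest to specify the path explicitly, for example: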

include "schema" {search: "~/.jq"};

Answer 2

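If the input JSON is sufficiently regular, you might find the following flattening function helpful, especially as it can emit a header, in the form of an array of strings, based on the "paths" to the leaf elements of the input, which can be arbitrarily nested: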

# to_table produces a flat array.
# If hdr == true, then ONLY emit a header line (in prettified form, i.e. as an array of strings);
# if hdr is an array, it should be the prettified form and is used to check consistency.
def to_table(hdr):
  def prettify: map( (map(tostring)|join(":") ));
  def composite: type == "object" or type == "array";

  def check:
     select(hdr|type == "array") 
     | if prettify == hdr then empty
       else error("expected head is \(hdr) but imputed header is \(.)")
       end ;

  . as $in
  | [paths(composite|not)]           # the paths in array-of-array form
  | if hdr==true then prettify
    else check, map(. as $p | $in | getpath($p))
    end;

For example, to produce the desired table (without headers) for .emailAddress, one could write:

.companies.entity[]
| [.entityName, .nzbn] as $ix
| $ix + (.emailAddress[] | to_table(false))
| @tsv

(Adding the headers and checking for consistency are left as an exercise for now, but are dealt with below.)
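If you want to experiment with the flattening idea outside jq, the following Python sketch is a rough analogue of to_table. It is not a port of the jq code: the names leaf_items and to_table are my own, and it returns the header and the row together instead of switching on a hdr argument.

```python
def leaf_items(node, path=()):
    """Yield (path, value) for every leaf, i.e. every value that is not a dict or list."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_items(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaf_items(value, path + (index,))
    else:
        yield path, node


def to_table(node, header=None):
    """Flatten node into (header, values).

    The header joins each leaf path with ":" (as prettify does in the jq
    version). If an expected header is passed in, a mismatch raises
    ValueError, which plays the role of the jq consistency check.
    """
    items = list(leaf_items(node))
    imputed = [":".join(str(step) for step in path) for path, _ in items]
    if header is not None and imputed != header:
        raise ValueError(f"expected header is {header} but imputed header is {imputed}")
    return imputed, [value for _, value in items]
```

Calling to_table on the first record of a collection gives a header that can then be passed back in for the remaining records, so that irregular records fail loudly instead of silently shifting columns.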

Generating multiple files

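More interestingly, you could select the level you want and automatically produce multiple tables. One way to partition the output efficiently into separate files is to use awk. For example, you could pipe the output obtained using this jq filter: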

["entityName", "nzbn"] as $common
| .companies.entity[]
| [.entityName, .nzbn] as $ix
| (to_entries[] | select(.value | type == "array") | .key) as $key
| ($ix + [$key] | join("-")) as $filename
| (.[$key][0]|to_table(true)) as $header

# First emit the line giving all the headers:
| $filename, ($common + $header | @tsv),
# Then emit the rows of the table:
  (.[$key][]
   | ($filename,  ($ix + to_table(false) | @tsv)))

to

awk -F'\t' 'fn {print >> fn; fn=0; next} {fn=$1".tsv"}'

This will produce the headers in each file; if you want consistency checking, change to_table(false) to to_table($header).
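As a variation on the pipeline above, here is a Python sketch that writes one CSV per collection (rather than one file per entity and collection), with entityName and nzbn repeated on every row as the linking identifiers. It assumes that .companies.entity is an array of entity objects, and the names leaf_items and split_collections are hypothetical. It also keeps every row in memory, so treat it as something to try on a sample, not as a recipe for the full 8GB file.

```python
import csv
import os


def leaf_items(node, path=()):
    """Yield (path, value) for every value that is not a dict or list."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_items(value, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaf_items(value, path + (index,))
    else:
        yield path, node


def split_collections(data, out_dir):
    """Write <collection>.csv for every array-valued key of each entity.

    Returns the list of file paths written.
    """
    tables = {}  # collection name -> {"fields": [...], "rows": [...]}
    for entity in data["companies"]["entity"]:  # assumed layout
        ids = {"entityName": entity.get("entityName"), "nzbn": entity.get("nzbn")}
        for key, value in entity.items():
            if not isinstance(value, list):
                continue  # only collections (arrays) become separate files
            table = tables.setdefault(
                key, {"fields": ["entityName", "nzbn"], "rows": []})
            for record in value:
                row = dict(ids)
                for path, leaf in leaf_items(record):
                    # Column names follow the question's collection__field style.
                    column = "__".join([key, *map(str, path)])
                    if column not in table["fields"]:
                        table["fields"].append(column)
                    row[column] = leaf
                table["rows"].append(row)

    written = []
    for key, table in tables.items():
        file_path = os.path.join(out_dir, key + ".csv")
        with open(file_path, "w", newline="", encoding="utf-8") as handle:
            # restval fills columns that a particular record does not have.
            writer = csv.DictWriter(handle, fieldnames=table["fields"], restval="")
            writer.writeheader()
            writer.writerows(table["rows"])
        written.append(file_path)
    return written
```

Using csv.DictWriter with the union of all column names seen means that irregular records produce empty cells instead of misaligned columns. That is a different trade-off from the strict consistency check in to_table($header).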
