Loading multiple JSON files into R (congressional dates)

Question

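I'm trying to load 12,880 JSON files into a data frame in R, but I've run into some problems. Any pointers on what I'm doing wrong would be much appreciated!

In short, I'm trying to look at the average age of US politicians over time, and I've downloaded data on every congressional politician from the Library of Congress: https://bioguide.congress.gov/search (you can download the entire database by clicking "Download" at the top right).

After unzipping, there are 12,880 JSON files (under 70 MB in total).

I've been able to load some of the data as a list: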

library(tidyverse)
library(jsonlite)
library(magrittr)

path<-"./data/BioguideProfiles" #This is where I am storing the unzipped data
files<-dir(path,pattern="*.json") #This is the list of all the json files
head(files) #Shows me that we have the right list
length(files) #Shows we have right number in list

test<-fromJSON("./data/BioguideProfiles/A000002.json", simplifyDataFrame = TRUE, flatten = TRUE) #loads in 1 file as a list

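This shows the immediate problem: R reads the JSON file only as a list and does not convert it to a data frame. If I try to coerce it into a data frame (using as.data.frame), I get: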

Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 1, 0, 13, 11

Using code from another Stack Overflow post, I tried to load everything at once:

data<-files %>%
  map_df(~fromJSON(file.path(path, .),flatten = TRUE))

This ran for a while, but then gave me this error:

> data<-files %>%
+   map_df(~fromJSON(file.path(path, .),flatten = TRUE))
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/vctrs_error_incompatible_size>
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
---
Backtrace:
  1. files %>% map_df(~fromJSON(file.path(path, .), flatten = TRUE))
  8. vctrs::stop_incompatible_size(...)
  9. vctrs:::stop_incompatible(...)
 10. vctrs:::stop_vctrs(...)
Run `rlang::last_trace()` to see the full context.
> rlang::last_trace()
<error/vctrs_error_incompatible_size>
Error in `stop_vctrs()`:
! Can't recycle `usCongressBioId` (size 0) to match `creativeWork` (size 2).
---
Backtrace:
     ▆
  1. ├─files %>% map_df(~fromJSON(file.path(path, .), flatten = TRUE))
  2. ├─purrr::map_df(., ~fromJSON(file.path(path, .), flatten = TRUE))
  3. │ └─dplyr::bind_rows(res, .id = .id)
  4. │   └─dplyr:::map(...)
  5. │     └─base::lapply(.x, .f, ...)
  6. │       └─dplyr FUN(X[[i]], ...)
  7. │         └─vctrs::data_frame(!!!.x, .name_repair = "minimal")
  8. └─vctrs::stop_incompatible_size(...)
  9.   └─vctrs:::stop_incompatible(...)
 10.     └─vctrs:::stop_vctrs(...)
 11.       └─rlang::abort(message, class = c(class, "vctrs_error"), ...)

(Apologies and thanks - I've come back to R after a few years of doing other things, so I'm very rusty...)

Answer 1

A single imported JSON file is a nested list that contains data frames of unequal size, so it cannot be converted into a single data frame.
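To make the point concrete, here is a minimal sketch using a made-up miniature record rather than a real Bioguide file. The field names usCongressBioId, birthDate and creativeWork come from the question; otherItems and all of the values are hypothetical and only there to produce elements of different lengths.

```r
library(jsonlite)

# Hypothetical miniature record: two scalar fields plus two nested arrays
# of different lengths. "otherItems" and all values are made up.
txt <- '{
  "usCongressBioId": "X000000",
  "birthDate": "1900-01-01",
  "otherItems": [{"title": "a"}, {"title": "b"}, {"title": "c"}],
  "creativeWork": [{"title": "x"}, {"title": "y"}]
}'

rec <- fromJSON(txt, flatten = TRUE)

# Each top-level element has its own row count, so there is no single
# number of rows that a data frame could use.
lens <- sapply(rec, NROW)
print(lens)

# Coercing the whole list fails; the error message is captured as a string
# instead of stopping the script.
res <- tryCatch(as.data.frame(rec), error = function(e) conditionMessage(e))
print(res)
```

The scalar fields have one row each while the nested arrays have three and two, which is the same kind of mismatch reported in the question's error.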

Instead, you can import all the JSON files into a list:

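library(jsonlite)

# pattern is a regular expression, so match the ".json" extension explicitly;
# full.names = TRUE returns each file name with its directory path prepended
files <- list.files("./data/BioguideProfiles", pattern = "\\.json$", full.names = TRUE)
df <- lapply(files, fromJSON) # parses each JSON file; the result is a list of nested lists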

You now have all 12,880 JSON files in a single list.

If you then want to select birthDate from each element, you can use

lapply(df, `[[`, 'birthDate')

For further reference, see selecting elements in a list.
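Since the goal in the question is a data frame, one possible next step is to pull only the scalar fields of interest out of each list element and assemble those. The sketch below uses three made-up records in place of the real files; get_field is a hypothetical helper, and the ISO "YYYY-MM-DD" date format is an assumption about the data that should be checked against the actual files.

```r
library(jsonlite)

# Stand-ins for parsed profile files; in practice this would be the list
# returned by lapply(files, fromJSON). All values are made up.
profiles <- list(
  fromJSON('{"usCongressBioId": "X000001", "birthDate": "1950-03-15", "creativeWork": [{"title": "x"}]}'),
  fromJSON('{"usCongressBioId": "X000002", "birthDate": "1962-11-02"}'),
  fromJSON('{"usCongressBioId": "X000003"}')
)

# Hypothetical helper: return a length-one character value, or NA when the
# field is missing or is not a single value.
get_field <- function(x, name) {
  value <- x[[name]]
  if (is.null(value) || length(value) != 1) NA_character_ else as.character(value)
}

# One row per profile, keeping only scalar fields so every column has the
# same length.
births <- data.frame(
  id = vapply(profiles, function(p) get_field(p, "usCongressBioId"), character(1)),
  birthDate = vapply(profiles, function(p) get_field(p, "birthDate"), character(1)),
  stringsAsFactors = FALSE
)

# Assumption: dates are ISO "YYYY-MM-DD" strings; anything else becomes NA.
births$birthDate <- as.Date(births$birthDate, format = "%Y-%m-%d")
print(births)
```

Using vapply with character(1) rather than lapply guarantees exactly one value per file, so a record without a birthDate yields NA instead of silently shortening the column.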
