Edit and Filter JSON List of Lists in R

Question

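I am trying to display this dataset -> https://mtgjson.com/json/AllSets.json.zip

However, I would like to flatten the data so that it is not nested as a bunch of JSON data in a list, within a list, within a list.

More specifically, I am trying to display the data as a data frame, ordered by $releaseDate (one of the variables).

Here is my attempt so far: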

library(jsonlite)
library(tidyjson)
mtgdata <- fromJSON("~/path/to/file.json")

The result, mtgdata, shows up as this list of lists:

summary(mtgdata)
        Length Class  Mode
UST       9     -none- list
UNH      10     -none- list
UGL      11     -none- list
pWOS      8     -none- list
pWOR      8     -none- list
pWCQ      8     -none- list
pSUS      8     -none- list
pSUM     10     -none- list
pREL      8     -none- list
pPRO      8     -none- list
pPRE      8     -none- list
pPOD      7     -none- list
pMPR      8     -none- list
pMGD      8     -none- list
pMEI      8     -none- list
pLPA      8     -none- list
pLGM      8     -none- list
pJGP     10     -none- list
pHHO     11     -none- list
pWPN      8     -none- list
pGTW      8     -none- list
pGRU     10     -none- list
pGPX      8     -none- list
pFNM     10     -none- list
pELP      8     -none- list
pDRC      7     -none- list
pCMP      8     -none- list
pCEL      8     -none- list
pARL      8     -none- list
pALP     10     -none- list
p2HG      8     -none- list
p15A      8     -none- list
PD3       9     -none- list
PD2       9     -none- list
H09       9     -none- list
PTK      12     -none- list
POR      12     -none- list
PO2      13     -none- list
PCA       7     -none- list
PC2      10     -none- list
HOP      10     -none- list
VMA       9     -none- list
MMA      10     -none- list
MM3       8     -none- list
MM2      11     -none- list
MED       9     -none- list
ME4       9     -none- list
ME3       9     -none- list
ME2       9     -none- list
IMA       8     -none- list
EMA       9     -none- list
A25       8     -none- list
MPS_AKH   8     -none- list
MPS       9     -none- list
EXP       9     -none- list
E02       7     -none- list
V17       8     -none- list
V16       7     -none- list
V15       9     -none- list
V14       9     -none- list
V13       9     -none- list
V12      10     -none- list
V11      10     -none- list
V10       9     -none- list
V09      10     -none- list
DRB       9     -none- list
EVG       9     -none- list
DDT       7     -none- list
DDS       7     -none- list
DDR       7     -none- list
DDQ       8     -none- list
DDP      10     -none- list
DDO      10     -none- list
DDN      10     -none- list
DDM      10     -none- list
DDL      10     -none- list
DDK      10     -none- list
DDJ      10     -none- list
DDI      10     -none- list
DDH      10     -none- list
DDG      10     -none- list
DDF      10     -none- list
DDE      10     -none- list
DDD       9     -none- list
DDC       9     -none- list
DD3_JVC   9     -none- list
DD3_GVL   9     -none- list
DD3_EVG   9     -none- list
DD3_DVD   9     -none- list
DD2      11     -none- list
CNS      11     -none- list
CN2       9     -none- list
CMD      11     -none- list
CMA       7     -none- list
CM1      10     -none- list
C17       6     -none- list
C16       8     -none- list
C15      10     -none- list
C14      10     -none- list
C13      10     -none- list
CEI       9     -none- list
CED       9     -none- list
E01       7     -none- list
ARC       9     -none- list
ZEN      12     -none- list
XLN      12     -none- list
WWK      12     -none- list
WTH      13     -none- list
W17       8     -none- list
W16       8     -none- list
VIS      13     -none- list
VAN       8     -none- list
USG      13     -none- list
ULG      13     -none- list
UDS      13     -none- list
TSP      12     -none- list
TSB      12     -none- list
TPR      11     -none- list
TOR      12     -none- list
TMP      13     -none- list
THS      12     -none- list
STH      13     -none- list
SOM      12     -none- list
SOK      12     -none- list
SOI      10     -none- list
SHM      12     -none- list
SCG      12     -none- list
S99      11     -none- list
S00      11     -none- list
RTR      12     -none- list
RQS       6     -none- list
ROE      12     -none- list
RIX      12     -none- list
RAV      12     -none- list
PLS      13     -none- list
PLC      12     -none- list
PCY      13     -none- list
ORI      11     -none- list
ONS      12     -none- list
OGW      10     -none- list
ODY      13     -none- list
NPH      12     -none- list
NMS      14     -none- list
MRD      12     -none- list
MOR      12     -none- list
MMQ      13     -none- list
MIR      13     -none- list
MGB      10     -none- list
MD1       9     -none- list
MBS      12     -none- list
M15      11     -none- list
M14      11     -none- list
M13      11     -none- list
M12      11     -none- list
M11      11     -none- list
M10      11     -none- list
LRW      12     -none- list
LGN      12     -none- list
LEG      12     -none- list
LEB      11     -none- list
LEA      11     -none- list
KTK      12     -none- list
KLD       9     -none- list
JUD      12     -none- list
JOU      12     -none- list
ITP      11     -none- list
ISD      12     -none- list
INV      13     -none- list
ICE      13     -none- list
HOU       9     -none- list
HML      12     -none- list
GTC      12     -none- list
GPT      12     -none- list
FUT      12     -none- list
FRF_UGIN 10     -none- list
FRF      12     -none- list
FEM      11     -none- list
EXO      13     -none- list
EVE      12     -none- list
EMN       9     -none- list
DTK      12     -none- list
DST      12     -none- list
DRK      12     -none- list
DPA       9     -none- list
DKM       9     -none- list
DKA      12     -none- list
DIS      12     -none- list
DGM      12     -none- list
CST      11     -none- list
CSP      12     -none- list
CP3       7     -none- list
CP2       7     -none- list
CP1       7     -none- list
CON      13     -none- list
CHR      11     -none- list
CHK      12     -none- list
BTD      10     -none- list
BRB      10     -none- list
BOK      12     -none- list
BNG      12     -none- list
BFZ      12     -none- list
AVR      12     -none- list
ATQ      11     -none- list
ATH       9     -none- list
ARN      11     -none- list
ARB      12     -none- list
APC      13     -none- list
ALL      13     -none- list
ALA      12     -none- list
AKH       9     -none- list
AER       9     -none- list
9ED      12     -none- list
8ED      12     -none- list
7ED      12     -none- list
6ED      12     -none- list
5ED      12     -none- list
5DN      12     -none- list
4ED      12     -none- list
3ED      12     -none- list
2ED      11     -none- list
10E      11     -none- list

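Within each of these lists, I am interested in analyzing the variables, so that I can filter and sort the data as if it were a flattened data frame.

When we inspect the variables in one of these lists (taking "mtgdata$UST" as an example), we get this set of variables: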

names(mtgdata$UST)
[1] "name"        "code"        "releaseDate" "border"      "type"        "booster"     "mkm_name"
[8] "mkm_id"      "cards"

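Running the same query on another list in mtgdata ("mtgdata$SOI") gives a different set of variables, although most of them are the same.

As I mentioned above, I am mainly interested in flattening this dataset and ordering it by mtgdata$releaseDate, but as it stands, "$releaseDate" is nested inside the first level of lists ("$UST", etc.).

Any help with this, or with how I could phrase this question better, would be much appreciated.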

Answer 1

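You could try something like this on the command line to convert the array of JSON objects into a file of ndjson records, and then use something like ndjson::stream_in("filename_of the_thing_you_just_converted"), but you would end up with a pretty useless "flat" data frame of 14,000+ columns.

Instead, do some spelunking: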

library(tidyverse)

as1 <- jsonlite::read_json("~/Downloads/AllSets.json")

str(as1, 1) 
## List of 221
##  $ UST     :List of 9
##  $ UNH     :List of 10
##  $ UGL     :List of 11
##  $ pWOS    :List of 8
##  $ pWOR    :List of 8
##  $ pWCQ    :List of 8
##  $ pSUS    :List of 8
##  $ pSUM    :List of 10
##  $ pREL    :List of 8
##  $ pPRO    :List of 8
##  $ pPRE    :List of 8
##  $ pPOD    :List of 7
##  $ pMPR    :List of 8
##  $ pMGD    :List of 8
##  $ pMEI    :List of 8
##  $ pLPA    :List of 8
##  $ pLGM    :List of 8
##  $ pJGP    :List of 10
##  $ pHHO    :List of 11
## ...

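Ugh. It is one of "those" JSON files that doesn't see fit to populate every element of every record, even though the whole file should, in theory, be consistent.

Let's see which of the JSON elements have the most fields populated, since those are the ones most likely to be fully populated: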

map_dbl(as1, length) %>% 
  broom::tidy() %>% 
  arrange(desc(x))
## # A tibble: 221 x 2
##    names     x
##    <chr> <dbl>
##  1 NMS    14.0
##  2 PO2    13.0
##  3 WTH    13.0
##  4 VIS    13.0
##  5 USG    13.0
##  6 ULG    13.0
##  7 UDS    13.0
##  8 TMP    13.0
##  9 STH    13.0
## 10 PLS    13.0
## # ... with 211 more rows
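
The same check can be sketched in base R, which also makes it easy to list which fields each record lacks. This is only an illustration: `sets` is a tiny made-up stand-in for as1, and the names `field_counts`, `all_fields`, and `missing_fields` are not from the original answer.

```r
# Hypothetical stand-in for as1: a named list of records with uneven fields.
sets <- list(
  UST = list(name = "Unstable", code = "UST", releaseDate = "2017-12-08"),
  NMS = list(name = "Nemesis", code = "NMS", releaseDate = "2000-02-14",
             block = "Masques", oldCode = "NEM")
)

# Number of fields per record, largest first.
field_counts <- sort(lengths(sets), decreasing = TRUE)
print(field_counts)

# Every field name seen in any record, then the ones each record lacks.
all_fields <- unique(unlist(lapply(sets, names)))
missing_fields <- lapply(sets, function(rec) setdiff(all_fields, names(rec)))
print(missing_fields)
```

Run against the real as1, `missing_fields` would show at a glance which columns need a default value when the records are converted to rows.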

Let's look at NMS:

str(as1[["NMS"]], 1)
## List of 14
##  $ name              : chr "Nemesis"
##  $ code              : chr "NMS"
##  $ gathererCode      : chr "NE"
##  $ magicCardsInfoCode: chr "ne"
##  $ oldCode           : chr "NEM"
##  $ releaseDate       : chr "2000-02-14"
##  $ border            : chr "black"
##  $ type              : chr "expansion"
##  $ block             : chr "Masques"
##  $ booster           :List of 15
##  $ translations      :List of 5
##  $ mkm_name          : chr "Nemesis"
##  $ mkm_id            : int 32
##  $ cards             :List of 143

You really don't want to flatten booster, translations, or cards; keep them as list columns and unnest them as needed.
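
To give a rough idea of what "unnest as needed" can look like, here is a minimal base R sketch that expands a cards list column into one row per card. The `sets` data frame and its card values are made up for illustration, and the `cards_long` name is hypothetical; with the tidyverse loaded, a tidyr unnesting verb would be the more usual route.

```r
# Hypothetical two-set table with a list column of card records.
sets <- data.frame(code = c("AAA", "BBB"), stringsAsFactors = FALSE)
sets$cards <- list(
  list(list(name = "Card one", cmc = 1), list(name = "Card two", cmc = 2)),
  list(list(name = "Card three", cmc = 3))
)

# Cheap summaries work directly on the list column.
sets$n_cards <- lengths(sets$cards)

# Expand to one row per card only when card-level detail is needed.
cards_long <- do.call(rbind, lapply(seq_len(nrow(sets)), function(i) {
  card_names <- vapply(sets$cards[[i]], function(card) card$name, character(1))
  data.frame(code = rep(sets$code[i], length(card_names)),
             card_name = card_names,
             stringsAsFactors = FALSE)
}))
print(cards_long)
```

Keeping the set-level table narrow and expanding one list column at a time avoids the very wide "flat" data frame described earlier.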

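However, since each record has a different set of fields, we can't simply use data.table::rbindlist() or dplyr::bind_rows(), since they will complain about some of those columns.

We have to go record by record and convert each one to a data frame, handling the missing fields and wrapping the list fields in list(). We'll use a helper function to simplify the idiom for testing for missing values: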

`%l0%` <- function(x, y) if (length(x) > 0) x else y

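^^ is a bit more robust than %||%, which comes with purrr: it falls back to the default for any zero-length value, not only for NULL.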
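
A small sketch of the difference, using a made-up record `rec`. The local `%||%` below is written out by hand on the assumption that it mirrors the NULL-only behaviour of the operator that ships with purrr, so the example runs without any packages.

```r
# Helper from the answer: fall back to y when x is NULL or zero-length.
`%l0%` <- function(x, y) if (length(x) > 0) x else y

# Hand-written NULL-only default (assumed to match purrr's %||%).
`%||%` <- function(x, y) if (is.null(x)) y else x

# Hypothetical record: one present field, one NULL, one empty vector.
rec <- list(code = "UST", block = NULL, oldCode = character(0))

present    <- rec$code %l0% NA_character_     # value is kept
from_null  <- rec$block %l0% NA_character_    # NULL is replaced
from_empty <- rec$oldCode %l0% NA_character_  # zero-length is replaced
kept_empty <- rec$oldCode %||% NA_character_  # zero-length slips through
```

The last case matters when building a one-row data frame per record: a zero-length value would otherwise produce a zero-row result instead of a row with NA.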

Finally:

# Build a one-row tibble per set, then row-bind them all.
map_df(as1, ~{
  tibble(  # tibble() replaces the deprecated data_frame()
    name = .x$name %l0% NA_character_,
    code = .x$code,
    gathererCode = .x$gathererCode %l0% NA_character_,
    magicCardsInfoCode = .x$magicCardsInfoCode %l0% NA_character_,
    oldCode = .x$oldCode %l0% NA_character_,
    releaseDate = .x$releaseDate %l0% NA_character_,
    border = .x$border,
    type = .x$type,
    block = .x$block %l0% NA_character_,
    booster = list(.x$booster),            # keep nested data as list columns
    translations = list(.x$translations),
    mkm_name = .x$mkm_name %l0% NA_character_,
    mkm_id = .x$mkm_id %l0% NA_integer_,   # integer NA so the column stays <int>
    cards = list(.x$cards)
  )
}) -> all_sets

And you can see the result:

all_sets
## # A tibble: 221 x 14
##    name           code  gathererCode magicCardsInfoC… oldCode releaseDate border type  block booster 
##    <chr>          <chr> <chr>        <chr>            <chr>   <chr>       <chr>  <chr> <chr> <list>  
##  1 Unstable       UST   NA           NA               NA      2017-12-08  silver un    NA    <list […
##  2 Unhinged       UNH   NA           uh               NA      2004-11-20  silver un    NA    <list […
##  3 Unglued        UGL   UG           ug               NA      1998-08-11  silver un    NA    <list […
##  4 Wizards of th… pWOS  NA           wotc             NA      1999-09-04  black  promo NA    <NULL>  
##  5 Worlds         pWOR  NA           wrl              NA      1999-08-04  black  promo NA    <NULL>  
##  6 World Magic C… pWCQ  NA           wmcq             NA      2013-04-06  black  promo NA    <NULL>  
##  7 Super Series   pSUS  NA           sus              NA      1999-12-01  black  promo NA    <NULL>  
##  8 Summer of Mag… pSUM  NA           sum              NA      2007-07-21  black  promo NA    <NULL>  
##  9 Release Events pREL  NA           rep              NA      2003-07-26  black  promo NA    <NULL>  
## 10 Pro Tour       pPRO  NA           pro              NA      2007-02-09  black  promo NA    <NULL>  
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## #   cards <list>

glimpse(all_sets)
## Observations: 221
## Variables: 14
## $ name               <chr> "Unstable", "Unhinged", "Unglued", "Wizards of the Coast Online Store"...
## $ code               <chr> "UST", "UNH", "UGL", "pWOS", "pWOR", "pWCQ", "pSUS", "pSUM", "pREL", "...
## $ gathererCode       <chr> NA, NA, "UG", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ magicCardsInfoCode <chr> NA, "uh", "ug", "wotc", "wrl", "wmcq", "sus", "sum", "rep", "pro", "pt...
## $ oldCode            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ releaseDate        <chr> "2017-12-08", "2004-11-20", "1998-08-11", "1999-09-04", "1999-08-04", ...
## $ border             <chr> "silver", "silver", "silver", "black", "black", "black", "black", "bla...
## $ type               <chr> "un", "un", "un", "promo", "promo", "promo", "promo", "promo", "promo"...
## $ block              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ booster            <list> [["rare", "uncommon", "uncommon", "uncommon", "common", "common", "co...
## $ translations       <list> [NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NU...
## $ mkm_name           <chr> "Unstable", "Unhinged", "Unglued", NA, NA, NA, NA, "Summer Magic", NA,...
## $ mkm_id             <int> 1821, 59, 22, NA, NA, NA, NA, 76, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cards              <list> [[["Andrea Radeck", 1, ["W"], ["White"], "95ebdf85f4ea74d584dfdfb72e3...

And, after converting the column to a proper Date object, we can arrange them by releaseDate:

mutate(all_sets, releaseDate = lubridate::ymd(releaseDate)) %>% 
  arrange(desc(releaseDate))
## # A tibble: 221 x 14
##    name        code  gathererCode magicCardsInfoCo… oldCode releaseDate border type     block booster
##    <chr>       <chr> <chr>        <chr>             <chr>   <date>      <chr>  <chr>    <chr> <list> 
##  1 Masters 25  A25   NA           a25               NA      2018-03-16  black  reprint  NA    <NULL> 
##  2 Rivals of … RIX   NA           rix               NA      2018-01-19  black  expansi… Ixal… <list …
##  3 Unstable    UST   NA           NA                NA      2017-12-08  silver un       NA    <list …
##  4 Explorers … E02   NA           e02               NA      2017-11-24  black  board g… NA    <NULL> 
##  5 From the V… V17   NA           v17               NA      2017-11-24  black  from th… NA    <NULL> 
##  6 Iconic Mas… IMA   NA           ima               NA      2017-11-17  black  reprint  NA    <list …
##  7 Duel Decks… DDT   NA           ddt               NA      2017-11-10  black  duel de… NA    <NULL> 
##  8 Ixalan      XLN   NA           xln               NA      2017-09-29  black  expansi… Ixal… <list …
##  9 Commander … C17   NA           NA                NA      2017-08-25  black  command… NA    <NULL> 
## 10 Hour of De… HOU   NA           hou               NA      2017-07-14  black  expansi… Amon… <list …
## # ... with 211 more rows, and 4 more variables: translations <list>, mkm_name <chr>, mkm_id <int>,
## #   cards <list>
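
Since the question also asks about filtering, here is a minimal base R sketch of filtering on two columns and then sorting by date. The four rows are a hand-copied illustrative subset of the output above (the "expansion" type for RIX and XLN is assumed from the truncated display), and the cutoff date and the `recent_expansions` name are arbitrary choices for the example.

```r
# Illustrative subset of the flattened table (not the full data).
sets <- data.frame(
  code        = c("UST", "RIX", "XLN", "NMS"),
  type        = c("un", "expansion", "expansion", "expansion"),
  releaseDate = c("2017-12-08", "2018-01-19", "2017-09-29", "2000-02-14"),
  stringsAsFactors = FALSE
)

# Convert the character dates so comparisons and sorting are chronological.
sets$releaseDate <- as.Date(sets$releaseDate)

# Keep expansions released on or after an arbitrary cutoff date.
keep <- sets$type == "expansion" & sets$releaseDate >= as.Date("2017-01-01")
recent_expansions <- sets[keep, ]

# Newest first.
recent_expansions <- recent_expansions[
  order(recent_expansions$releaseDate, decreasing = TRUE), ]
print(recent_expansions)
```

With dplyr loaded, the same steps would typically be written with its filter and arrange verbs on all_sets after the date conversion.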
