Extract multiple columns from list of lists, and save in data.frame
I have the following list:
library(rjson)
j <- fromJSON(file='https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=1000&type=Dataset&replica=false&latest=true&project=CORDEX&domain=EUR-11&experiment=rcp85&time_frequency=day&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson')
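I am interested in extracting data from this component: j$response$docs, which is a list of lists. The 'internal' lists should all have the same names.
I would like to save the output in a data.frame() or a tibble().
For a few selected variables, the code below works and gives the desired output: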
nmod <- length(j$response$docs)
for (i in 1:nmod) {
  #select one list at a time
  j1 <- j$response$docs[[i]]
  tmp <- data.frame(variable=j1$variable,
                    variable_long_name=j1$variable_long_name,
                    rcm_name=j1$rcm_name,
                    driving_model=j1$driving_model,
                    cf_standard_name=j1$cf_standard_name
                    )
  #join them
  if (i==1) {
    d <- tmp
  } else {
    d <- rbind(d, tmp)
  }
}
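However, I am wondering whether there is a more elegant and efficient way, perhaps using tidyr, dplyr, or purrr, that would also let me select all the "columns" rather than only the few selected here.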
Instead of rjson, use jsonlite:
library(jsonlite)
j <- jsonlite::fromJSON('https://esgf-data.dkrz.de/esg-search/search/?offset=0&limit=1000&type=Dataset&replica=false&latest=true&project=CORDEX&domain=EUR-11&experiment=rcp85&time_frequency=day&facets=rcm_name%2Cproject%2Cproduct%2Cdomain%2Cinstitute%2Cdriving_model%2Cexperiment%2Cexperiment_family%2Censemble%2Crcm_version%2Ctime_frequency%2Cvariable%2Cvariable_long_name%2Ccf_standard_name%2Cdata_node&format=application%2Fsolr%2Bjson')
# The names you want to find in the nested returned data
look_for <- c('variable','variable_long_name' ,
'rcm_name','driving_model',
'cf_standard_name')
new_df <- as.data.frame(sapply(look_for, function(i){
unlist(j$response$docs[[i]])
}))
str(new_df)
'data.frame': 832 obs. of 5 variables:
$ variable : chr "clh" "clivi" "rsds" "rlds" ...
$ variable_long_name: chr "High Level Cloud Fraction" "Ice Water Path" "Surface Downwelling Shortwave Radiation" "Surface Downwelling Longwave Radiation" ...
$ rcm_name : chr "HIRHAM5" "HIRHAM5" "HIRHAM5" "HIRHAM5" ...
$ driving_model : chr "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" "ICHEC-EC-EARTH" ...
$ cf_standard_name : chr "cloud_area_fraction_in_atmosphere_layer" "atmosphere_cloud_ice_content" "surface_downwelling_shortwave_flux_in_air" "surface_downwelling_longwave_flux_in_air" ...
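The reason this works is that jsonlite::fromJSON simplifies an array of records into a data.frame by default, so j$response$docs is already tabular and each selected field can be pulled out as a column. The following is a minimal sketch on a hypothetical inline JSON string (j_toy and its contents are made up for illustration, not taken from the ESGF response). It also shows a caveat: the unlist() step only lines up when every selected field holds exactly one value per record.

```r
library(jsonlite)

# Hypothetical miniature of the Solr response: every field of a doc is a JSON array
txt <- '{"response": {"docs": [
  {"variable": ["clh"],   "rcm_name": ["HIRHAM5"], "experiment_family": ["RCP", "All"]},
  {"variable": ["clivi"], "rcm_name": ["HIRHAM5"], "experiment_family": ["RCP", "All"]}
]}}'

j_toy <- fromJSON(txt)

# docs is simplified to a data.frame with one row per record
is.data.frame(j_toy$response$docs)

# Each field is a list-column; lengths() shows how many values each record holds
lengths(j_toy$response$docs$variable)
lengths(j_toy$response$docs$experiment_family)

# Safe only for fields with exactly one value per record
look_for_toy <- c("variable", "rcm_name")
new_toy <- as.data.frame(
  sapply(look_for_toy, function(i) unlist(j_toy$response$docs[[i]])),
  stringsAsFactors = FALSE
)
str(new_toy)
```

For a field such as experiment_family in this toy data, unlist() would return more values than there are rows, so such fields need to be collapsed or kept as list-columns instead.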
You can do this with the help of the purrr package. I thought at_depth might work here, but I ended up using nested map_df calls.
library(purrr)
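Your variables have different lengths, so the first thing to do is make sure each variable has length 1. This can be done by collapsing each element of the inner list with paste. I used a comma as the separator. Doing this via map_df returns a 1-row tibble.
Here is an example for the first inner list.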
map_df(j$response$docs[[1]], paste, collapse = ",")
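Now we can loop through the outer list, making a 1-row tibble for each inner list. We use map_df to bind them all together. The output is an 832-row tibble, one row per list. I used the .id argument to add a grouping variable to the result, which may not be needed.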
d1 = map_df(j$response$docs, ~map_df(.x, paste, collapse = ","), .id = "group")
d1
# A tibble: 832 × 45
group id version
<chr> <chr> <chr>
1 1 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.clh.v20131119|cordexesg.dmi.dk 20131119
2 2 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.clivi.v20131119|cordexesg.dmi.dk 20131119
3 3 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsds.v20131119|cordexesg.dmi.dk 20131119
4 4 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlds.v20131119|cordexesg.dmi.dk 20131119
5 5 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsus.v20131119|cordexesg.dmi.dk 20131119
6 6 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlus.v20131119|cordexesg.dmi.dk 20131119
7 7 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsdt.v20131119|cordexesg.dmi.dk 20131119
8 8 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rsut.v20131119|cordexesg.dmi.dk 20131119
9 9 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.rlut.v20131119|cordexesg.dmi.dk 20131119
10 10 cordex.output.EUR-11.DMI.ICHEC-EC-EARTH.rcp85.r3i1p1.HIRHAM5.v1.day.psl.v20131119|cordexesg.dmi.dk 20131119
# ... with 822 more rows, and 42 more variables:
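To make the two levels of map_df easier to follow, here is a small sketch on hypothetical toy data (docs and d_toy are illustrative names, and the values are made up). It assumes purrr and dplyr are installed, since map_df relies on dplyr::bind_rows. In purrr 1.0.0 and later, map_df is superseded in favour of map() followed by list_rbind(), though it should still run.

```r
library(purrr)  # map_df() also needs dplyr to be installed

# Hypothetical stand-in for j$response$docs: fields differ in length
docs <- list(
  list(variable = "clh",   access = c("HTTPServer", "GridFTP")),
  list(variable = "clivi", access = c("HTTPServer", "GridFTP", "OPENDAP"))
)

# Inner map_df(): paste() collapses every field to one string -> 1-row tibble
# Outer map_df(): binds the 1-row tibbles; .id stores each list's position
d_toy <- map_df(docs, ~ map_df(.x, paste, collapse = ","), .id = "group")
d_toy
```

Note that paste() converts every field to character, so numeric fields would need to be converted back afterwards if their type matters.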
If you want multiple rows for variables with length greater than 1, such as access and experiment_family, you can use tidyr::separate_rows to split the data into multiple rows.
tidyr::separate_rows(d1, experiment_family)
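One thing to watch: by default separate_rows splits on any run of non-alphanumeric characters (its default is sep = "[^[:alnum:].]+"), so values that contain spaces or hyphens would also be broken apart. Since the values were collapsed with a comma, passing sep = "," explicitly is probably the safer choice. A minimal sketch on hypothetical toy data (d_sep and its values are made up for illustration):

```r
library(tibble)
library(tidyr)

# Hypothetical collapsed data: one comma-separated string per row
d_sep <- tibble(
  variable = c("clh", "clivi"),
  experiment_family = c("RCP,All", "RCP,All")
)

# Split only on the comma that was used as the collapse separator
long <- separate_rows(d_sep, experiment_family, sep = ",")
long
```

Each original row is repeated once per value, with the other columns copied along. In tidyr 1.3.0 and later, separate_longer_delim is offered as the newer replacement for this step.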