Extracting data from a list of lists into its own `data.frame` with `purrr`
Representative sample data (list of lists):
l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T",
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr",
score = -0.21104594634643), .Names = c("id", "label",
"link", "score")), e = 49.1279871269422), .Names = c("a",
"b", "c", "d", "e")), structure(list(a = -0.934821052832427,
b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina",
link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 9L, label = "Nebraska", link = "America/Scoresbysund",
score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id",
"label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a",
"b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f",
c = "P", d = list(structure(list(id = 8L, label = "Georgia",
link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 2L, label = "Washington", link = "America/Shiprock",
score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 6L, label = "North Dakota", link = "Universal",
score = 1.03168296038975), .Names = c("id", "label",
"link", "score")), structure(list(id = 1L, label = "New Hampshire",
link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id",
"label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania",
link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id",
"label", "link", "score"))), e = 132.1153538536), .Names = c("a",
"b", "c", "d", "e")), structure(list(a = 0.202685974077313, b = "x",
c = "O", d = structure(list(id = 3L, label = "Delaware",
link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id",
"label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a",
"b", "c", "d", "e")), structure(list(a = -0.396243444741009,
b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota",
link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id",
"label", "link", "score", "externalId")), structure(list(
id = 9L, label = "Nebraska", link = "America/Ojinaga",
score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id",
"label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a",
"b", "c", "d", "e")))
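I have a list of lists, courtesy of a JSON data download. The list has 176 elements, each with 33 nested elements, some of which are also lists of varying length.
I am interested in analyzing the data contained in one particular nested list, which has a length of about 150 for each of the 176 elements; its entries have either 4 or 5 elements each. I am trying to extract this nested list and convert it into a `data.frame` so that I can perform some analysis.
In the representative sample data above, I am interested in the nested list `d` for each of the 5 elements of `l`. The desired `data.frame` would therefore look like: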
id label link score externalId
5 Utah Asia/Anadyr -0.2110459 NA
8 South Carolina Pacific/Wallis 0.5265409 -6.743544
.
.
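I have been trying to use `purrr`, which seems to offer a sensible and consistent flow for processing data in lists, but I am running into errors that I can't fully understand, most likely because I don't properly understand `purrr`'s commands/logic or lists (probably both). This is the code I have been trying, which throws the error shown: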
df <- map_df(l, "d", ~as.data.frame(.))
Error: incompatible sizes (5 != 4)
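I think this has to do with the `d` of each component having a different length, or perhaps the contained data being different (sometimes 4 elements, sometimes 5), or the function I am using here being misspecified; to be honest, I'm not sure.
I have worked around this with a for loop, which I know is inefficient, hence my question here on SO.
This is the for loop I currently use: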
df <- data.frame(id = integer(), label = character(), score = numeric(), externalId = numeric())
for(i in seq_along(l)){
df_temp <- l[[i]][[4]] %>% map_df(~as.data.frame(.))
df <- rbind(df, df_temp)
}
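Some help, preferably with `purrr` (or some version of `apply`, since that would still be better than my for loop), would be much appreciated. Also, if there are resources on the above, I would like to understand this rather than just find the right code.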
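For more on purrr, I recommend Grolemund and Wickham's "R for Data Science": http://r4ds.had.co.nz/
I think one issue you are facing is that some of the `d` items in `l` are lists of variables with one observation each, which can be converted to a data frame, while other items are lists of such lists.
But I'm not that good with purrr myself. Here is how I would do it: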
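library(dplyr)  # provides %>% and bind_rows()

l <- lapply(l, function(x) x$d)  # keep only the nested list of interest (overwrites l)
# Named elements are single observations; unnamed ones are lists of observations.
list_of_observations <- Filter(function(x) !is.null(names(x)), l)
list_of_lists <- Filter(function(x) is.null(names(x)), l)
# Flatten one level so that every element is a single observation.
another_list_of_observations <- unlist(list_of_lists, recursive = FALSE)
df <- lapply(c(list_of_observations, another_list_of_observations),
             as.data.frame, stringsAsFactors = FALSE) %>% bind_rows()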
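To see the two shapes described above for yourself, a quick check along these lines may help. This is only a sketch in base R: `l_demo` is a cut-down stand-in for the question's data (a few fields from its first two elements, values rounded), and `is_single_record` and `n_records` are names made up for illustration.

```r
# Cut-down stand-in for the question's data (hypothetical object name):
# `d` is either one record (a named list) or an unnamed list of records.
l_demo <- list(
  list(a = -1.55, d = list(id = 5L, label = "Utah", score = -0.2110459)),
  list(a = -0.93, d = list(
    list(id = 8L, label = "South Carolina", score = 0.5265409,
         externalId = -6.743544),
    list(id = 9L, label = "Nebraska", score = 0.2508955,
         externalId = 16.425747)
  ))
)

# A single record carries names (id, label, ...); a list of records does not.
is_single_record <- vapply(l_demo,
                           function(x) !is.null(names(x$d)),
                           logical(1))
is_single_record
# [1]  TRUE FALSE

# Number of records held in each `d`: 1 for a single record,
# otherwise the length of the unnamed list.
n_records <- ifelse(is_single_record, 1L,
                    lengths(lapply(l_demo, `[[`, "d")))
n_records
# [1] 1 2
```

Run on the real list, a mix of `TRUE` and `FALSE` in `is_single_record` would confirm that `d` does not have one uniform shape.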
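You can do this in three steps: first pull out `d`, then bind the rows within each element of `d`, and then bind everything into a single object.
I use `bind_rows` from dplyr for the within-list row binding. `map_df` does the final row binding.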
library(purrr)
library(dplyr)
l %>%
map("d") %>%
map_df(bind_rows)
This is also equivalent to:
map_df(l, ~bind_rows(.x[["d"]] ) )
The result looks like this:
# A tibble: 12 x 5
id label link score externalId
<int> <chr> <chr> <dbl> <dbl>
1 5 Utah Asia/Anadyr -0.2110459 NA
2 8 South Carolina Pacific/Wallis 0.5265409 -6.743544
3 9 Nebraska America/Scoresbysund 0.2508955 16.425747
4 8 Georgia America/Nome 0.5264941 7.915836
5 2 Washington America/Shiprock -0.5551864 15.068666
6 6 North Dakota Universal 1.0316830 NA
7 1 New Hampshire America/Cordoba 1.2158206 9.727642
8 1 Alaska Asia/Istanbul -0.2318326 NA
9 4 Pennsylvania Africa/Dar_es_Salaam 0.5902453 NA
10 3 Delaware Asia/Samarkand 0.6955771 15.236482
11 4 North Dakota America/Tortola 1.0306027 -7.216669
12 9 Nebraska America/Ojinaga -1.1139800 -8.451451
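For anyone who wants to see what each of the three steps produces, here is the same idea sketched with the intermediate objects named. `l_demo` is a cut-down stand-in for the question's data (values rounded), and `d_only`, `per_element` and `result` are illustrative names, not anything from purrr or dplyr. The last step uses `bind_rows()` rather than `map_df()`, which should give the same result here; as far as I know `map_df()` is marked as superseded from purrr 1.0.0 onward, so this form may also age better.

```r
library(purrr)
library(dplyr)

# Cut-down stand-in for the question's data (hypothetical object name).
l_demo <- list(
  list(a = -1.55, d = list(id = 5L, label = "Utah", score = -0.2110459)),
  list(a = -0.93, d = list(
    list(id = 8L, label = "South Carolina", score = 0.5265409,
         externalId = -6.743544),
    list(id = 9L, label = "Nebraska", score = 0.2508955,
         externalId = 16.425747)
  ))
)

# Step 1: pull out `d` from every top-level element.
d_only <- map(l_demo, "d")

# Step 2: bind rows within each `d`. A single record becomes a one-row
# tibble; a list of records becomes one row per record.
per_element <- map(d_only, bind_rows)

# Step 3: bind the per-element tibbles into one. Columns missing from
# some pieces (here `externalId` for Utah) are filled with NA.
result <- bind_rows(per_element)
result
```

Printing `per_element` before the final step is a handy way to check that every element has already become a data frame, whichever shape its `d` started in.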