R: Issues when pulling values from a nested list into a dataframe
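This should be a relatively simple question about pulling items from a list into a dataframe, but I'm running into some trouble.
I have the following list (I'm only showing you part of it; the real list is much longer):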
str(raw_jobs_list)
List of 2
$ :List of 4
..$ id : chr "3594134"
..$ score : int 1
..$ fields:List of 16
.. ..$ date :List of 3
.. .. ..$ changed: chr "2020-04-18T00:35:00+00:00"
.. .. ..$ created: chr "2020-04-07T11:15:37+00:00"
.. .. ..$ closing: chr "2020-04-17T00:00:00+00:00"
.. ..$ country :List of 1
.. .. ..$ :List of 6
.. .. .. ..$ href : chr "https://api.reliefweb.int/v1/countries/149"
.. .. .. ..$ name : chr "Mali"
.. .. .. ..$ location :List of 2
.. .. .. .. ..$ lon: num -1.25
.. .. .. .. ..$ lat: num 17.4
.. .. .. ..$ id : int 149
.. .. .. ..$ shortname: chr "Mali"
.. .. .. ..$ iso3 : chr "mli"
.. ..$ title : chr "REGIONAL MANAGER West Africa"
I tried to pull the values out using the following:
jobs_data_df <- list.stack(list.select(raw_jobs_list,
fields$title,
fields$country$name,
fields$date$created))
where raw_jobs_list is the list, but I get these NAs and am not sure how to get past them.
glimpse(jobs_data_df)
Rows: 2
Columns: 3
$ V1 <chr> "REGIONAL MANAGER West Africa", "Support Relief Group Public Health Advisor (Multiple Positions)"
$ V2 <lgl> NA, NA
$ V3 <chr> "2020-04-07T11:15:37+00:00", "2020-05-04T15:20:37+00:00"
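There is probably something obvious I'm overlooking, since I haven't worked with lists much before. Any ideas?
Many thanks!
C
PS. If you're interested, I'm working with this API, and this is how I got this far.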
library(httr)  # provides GET() and content()
jobs <- GET(url = "https://api.reliefweb.int/v1/jobs?appname=apidoc&preset=analysis&profile=full&limit=2")
raw_jobs_list <- content(jobs)$data
The portion shown above is a subset of the full data; here is an abbreviated version of the list's elements:
dput(lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))]))))
list(list(id = "3594134", score = 1L, fields = list(date = list(
changed = "2020-04-18T00:35:00+00:00", created = "2020-04-07T11:15:37+00:00",
closing = "2020-04-17T00:00:00+00:00"), country = list(list(
href = "https://api.reliefweb.int/v1/countries/149", name = "Mali",
location = list(lon = -1.25, lat = 17.35), id = 149L, shortname = "Mali",
iso3 = "mli")), title = "REGIONAL MANAGER West Africa")),
list(id = "3594129", score = 1L, fields = list(date = list(
changed = "2020-05-19T00:04:01+00:00", created = "2020-05-04T15:20:37+00:00",
closing = "2020-05-18T00:00:00+00:00"), title = "Support Relief Group Public Health Advisor (Multiple Positions)")))
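If you look at just one element at a time, I think as.data.frame does a reasonably good job. I'll demonstrate with the abbreviated data (which I've edited into your question); the first element looks like this: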
raw_jobs_sublist <- lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))])))
as.data.frame(raw_jobs_sublist[[1]])
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.title
# 1 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149 Mali -1.25 17.35 149 Mali mli REGIONAL MANAGER West Africa
Shown a different way (just for variety here), it is
str(as.data.frame(raw_jobs_sublist[[1]]))
# 'data.frame': 1 obs. of 13 variables:
# $ id : chr "3594134"
# $ score : int 1
# $ fields.date.changed : chr "2020-04-18T00:35:00+00:00"
# $ fields.date.created : chr "2020-04-07T11:15:37+00:00"
# $ fields.date.closing : chr "2020-04-17T00:00:00+00:00"
# $ fields.country.href : chr "https://api.reliefweb.int/v1/countries/149"
# $ fields.country.name : chr "Mali"
# $ fields.country.location.lon: num -1.25
# $ fields.country.location.lat: num 17.4
# $ fields.country.id : int 149
# $ fields.country.shortname : chr "Mali"
# $ fields.country.iso3 : chr "mli"
# $ fields.title : chr "REGIONAL MANAGER West Africa"
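The flattened names also hint at why the attempt in the question returned NA for the country: country is a list holding one unnamed list, so fields$country$name finds nothing, while fields$country[[1]]$name reaches the value. Here is a minimal base-R sketch on a trimmed copy of the two records from the dput output. The helper pluck_or_na is a hypothetical name introduced here, and taking only the first country is an assumption made for illustration:

```r
# Trimmed copies of the two records from the dput() output.
jobs <- list(
  list(id = "3594134", fields = list(
    date = list(created = "2020-04-07T11:15:37+00:00"),
    country = list(list(name = "Mali", iso3 = "mli")),
    title = "REGIONAL MANAGER West Africa")),
  list(id = "3594129", fields = list(
    date = list(created = "2020-05-04T15:20:37+00:00"),
    title = "Support Relief Group Public Health Advisor (Multiple Positions)"))
)

# `country` is an unnamed list of length 1, so `$name` on it is NULL ...
country_direct <- jobs[[1]]$fields$country$name
# ... while indexing into its first element reaches the value.
country_first <- jobs[[1]]$fields$country[[1]]$name

# Hypothetical helper: return NA instead of NULL so every record yields one value.
pluck_or_na <- function(x) if (is.null(x)) NA_character_ else x

jobs_df <- data.frame(
  title = vapply(jobs, function(j) pluck_or_na(j$fields$title), character(1)),
  country = vapply(jobs, function(j) {
    # Assumption: only the first listed country is of interest.
    first <- if (length(j$fields$country)) j$fields$country[[1]] else NULL
    pluck_or_na(first$name)
  }, character(1)),
  created = vapply(jobs, function(j) pluck_or_na(j$fields$date$created), character(1)),
  stringsAsFactors = FALSE
)
```

This targeted extraction is handy when only a few fields are needed; the as.data.frame route is more convenient when you want everything.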
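To apply as.data.frame to all of the elements, we need to consider a few things:
- not all elements have all fields, so whichever method we use needs to "fill" the gaps;
- we don't want to combine them iteratively, so let's bind them all at once.
Here's a first attempt: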
dplyr::bind_rows(lapply(raw_jobs_sublist, as.data.frame))
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.title
# 1 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149 Mali -1.25 17.35 149 Mali mli REGIONAL MANAGER West Africa
# 2 3594129 1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00 <NA> <NA> NA NA NA <NA> <NA> Support Relief Group Public Health Advisor (Multiple Positions)
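This also works with data.table::rbindlist (with fill = TRUE). It does not work with do.call(rbind.data.frame, ...), since that is less tolerant of missing names. (It can be made to work without too much trouble, but there are occasionally other benefits to using either of the two options above.)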
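If you want to stay in base R, one possible workaround is to add the missing columns as NA to every frame before calling rbind. This is only a sketch: the two small frames stand in for the output of lapply(raw_jobs_sublist, as.data.frame), and fill_missing is a hypothetical helper name, not part of any package.

```r
# Stand-ins for two flattened records; the second lacks the country column.
frames <- list(
  data.frame(id = "3594134",
             fields.country.name = "Mali",
             fields.title = "REGIONAL MANAGER West Africa",
             stringsAsFactors = FALSE),
  data.frame(id = "3594129",
             fields.title = "Support Relief Group Public Health Advisor (Multiple Positions)",
             stringsAsFactors = FALSE)
)

# Union of all column names, in order of first appearance.
all_cols <- unique(unlist(lapply(frames, names)))

# Hypothetical helper: add any absent column as NA, then align the column order.
fill_missing <- function(df, cols) {
  for (nm in setdiff(cols, names(df))) df[[nm]] <- NA
  df[, cols, drop = FALSE]
}

combined <- do.call(rbind, lapply(frames, fill_missing, cols = all_cols))
```

dplyr::bind_rows and data.table::rbindlist(fill = TRUE) do this alignment for you, which is one reason to prefer them.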
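Note: if you do this on the original data, R's default mechanism for displaying a data.frame will fill your console with all of the text, which can be annoying. If you already use dplyr or data.table in any of your work, both formats truncate long strings and are therefore much easier on the console. For example, showing the whole thing: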
tibble::tibble(dplyr::bind_rows(lapply(raw_jobs_list, as.data.frame)))
# # A tibble: 2 x 42
# id score fields.date.cha~ fields.date.cre~ fields.date.clo~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.career_c~ fields.career_c~ fields.name fields.source.h~ fields.source.n~ fields.source.id fields.source.t~ fields.source.t~ fields.source.s~ fields.source.h~ fields.title fields.body
# <chr> <int> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> <chr> <int> <chr> <chr> <chr> <chr>
# 1 3594~ 1 2020-04-18T00:3~ 2020-04-07T11:1~ 2020-04-17T00:0~ https://api.rel~ Mali -1.25 17.4 149 Mali mli Donor Relations~ 20966 Bamako https://api.rel~ ICCO COOPERATION 45059 Non-governmenta~ 274 ICCO COOPERATION https://www.icc~ REGIONAL MA~ "**VACANCY~
# 2 3594~ 1 2020-05-19T00:0~ 2020-05-04T15:2~ 2020-05-18T00:0~ <NA> <NA> NA NA NA <NA> <NA> Program/Project~ 6867 <NA> https://api.rel~ US Agency for I~ 1751 Government 271 USAID http://www.usai~ Support Rel~ "### **SOL~
# # ... with 18 more variables: fields.type.name <chr>, fields.type.id <int>, fields.experience.name <chr>, fields.experience.id <int>, fields.url <chr>, fields.url_alias <chr>, fields.how_to_apply <chr>, fields.id <int>, fields.status <chr>, fields.body.html <chr>, fields.how_to_apply.html <chr>, href <chr>, fields.source.longname <chr>, fields.source.spanish_name <chr>,
# # fields.theme.name <chr>, fields.theme.id <int>, fields.theme.name.1 <chr>, fields.theme.id.1 <int>
data.table::rbindlist(lapply(raw_jobs_list, as.data.frame), fill = TRUE)
# id score fields.date.changed fields.date.created fields.date.closing fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3 fields.career_categories.name fields.career_categories.id fields.name
# <char> <int> <char> <char> <char> <char> <char> <num> <num> <int> <char> <char> <char> <int> <char>
# 1: 3594134 1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countri... Mali -1.25 17.35 149 Mali mli Donor Relations/Grants Management 20966 Bamako
# 2: 3594129 1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00 <NA> <NA> NA NA NA <NA> <NA> Program/Project Management 6867 <NA>
# 27 variables not shown: [fields.source.href <char>, fields.source.name <char>, fields.source.id <int>, fields.source.type.name <char>, fields.source.type.id <int>, fields.source.shortname <char>, fields.source.homepage <char>, fields.title <char>, fields.body <char>, fields.type.name <char>, ...]
For data.table, I've set some options to facilitate this. Notably, I'm currently using:
options(
datatable.prettyprint.char = 36,
datatable.print.topn = 10,
datatable.print.class = TRUE,
datatable.print.trunc.cols = TRUE
)
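At this point, you have a data.frame that should contain all of the data (with NA for elements that lack a field). From here, if you don't like the nested-name convention (e.g., fields.date.changed), the columns can easily be renamed using patterns or conventional methods.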
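As a rough illustration of pattern-based renaming, here is a base-R sketch on a small stand-in frame. The column names come from the output above; the fields.id value is a placeholder, not real data. Note that simply stripping the "fields." prefix would make fields.id collide with id, so the sketch guards against that with make.unique:

```r
# Stand-in frame using column names seen in the flattened output.
df <- data.frame(
  id = "3594134",
  fields.id = 1L,  # placeholder value, for illustration only
  fields.date.changed = "2020-04-18T00:35:00+00:00",
  fields.title = "REGIONAL MANAGER West Africa",
  stringsAsFactors = FALSE
)

# Option 1: keep the hierarchy but swap dots for underscores.
underscored <- gsub(".", "_", names(df), fixed = TRUE)

# Option 2: drop the "fields." prefix; make.unique() appends a suffix
# to duplicates such as `id` versus `fields.id`.
stripped <- make.unique(sub("^fields\\.", "", names(df)))

names(df) <- stripped
```

Which option suits you depends on whether the prefix carries information you want to keep.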