R:将嵌套列表中的值拉入数据框时出现问题

R: Issues when pulling values from a nested list into a dataframe

因此,关于将列表中的项目拉入数据框,这应该是一个相对简单的问题,但我遇到了一些问题。

我有以下列表(我只为您展示了列表的一部分,它比这个长得多):

str(raw_jobs_list)

List of 2
 $ :List of 4
  ..$ id    : chr "3594134"
  ..$ score : int 1
  ..$ fields:List of 16
  .. ..$ date             :List of 3
  .. .. ..$ changed: chr "2020-04-18T00:35:00+00:00"
  .. .. ..$ created: chr "2020-04-07T11:15:37+00:00"
  .. .. ..$ closing: chr "2020-04-17T00:00:00+00:00"
  .. ..$ country          :List of 1
  .. .. ..$ :List of 6
  .. .. .. ..$ href     : chr "https://api.reliefweb.int/v1/countries/149"
  .. .. .. ..$ name     : chr "Mali"
  .. .. .. ..$ location :List of 2
  .. .. .. .. ..$ lon: num -1.25
  .. .. .. .. ..$ lat: num 17.4
  .. .. .. ..$ id       : int 149
  .. .. .. ..$ shortname: chr "Mali"
  .. .. .. ..$ iso3     : chr "mli"
  .. ..$ title            : chr "REGIONAL MANAGER West Africa"

我尝试使用以下方法将它们拉出来:

jobs_data_df <- list.stack(list.select(raw_jobs_list, 
                                       fields$title, 
                                       fields$country$name,
                                       fields$date$created))

其中 raw_jobs_list 是列表,但我得到了这些 NA,但不确定如何通过它。

glimpse(jobs_data_df)
Rows: 2
Columns: 3
$ V1 <chr> "REGIONAL MANAGER West Africa", "Support Relief Group Public Health Advisor (Multiple Positions)"
$ V2 <lgl> NA, NA
$ V3 <chr> "2020-04-07T11:15:37+00:00", "2020-05-04T15:20:37+00:00"

可能有一些明显的东西我忽略了,因为我以前没有太多地使用过列表。有什么想法吗?

非常感谢! C

PS。如果您有兴趣,我正在使用这个 API,这就是我到目前为止的方式。

jobs <- GET(url = "https://api.reliefweb.int/v1/jobs?appname=apidoc&preset=analysis&profile=full&limit=2")


raw_jobs_list <- content(jobs)$data

上面显示的部分是整个数据的一个子集;这是列表第一个元素的一部分:

dput(lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))]))))

list(list(id = "3594134", score = 1L, fields = list(date = list(
    changed = "2020-04-18T00:35:00+00:00", created = "2020-04-07T11:15:37+00:00", 
    closing = "2020-04-17T00:00:00+00:00"), country = list(list(
    href = "https://api.reliefweb.int/v1/countries/149", name = "Mali", 
    location = list(lon = -1.25, lat = 17.35), id = 149L, shortname = "Mali", 
    iso3 = "mli")), title = "REGIONAL MANAGER West Africa")), 
    list(id = "3594129", score = 1L, fields = list(date = list(
        changed = "2020-05-19T00:04:01+00:00", created = "2020-05-04T15:20:37+00:00", 
        closing = "2020-05-18T00:00:00+00:00"), title = "Support Relief Group Public Health Advisor (Multiple Positions)")))

如果您一次只看一个元素,我认为 as.data.frame 做得相当不错。虽然我将演示如何使用缩写数据(我已将其编辑到您的问题中),但第一个元素如下所示:

raw_jobs_sublist <- lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))])))

as.data.frame(raw_jobs_sublist[[1]])
#        id score       fields.date.changed       fields.date.created       fields.date.closing                        fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3                 fields.title
# 1 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149                Mali                       -1.25                       17.35               149                     Mali                 mli REGIONAL MANAGER West Africa

以不同方式显示(只是为了这里的多样性),它是

str(as.data.frame(raw_jobs_sublist[[1]]))
# 'data.frame': 1 obs. of  13 variables:
#  $ id                         : chr "3594134"
#  $ score                      : int 1
#  $ fields.date.changed        : chr "2020-04-18T00:35:00+00:00"
#  $ fields.date.created        : chr "2020-04-07T11:15:37+00:00"
#  $ fields.date.closing        : chr "2020-04-17T00:00:00+00:00"
#  $ fields.country.href        : chr "https://api.reliefweb.int/v1/countries/149"
#  $ fields.country.name        : chr "Mali"
#  $ fields.country.location.lon: num -1.25
#  $ fields.country.location.lat: num 17.4
#  $ fields.country.id          : int 149
#  $ fields.country.shortname   : chr "Mali"
#  $ fields.country.iso3        : chr "mli"
#  $ fields.title               : chr "REGIONAL MANAGER West Africa"

为了对所有元素执行此操作,我们需要考虑以下几点:

  • 并不是所有的元素都有所有的字段,所以无论我们用什么方法都需要“填”空;
  • 我们不想迭代地做,让我们一次把它们结合起来。

这是第一次尝试:

dplyr::bind_rows(lapply(raw_jobs_sublist, as.data.frame))
#        id score       fields.date.changed       fields.date.created       fields.date.closing                        fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3                                                    fields.title
# 1 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149                Mali                       -1.25                       17.35               149                     Mali                 mli                                    REGIONAL MANAGER West Africa
# 2 3594129     1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00                                       <NA>                <NA>                          NA                          NA                NA                     <NA>                <NA> Support Relief Group Public Health Advisor (Multiple Positions)

这也适用于 data.table::rbindlist。它不适用于 do.call(rbind.data.frame, ...),因为它对缺失名称的容忍度较低。 (这可以轻松完成,使用这两个选项偶尔会有其他好处。)

注意:如果您在原始数据上执行此操作,R 显示 data.frame 的默认机制会用所有文本占用您的控制台,这可能很烦人。如果您已经在任何工作中使用 dplyrdata.table,那么这两种格式都提供字符串限制,因此在控制台上更容易接受。例如,显示 全部内容:

tibble::tibble(dplyr::bind_rows(lapply(raw_jobs_list, as.data.frame)))
# # A tibble: 2 x 42
#   id    score fields.date.cha~ fields.date.cre~ fields.date.clo~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.career_c~ fields.career_c~ fields.name fields.source.h~ fields.source.n~ fields.source.id fields.source.t~ fields.source.t~ fields.source.s~ fields.source.h~ fields.title fields.body
#   <chr> <int> <chr>            <chr>            <chr>            <chr>            <chr>                       <dbl>            <dbl>            <int> <chr>            <chr>            <chr>                       <int> <chr>       <chr>            <chr>                       <int> <chr>                       <int> <chr>            <chr>            <chr>        <chr>      
# 1 3594~     1 2020-04-18T00:3~ 2020-04-07T11:1~ 2020-04-17T00:0~ https://api.rel~ Mali                        -1.25             17.4              149 Mali             mli              Donor Relations~            20966 Bamako      https://api.rel~ ICCO COOPERATION            45059 Non-governmenta~              274 ICCO COOPERATION https://www.icc~ REGIONAL MA~ "**VACANCY~
# 2 3594~     1 2020-05-19T00:0~ 2020-05-04T15:2~ 2020-05-18T00:0~ <NA>             <NA>                        NA                NA                 NA <NA>             <NA>             Program/Project~             6867 <NA>        https://api.rel~ US Agency for I~             1751 Government                    271 USAID            http://www.usai~ Support Rel~ "### **SOL~
# # ... with 18 more variables: fields.type.name <chr>, fields.type.id <int>, fields.experience.name <chr>, fields.experience.id <int>, fields.url <chr>, fields.url_alias <chr>, fields.how_to_apply <chr>, fields.id <int>, fields.status <chr>, fields.body.html <chr>, fields.how_to_apply.html <chr>, href <chr>, fields.source.longname <chr>, fields.source.spanish_name <chr>,
# #   fields.theme.name <chr>, fields.theme.id <int>, fields.theme.name.1 <chr>, fields.theme.id.1 <int>

data.table::rbindlist(lapply(raw_jobs_list, as.data.frame), fill = TRUE)
#         id score       fields.date.changed       fields.date.created       fields.date.closing                     fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3     fields.career_categories.name fields.career_categories.id fields.name
#     <char> <int>                    <char>                    <char>                    <char>                                  <char>              <char>                       <num>                       <num>             <int>                   <char>              <char>                            <char>                       <int>      <char>
# 1: 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countri...                Mali                       -1.25                       17.35               149                     Mali                 mli Donor Relations/Grants Management                       20966      Bamako
# 2: 3594129     1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00                                    <NA>                <NA>                          NA                          NA                NA                     <NA>                <NA>        Program/Project Management                        6867        <NA>
# 27 variables not shown: [fields.source.href <char>, fields.source.name <char>, fields.source.id <int>, fields.source.type.name <char>, fields.source.type.id <int>, fields.source.shortname <char>, fields.source.homepage <char>, fields.title <char>, fields.body <char>, fields.type.name <char>, ...]

对于 data.table,我已经设置了一些选项来促进这一点。值得注意的是,我目前正在使用:

options(
  datatable.prettyprint.char = 36,
  datatable.print.topn = 10,
  datatable.print.class = TRUE,
  datatable.print.trunc.cols = TRUE
)

此时,您有一个 data.frame 应该包含所有数据(NA 用于缺少字段的元素)。从这里开始,如果您不喜欢嵌套名称约定(例如 fields.date.changed),则可以使用模式或常规方法轻松重命名它们。