One-hot encode columns where values are data.frames in R?

I have a tibble in which every column is a list of data frames, e.g.

library(tibble)

df = tibble(age = list(data.frame(21), data.frame(57), NULL, data.frame(36)),
            role = list(data.frame('scavenger', 'cleaner'), data.frame('cleaner'), NULL, data.frame('cleaner', 'scavenger', 'hunter')),  
            planet = list(data.frame('jupiter'), data.frame('earth'), data.frame('mars'), data.frame('mars')))
# A tibble: 4 x 3
  age              role             planet          
  <list>           <list>           <list>          
1 <df[,1] [1 × 1]> <df[,2] [1 × 2]> <df[,1] [1 × 1]>
2 <df[,1] [1 × 1]> <df[,1] [1 × 1]> <df[,1] [1 × 1]>
3 <NULL>           <NULL>           <df[,1] [1 × 1]>
4 <df[,1] [1 × 1]> <df[,3] [1 × 3]> <df[,1] [1 × 1]>

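I would like to one-hot encode the columns that contain data frames of size 1 x n (n > 1), i.e. columns where some rows hold multiple values (for example, the role column should be replaced by the one-hot encoded columns scavenger, cleaner and hunter). In addition, where the data frames are of size 1 x 1, the cell should be replaced by the single value it contains: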

# A tibble: 4 x 5
    age scavenger    cleaner    hunter    planet 
  <dbl> <bool>       <bool>     <bool>    <chr>  
1    21 1            1          0         jupiter
2    57 0            1          0         earth  
3    NA <NA>         <NA>       <NA>      mars   
4    36 1            1          1         mars   

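If the columns with a single value per row were not data frames, I could just use tidyr functions, but unfortunately that would, for example, create a separate column for each age (which I do not want).

How can I achieve this?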

Here is a function that achieves this.

library(tibble)
library(tidyr)

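# A few helper functions

# Replace NULL (zero-length) cells with NA
null_to_na <- function(a) {
    a[sapply(a, function(x) length(x) == 0L)] <- NA
    a
}

# Replace NULL cells with an NA placeholder so unnest() keeps the row
null_to_dfna <- function(a) {
    a[sapply(a, function(x) length(x) == 0L)] <- data.frame(NA)
    a
}

# TRUE if at least one cell of the column holds more than one value
col_has_multiple_vals <- function(a) {
    NROW(unlist(null_to_na(a))) > NROW(a)
}

# Main function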
unnest_plus <- function(df) {
    condition = sapply(df, col_has_multiple_vals)

    multi_cols = which(condition)
    singl_cols = which(!condition)

    # Convert 1x1 df col entries to their values
    df[singl_cols] = lapply(df[singl_cols], function(x) unlist(null_to_na(x)))

    # unnest will drop rows with NULL values, so clean them first
    df[multi_cols] = lapply(df[multi_cols], function(x) null_to_dfna(x))

    # unnest / one-hot encode cols with multiple vals per cell
    df = unnest(df, cols=all_of(multi_cols))

    # drop old cols due to unnest behaving weirdly with NA's
    df = df[!(names(df) %in% names(condition[multi_cols]))]

    df
}
> unnest_plus(df)

# A tibble: 4 x 5
    age X.scavenger. X.cleaner. X.hunter. planet 
  <dbl> <chr>        <chr>      <chr>     <chr>  
1    21 scavenger    cleaner    <NA>      jupiter
2    57 <NA>         cleaner    <NA>      earth  
3    NA <NA>         <NA>       <NA>      mars   
4    36 scavenger    cleaner    hunter    mars  

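Note: when creating the dummy variables, unnest does not distinguish between NA (row 3, where the cell was NULL) and an absent value (which should be 0); both appear as <NA>. The columns also hold the original strings rather than 0/1 indicators.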
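
If true 0/1 indicators are needed, one possible approach is to build them by hand rather than relying on unnest. The following is only a sketch under a few assumptions: every cell is either NULL or a one-row data frame, the helper names cell_values and one_hot_list_cols are made up for illustration, and the values within one multi-valued column are usable as column names.

```r
library(tibble)

df <- tibble(
  age    = list(data.frame(21), data.frame(57), NULL, data.frame(36)),
  role   = list(data.frame('scavenger', 'cleaner'), data.frame('cleaner'), NULL,
                data.frame('cleaner', 'scavenger', 'hunter')),
  planet = list(data.frame('jupiter'), data.frame('earth'),
                data.frame('mars'), data.frame('mars'))
)

# Hypothetical helper: flatten one cell (a 1 x n data frame) to a plain vector.
# NULL cells stay NULL; factors are converted to character first.
cell_values <- function(cell) {
  if (length(cell) == 0L) return(NULL)
  cols <- lapply(cell, function(x) if (is.factor(x)) as.character(x) else x)
  unlist(cols, use.names = FALSE)
}

# Hypothetical main function: one-hot encode multi-valued list columns and
# flatten single-valued ones.
one_hot_list_cols <- function(df) {
  out <- list()
  for (nm in names(df)) {
    cells <- lapply(df[[nm]], cell_values)
    lens <- lengths(cells)
    if (any(lens > 1L)) {
      # Multi-valued column: one integer indicator column per distinct value.
      # Rows whose cell was NULL get NA instead of 0.
      for (lv in as.character(unique(unlist(cells)))) {
        out[[lv]] <- vapply(cells, function(v) {
          if (is.null(v)) NA_integer_ else as.integer(lv %in% v)
        }, integer(1))
      }
    } else {
      # Single-valued column: take the value, or NA where the cell was NULL
      cells[lens == 0L] <- NA
      out[[nm]] <- unlist(cells)
    }
  }
  as_tibble(out)
}

res <- one_hot_list_cols(df)
print(res)
```

On the example data this should give the columns age, scavenger, cleaner, hunter and planet, with row 3 (the NULL row) as NA in the indicator columns and 0 wherever a role is simply absent. The indicators are integers here rather than logicals, which is a choice made for this sketch.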