Pivoting a table with multiple binary measures of the same variable in R

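I have a data.frame with several hundred columns. Each column is a binary ("Yes"/"No") measure of one possible outcome, and the columns could be collapsed into a short list of multiple-choice variables (none of the collapsed variables has more than one YES). I am trying to work out how to collapse them, but the only solution I can come up with is inelegant and slow on my large dataset.

Here is a toy example to explain what I mean: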

library(tibble)  # provides tribble()

groceries <- tribble(~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
                     1, "N", "N", "Y", "N", "N", "Y",
                     2, "Y", "N", "N", "N", "N", "Y",
                     3, "Y", "N", "N", "N", "Y", "N",
                     4, "N", "Y", "N", "Y", "N", "N")

# A tibble: 4 x 7
   item prod_potato prod_apple prod_banana day_monday day_tuesday day_wednesday
  <dbl> <chr>       <chr>      <chr>       <chr>      <chr>       <chr>        
1     1 N           N          Y           N          N           Y            
2     2 Y           N          N           N          N           Y            
3     3 Y           N          N           N          Y           N            
4     4 N           Y          N           Y          N           N     

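Each item can only be one of potato, banana, or apple, and can only be bought on one specific day, so having all these separate columns is not really useful.

The result I want looks like this: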

    item    prod    day
       1    banana  wednesday
       2    potato  wednesday
       3    potato  tuesday
       4    apple   monday

Here is the solution I came up with. It gets the job done, but it is not very nice:

groceries %>%
  pivot_longer(2:4, names_to = "prod", names_prefix = "prod_") %>%
  filter(value == "Y") %>%
  select(-value) %>%
  pivot_longer(2:4, names_to = "day", names_prefix = "day_") %>%
  filter(value == "Y") %>%
  select(-value)
# A tibble: 4 x 3
   item prod   day      
  <dbl> <chr>  <chr>    
1     1 banana wednesday
2     2 potato wednesday
3     3 potato tuesday  
4     4 apple  monday   

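But I am 100% sure there is a less cumbersome solution that does not require me to repeat this process for some 20+ collapsed variables.

My absolute ideal solution would group the columns by the string before the _ and use that as the column name, and take the string after the _ as the value wherever the original variable is "YES". But I am open to a slightly more manual solution where I specify the columns to group and the variable name each time.

Can anyone suggest a solution (preferably tidyverse; I am sure data.table has a super efficient solution, but I will never be able to understand it)?

Thanks!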

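You can use pivot_longer to split all the column names into group and options, then reduce to the rows where the value is Y (I used summarize just to drop the value column, but filter could easily be used here instead), and finish with pivot_wider.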

library(dplyr)
library(tidyr)
# library(tidyverse)

groceries %>%
  pivot_longer(-item,
               names_to = c("group", "options"),
               names_sep = "_") %>%
  group_by(item, group) %>%
  summarize(options = options[value == "Y"],
            .groups = "drop") %>%
  pivot_wider(names_from = "group",
              values_from = "options")
#> # A tibble: 4 × 3
#>    item day       prod  
#>   <dbl> <chr>     <chr> 
#> 1     1 wednesday banana
#> 2     2 wednesday potato
#> 3     3 tuesday   potato
#> 4     4 monday    apple
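
The answer above notes that filter could replace summarize. A minimal sketch of that variant follows, assuming the toy groceries data from the question (rebuilt here so the snippet is self-contained); the names `option` and `result` are illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data copied from the question
groceries <- tribble(
  ~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
  1, "N", "N", "Y", "N", "N", "Y",
  2, "Y", "N", "N", "N", "N", "Y",
  3, "Y", "N", "N", "N", "Y", "N",
  4, "N", "Y", "N", "Y", "N", "N"
)

result <- groceries %>%
  # Split each column name at "_" into a group (prod, day) and an option
  pivot_longer(-item, names_to = c("group", "option"), names_sep = "_") %>%
  # Keep only the option that was chosen, then drop the now-constant flag
  filter(value == "Y") %>%
  select(-value) %>%
  # One column per group, holding the chosen option
  pivot_wider(names_from = group, values_from = option)

result
```

This assumes every column name contains exactly one underscore; names_sep = "_" would not split names with additional underscores into two clean pieces.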

A data.table option

library(data.table)
setDT(groceries)

dt <- melt(groceries, id.vars = c("item"))[value == "Y"]
dt[, c("A", "B") := tstrsplit(variable, "_")]
dcast(dt, item ~ A, value.var = c("B"))

#    item       day   prod
# 1:    1 wednesday banana
# 2:    2 wednesday potato
# 3:    3   tuesday potato
# 4:    4    monday  apple

Or as a one-liner:

dcast(melt(groceries, id.vars = c("item"))[value == "Y"][, c("A", "B") := tstrsplit(variable, "_")], item ~ A, value.var = c("B"))

A base R option

dfout <- reshape(
  transform(
    subset(
      cbind(groceries[1], stack(groceries[-1])),
      values == "Y"
    ),
    p = gsub("_.*", "", ind),
    q = gsub(".*_", "", ind)
  )[c("item", "p", "q")],
  direction = "wide",
  idvar = "item",
  timevar = "p"
)

dfout[order(dfout$item), ]

which gives

  item q.prod     q.day
9    1 banana wednesday
2    2 potato wednesday
3    3 potato   tuesday
8    4  apple    monday
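
All of these approaches rely on the question's premise that each item has exactly one "Y" per group of columns. On a table with hundreds of columns it may be worth checking that first. The following is only a sketch of such a check on the toy data; `y_counts` and `violations` are hypothetical names introduced for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data copied from the question
groceries <- tribble(
  ~item, ~prod_potato, ~prod_apple, ~prod_banana, ~day_monday, ~day_tuesday, ~day_wednesday,
  1, "N", "N", "Y", "N", "N", "Y",
  2, "Y", "N", "N", "N", "N", "Y",
  3, "Y", "N", "N", "N", "Y", "N",
  4, "N", "Y", "N", "Y", "N", "N"
)

# Count the "Y" flags for every item within every column group
y_counts <- groceries %>%
  pivot_longer(-item, names_to = c("group", "option"), names_sep = "_") %>%
  group_by(item, group) %>%
  summarize(n_yes = sum(value == "Y"), .groups = "drop")

# Any row here is an item/group pair with zero or several "Y" values
violations <- filter(y_counts, n_yes != 1)
violations
```

If `violations` is empty, each item/group pair has exactly one chosen option and the reshaping yields one value per cell.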