Aggregating binary columns into one column is taking a long time in R

I am running a piece of code on a large dataset of 1.5 million rows. It has been running for several hours and still has not finished. My data looks like this:

ID    London   Paris   Rome
1       Yes     No      Yes
2       No      No      Yes
3       No      Yes     Yes
4       No      Yes     No

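I want to add a column listing all the cities each ID has travelled to, and a column giving the number of cities each ID has travelled to, like this: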

ID    London   Paris   Rome    All Cities      Count of Cities travelled
1       Yes     No      Yes    London, Rome                2
2       No      No      Yes     Rome                       1
3       No      Yes     Yes    Paris, Rome                 2
4       No      Yes     No     Paris                       1

I am running the code below, and it works fine on a sample of 100 rows:

cities <- c('London', 'Paris', 'Rome')

df %>%
  rowwise %>%
  mutate(`All Cities` = toString(names(.[, cities])[which(c_across(all_of(cities)) == 'Yes')]),
         `Count of Cities travelled` = sum(c_across(all_of(cities)) == 'Yes'))

Is there any way to improve this code, or to shorten the running time?

Thanks!

Here is a tidyverse approach that avoids rowwise(), which is known to be very slow.

library(tidyverse)

cities <- c('London', 'Paris', 'Rome')
df <- read.table(header = TRUE, text = "ID    London   Paris   Rome
1       Yes     No      Yes
2       No      No      Yes
3       No      Yes     Yes
4       No      Yes     No")

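df %>% 
  # all_of() makes it explicit that `cities` is an external character vector
  mutate(across(all_of(cities), ~ifelse(.x == "Yes", cur_column(), NA), .names = "{.col}1")) %>% 
  unite(`All Cities`, ends_with("1"), sep = ", ", na.rm = TRUE) %>% 
  # count the "Yes" cells directly, so a row with no cities gets 0 rather than 1
  mutate(`Count of Cities travelled` = rowSums(across(all_of(cities)) == "Yes"))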

  ID London Paris Rome   All Cities Count of Cities travelled
1  1    Yes    No  Yes London, Rome                         2
2  2     No    No  Yes         Rome                         1
3  3     No   Yes  Yes  Paris, Rome                         2
4  4     No   Yes   No        Paris                         1

A possible base R solution:

df$Cities <- apply(df, 1, \(x) paste(names(df[-1])[x[-1] == "Yes"], collapse = ", "))
df$N <- apply(df, 1, \(x) sum(x[-1] == "Yes"))
df

#>   ID London Paris Rome       Cities N
#> 1  1    Yes    No  Yes London, Rome 2
#> 2  2     No    No  Yes         Rome 1
#> 3  3     No   Yes  Yes  Paris, Rome 2
#> 4  4     No   Yes   No        Paris 1
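
Since apply() still calls a function once per row, it may also be slow on 1.5 million rows. As a rough sketch (not benchmarked here), the same result can be built by looping over the handful of city columns instead of over the rows, so every step is vectorized. The names `visited` and `all_cities` are just illustrative, the fifth row is added only to show the no-cities case, and the code assumes the city columns contain only "Yes" or "No" with no NA values.

```r
# Toy data in the question's layout, plus one ID that visited no city.
df <- data.frame(
  ID     = 1:5,
  London = c("Yes", "No",  "No",  "No",  "No"),
  Paris  = c("No",  "No",  "Yes", "Yes", "No"),
  Rome   = c("Yes", "Yes", "Yes", "No",  "No")
)
cities <- c("London", "Paris", "Rome")

# Logical matrix: one row per ID, one column per city.
visited <- df[cities] == "Yes"

# Count of cities per row, computed in one vectorized call.
df$N <- rowSums(visited)

# Build the city list by looping over columns (3 iterations), not rows.
all_cities <- character(nrow(df))
for (city in cities) {
  hit <- visited[, city]
  # Add a separator only where this city is a hit and something is already there.
  sep <- ifelse(hit & nzchar(all_cities), ", ", "")
  all_cities <- paste0(all_cities, sep, ifelse(hit, city, ""))
}
df$Cities <- all_cities

df
```

An ID with no "Yes" values ends up with an empty string and a count of 0.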

Using dplyr with rowwise():

library(dplyr)
library(stringr)  # provides str_c()

df %>%
  rowwise %>%
  mutate(Cities = str_c(colnames(df[-1])[c_across(2:4) == "Yes"], collapse = ", "),
         N = sum(c_across(2:4) == "Yes")) %>%
  ungroup

#> # A tibble: 4 × 6
#>      ID London Paris Rome  Cities           N
#>   <int> <chr>  <chr> <chr> <chr>        <int>
#> 1     1 Yes    No    Yes   London, Rome     2
#> 2     2 No     No    Yes   Rome             1
#> 3     3 No     Yes   Yes   Paris, Rome      2
#> 4     4 No     Yes   No    Paris            1