Aggregating binary columns into one column is taking a long time in R
I am running some code on a large dataset with 1.5 million rows. It has been running for several hours and still has not finished.
My data looks like this:
ID London Paris Rome
1 Yes No Yes
2 No No Yes
3 No Yes Yes
4 No Yes No
I want to add one column listing all the cities each ID has been to, and another column with the number of cities each ID has been to, like this:
ID London Paris Rome All Cities Count of Cities travelled
1 Yes No Yes London, Rome 2
2 No No Yes Rome 1
3 No Yes Yes Paris, Rome 2
4 No Yes No Paris 1
I am running the code below, which works fine on a sample of 100 rows:
cities <- c('London', 'Paris', 'Rome')
df %>%
rowwise %>%
mutate(`All Cities` = toString(names(.[, cities])[which(c_across(all_of(cities)) == 'Yes')]),
`Count of Cities travelled` = sum(c_across(all_of(cities)) == 'Yes'))
Is there any way to improve this code, or to shorten the running time?
Thanks!
Here is a tidyverse approach that avoids rowwise(), which is known to be very slow on large data.
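library(tidyverse)
cities <- c('London', 'Paris', 'Rome')
df <- read.table(header = TRUE, text = "ID London Paris Rome
1 Yes No Yes
2 No No Yes
3 No Yes Yes
4 No Yes No")
df %>%
  # Helper columns (London1, ...): the city name where the value is "Yes", otherwise NA
  mutate(across(all_of(cities), ~ifelse(.x == "Yes", cur_column(), NA), .names = "{.col}1")) %>%
  # Paste the helper columns together, dropping the NAs
  unite(`All Cities`, ends_with("1"), sep = ", ", na.rm = TRUE) %>%
  # Count the "Yes" values directly, so an ID with no cities gets 0 rather than 1
  mutate(`Count of Cities travelled` = rowSums(across(all_of(cities)) == "Yes", na.rm = TRUE))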
ID London Paris Rome All Cities Count of Cities travelled
1 1 Yes No Yes London, Rome 2
2 2 No No Yes Rome 1
3 3 No Yes Yes Paris, Rome 2
4 4 No Yes No Paris 1
A possible solution in base R:
df$Cities <- apply(df, 1, \(x) paste(names(df[-1])[x[-1] == "Yes"], collapse = ", "))
df$N <- apply(df, 1, \(x) sum(x[-1] == "Yes"))
df
#> ID London Paris Rome Cities N
#> 1 1 Yes No Yes London, Rome 2
#> 2 2 No No Yes Rome 1
#> 3 3 No Yes Yes Paris, Rome 2
#> 4 4 No Yes No Paris 1
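Note that apply() still visits the data one row at a time, so on 1.5 million rows a column-wise variant may scale better. The following is only a minimal sketch under some assumptions: the city columns contain just "Yes" or "No" with no missing values, and the names `visited` and `labels` are illustrative, not from the question. It loops over the three city columns instead of over the rows.

```r
# Sample data from the question
df <- read.table(header = TRUE, text = "ID London Paris Rome
1 Yes No Yes
2 No No Yes
3 No Yes Yes
4 No Yes No")

cities <- c("London", "Paris", "Rome")

# Logical matrix: TRUE where the ID has been to the city
visited <- df[cities] == "Yes"

# Build the labels one city at a time (one pass per column, not per row)
labels <- rep("", nrow(df))
for (city in cities) {
  hit <- visited[, city]
  # Add a separator only where a city is appended to a non-empty label
  sep <- ifelse(hit & nzchar(labels), ", ", "")
  labels <- paste0(labels, sep, ifelse(hit, city, ""))
}
df$Cities <- labels

# Count the visited cities for all rows in one vectorized call
df$N <- rowSums(visited)

# Cities: "London, Rome", "Rome", "Paris, Rome", "Paris"; N: 2, 1, 2, 1
```

An ID with no "Yes" at all ends up with an empty string in Cities and 0 in N.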
With dplyr and rowwise:
library(dplyr)
library(stringr)  # str_c() comes from stringr, not dplyr
df %>%
  rowwise %>%
  mutate(Cities = str_c(colnames(df[-1])[c_across(2:4) == "Yes"], collapse = ", "),
         N = sum(c_across(2:4) == "Yes")) %>%
  ungroup
#> # A tibble: 4 × 6
#> ID London Paris Rome Cities N
#> <int> <chr> <chr> <chr> <chr> <int>
#> 1 1 Yes No Yes London, Rome 2
#> 2 2 No No Yes Rome 1
#> 3 3 No Yes Yes Paris, Rome 2
#> 4 4 No Yes No Paris 1
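Before committing to one approach on the full 1.5 million rows, it may help to time the candidates on simulated data of a moderate size. The sketch below is an assumption-laden illustration rather than a proper benchmark: the data are randomly generated, the size `n` and the seed are arbitrary, and only the count column is compared, using base R's system.time().

```r
set.seed(1)   # arbitrary seed, only for reproducibility
n <- 20000    # hypothetical size; increase it to see how each approach scales
cities <- c("London", "Paris", "Rome")

# Simulated data with the same layout as the question
big <- data.frame(ID = seq_len(n))
for (city in cities) {
  big[[city]] <- sample(c("Yes", "No"), n, replace = TRUE)
}

# Row-by-row count, in the style of the apply() answer
t_apply <- system.time(
  n_apply <- apply(big, 1, function(x) sum(x[-1] == "Yes"))
)

# Vectorized count over the whole logical matrix at once
t_vec <- system.time(
  n_vec <- rowSums(big[cities] == "Yes")
)

# Both approaches should agree before their timings are compared
stopifnot(all(n_apply == n_vec))

# Elapsed seconds for each approach
print(c(apply = t_apply[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```

Timings depend on the machine and on R's version, so the printed numbers are only a rough guide to which approach is worth running on the real data.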