Fast way to summarize a data frame across columns
I have this data.frame with five possible character states (genotypes):
genotypes <- c("0/0", "1/1", "0/1", "1/0", "./.")
library(dplyr)
set.seed(1)
df <- do.call(rbind, lapply(1:100, function(i)
  matrix(sample(genotypes, 30, replace = TRUE), nrow = 1,
         dimnames = list(NULL, paste0("V", 1:30))))) %>%
  data.frame()
I'd like to summarize each row into the number of:
ref.hom (0/0)
alt.hom (1/1)
het (0/1 or 1/0)
na (./.)
This seems slow:
sum.df <- do.call(rbind, lapply(1:nrow(df), function(i) {
  data.frame(ref.hom = length(which(df[i, ] == "0/0")),
             alt.hom = length(which(df[i, ] == "1/1")),
             # combine both conditions before which(), so each het call is counted once
             het = length(which(df[i, ] == "0/1" | df[i, ] == "1/0")),
             na = length(which(df[i, ] == "./.")))
}))
Is there a more efficient way, perhaps a dplyr-based one?
With dplyr, you can try:
df %>%
  transmute(ref.hom = rowSums(. == "0/0"),
            alt.hom = rowSums(. == "1/1"),
            het = rowSums(. == "0/1") + rowSums(. == "1/0"),
            na = rowSums(. == "./."))
ref.hom alt.hom het na
1 4 11 9 6
2 5 2 20 3
3 3 11 10 6
4 5 5 15 5
5 5 4 17 4
6 3 8 13 6
7 6 8 11 5
8 4 8 11 7
9 6 6 14 4
10 14 8 5 3
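The same row-wise counts can also be computed without dplyr by converting the data frame to a character matrix once. This is a minimal sketch, assuming a data frame of the same shape as the one in the question (it is rebuilt here so the block runs on its own, and the sampled values will not match the question's exactly); `m` and `sum.df2` are illustrative names.

```r
genotypes <- c("0/0", "1/1", "0/1", "1/0", "./.")
set.seed(1)
# 100 rows x 30 columns of random genotype strings
df <- data.frame(matrix(sample(genotypes, 100 * 30, replace = TRUE),
                        nrow = 100,
                        dimnames = list(NULL, paste0("V", 1:30))),
                 stringsAsFactors = FALSE)

# Convert once; each comparison then yields a logical matrix
m <- as.matrix(df)
sum.df2 <- data.frame(ref.hom = rowSums(m == "0/0"),
                      alt.hom = rowSums(m == "1/1"),
                      het     = rowSums(m == "0/1" | m == "1/0"),
                      na      = rowSums(m == "./."))
head(sum.df2)
```

Because every cell falls into exactly one of the four categories, each row of the result should sum to the number of columns (30 here), which makes for a quick sanity check.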
For genotyping data, I would use setDT(). You will save a lot of RAM.
library(data.table)
df$key <- 1:nrow(df)
df <- melt(setDT(df),id.vars = "key")
table(df$key, df$value)
# > head(table(df$key, df$value))
#
# ./. 0/0 0/1 1/0 1/1
# 1 6 6 4 7 7
# 2 6 3 8 5 8
# 3 7 3 5 5 10
# 4 4 8 1 7 10
# 5 5 9 4 3 9
# 6 9 2 6 8 5
# and
table(df$value)
# > table(df$value)
# ./. 0/0 0/1 1/0 1/1
# 620 581 601 584 614
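The tables above stop at per-genotype counts. To get the four columns the question asks for, the long table could be cast back to wide form. This is only a sketch under assumptions: the data frame is rebuilt here so the block is self-contained, and `dt`, `row_id`, `long`, `wide` and `res` are illustrative names rather than anything from the answer.

```r
library(data.table)

genotypes <- c("0/0", "1/1", "0/1", "1/0", "./.")
set.seed(1)
dt <- as.data.table(matrix(sample(genotypes, 100 * 30, replace = TRUE),
                           nrow = 100,
                           dimnames = list(NULL, paste0("V", 1:30))))
dt[, row_id := .I]                       # row identifier, added by reference

# Long format: one line per (row, column) cell
long <- melt(dt, id.vars = "row_id")

# Count genotypes per row, one column per genotype string
wide <- dcast(long, row_id ~ value, value.var = "value", fun.aggregate = length)

# Collapse the two heterozygous encodings into one column
res <- wide[, .(ref.hom = `0/0`,
                alt.hom = `1/1`,
                het     = `0/1` + `1/0`,
                na      = `./.`)]
head(res)
```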
Execution time check:
> time.taken.DT
Time difference of 0.005386114 secs
> time.taken.dplyr
Time difference of 0.08833909 secs
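The answer does not show how these timings were taken. One plausible way, assuming each `time.taken.*` value is simply a difference of two `Sys.time()` calls, is sketched below with a hypothetical helper `time_it()` and a base R stand-in for the code being timed.

```r
# Hypothetical helper: evaluate an expression once and return the elapsed time
time_it <- function(expr) {
  start <- Sys.time()
  force(expr)          # the lazily passed expression is evaluated here
  Sys.time() - start   # a "difftime", printed as "Time difference of ... secs"
}

genotypes <- c("0/0", "1/1", "0/1", "1/0", "./.")
set.seed(1)
m <- matrix(sample(genotypes, 100 * 30, replace = TRUE), nrow = 100)

# Stand-in workload; substitute the data.table or dplyr pipeline to compare them
time.taken.base <- time_it(rowSums(m == "0/0"))
time.taken.base
```

A single run on 100 rows is noisy, so repeating each expression many times (for example with `system.time()` in a loop) would give a fairer comparison.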
In base R, you can combine apply with table, which returns the counts of all possible levels in each row.
output <- t(apply(df, 1, table))
output
# ./. 0/0 0/1 1/0 1/1
#[1,] 7 8 4 3 8
#[2,] 5 7 4 9 5
#[3,] 6 5 6 5 8
#[4,] 4 7 9 6 4
#[5,] 6 5 6 5 8
#[6,] 8 8 2 7 5
#....
Later, if needed, you can merge columns into a single level, e.g. output[, 3] + output[, 4].
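One caveat with apply plus table: if some genotype happens to be absent from a row, that row's table is shorter than the others, and apply returns a list instead of a matrix. A minimal sketch of a more defensive variant, assuming the five genotype strings are known in advance (`m`, `counts` and `sum.mat` are illustrative names, and the data is rebuilt here so the block runs on its own):

```r
genotypes <- c("0/0", "1/1", "0/1", "1/0", "./.")
set.seed(1)
m <- matrix(sample(genotypes, 100 * 30, replace = TRUE), nrow = 100)

# Fixing the factor levels guarantees five counts per row,
# including zeros for genotypes a row does not contain
counts <- t(apply(m, 1, function(x) table(factor(x, levels = genotypes))))

# Merge the two heterozygous encodings by column name rather than by position
sum.mat <- cbind(ref.hom = counts[, "0/0"],
                 alt.hom = counts[, "1/1"],
                 het     = counts[, "0/1"] + counts[, "1/0"],
                 na      = counts[, "./."])
head(sum.mat)
```

Indexing by name avoids relying on the alphabetical ordering that puts 0/1 and 1/0 in columns 3 and 4.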
Another option is to gather the data into long format and count:
library(dplyr)
df %>%
mutate(row = row_number()) %>%
tidyr::gather(key, value, -row) %>%
count(row, value)
#If needed
#tidyr::spread(value, n)
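In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A sketch of the same idea with the newer functions, which also recodes the genotypes into the four requested categories; the `state` column and the `res` name are illustrative, and the data frame is rebuilt here so the block is self-contained.

```r
library(dplyr)
library(tidyr)

genotypes <- c("0/0", "1/1", "0/1", "1/0", "./.")
set.seed(1)
df <- data.frame(matrix(sample(genotypes, 100 * 30, replace = TRUE),
                        nrow = 100,
                        dimnames = list(NULL, paste0("V", 1:30))),
                 stringsAsFactors = FALSE)

res <- df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "key", values_to = "value") %>%
  # Map each genotype string to its summary category
  mutate(state = case_when(value == "0/0" ~ "ref.hom",
                           value == "1/1" ~ "alt.hom",
                           value %in% c("0/1", "1/0") ~ "het",
                           value == "./." ~ "na")) %>%
  count(row, state) %>%
  # One column per category; categories missing from a row become 0
  pivot_wider(names_from = state, values_from = n, values_fill = 0L)
head(res)
```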