Unique combinations on dataframe columns based on criteria from one row
I have a data.frame with more than 200 columns. A subset containing the columns relevant to this question is shown below:
>df
Variant Pos ID DB.0.count DB.1.count sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
variant5 1234567 A 5 5 1/0 1/0 1/0 1/1 1/1 0/0 1/0 0/0 1/0 1/1
. . . . . F1 F1 F1 F2 F2 F3 F4 F4 F4 F5
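I would like to:
1. Make all possible combinations of the columns sample1-sample10, where each combination contains one sample from each F number, i.e. each combination contains 5 samples, one each from F1, F2, F3, F4 and F5.
So in the example above there would be 18 combinations, for example:
The first combination is sample1, sample4, sample6, sample7, sample10
The second combination is sample1, sample4, sample6, sample8, sample10
The third combination is sample1, sample4, sample6, sample9, sample10
I tried unique, duplicated and distinct after reading related posts, but got nowhere.
I would then like to output each unique combination to a new data.frame, count the genotypes across the samples in that combination, write the counts to new columns, and run a Fisher's exact test with the p-value written to a further column, as shown below. The code below should do this: (Fisher's test code learned here:)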
# column names containing "/" must be quoted with backticks
df.combo.1$`pop.0/0.count` <- apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("0/0",u))==TRUE) )
df.combo.1$`pop.1/0.count` <- apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/0",u))==TRUE) )
df.combo.1$`pop.1/1.count` <- apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/1",u))==TRUE) )
# allele counts: heterozygotes are coded "1/0" in the data, so match that pattern
df.combo.1$pop.0.count <- ( 2*(apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("0/0",u))==TRUE) )) + apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/0",u))==TRUE) ) )
df.combo.1$pop.1.count <- ( 2*(apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/1",u))==TRUE) )) + apply(df.combo.1[,6:10], 1, function(u) length(which(grepl("1/0",u))==TRUE) ) )
res <- NULL
for (i in 1:nrow(df.combo.1)){
  table <- matrix(c(df.combo.1[i, 4], df.combo.1[i, 5], df.combo.1[i, 14], df.combo.1[i, 15]), ncol = 2, byrow = TRUE)
  # if any NA occurs in your table save an error in p else run the fisher test
  if(any(is.na(table))) p <- "error" else p <- fisher.test(table)$p.value
  # save all p values in a vector
  res <- c(res,p)
}
df.combo.1$fishers <- res
>df.combo.1
Variant Pos ID DB.0.count DB.1.count sample1 sample4 sample6 sample7 sample10 pop.0/0.count pop.1/0.count pop.1/1.count pop.0.count pop.1.count fishers
variant5 1234567 A 5 5 1/0 1/1 0/0 1/0 1/1 1 2 2 4 6 1.0000
. . . . . F1 F2 F3 F4 F5
2. Finally, I would like to create a data.frame that lists the Fisher's exact test p-value for each unique combination, like this:
>new.df
combo fishers
1 1.0000
2 1.0000
3 1.0000
4 1.0000
etc
I think the whole exercise probably needs some kind of for loop?
I think I have a handle on what you want. For the problem I believe you were having in part 1, I used a combination of which and expand.grid.
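To make the idea concrete, here is a minimal sketch on the F labels alone. The names `f.labels` and `combos` are made up for illustration; the label values are copied from the example row in the question.

```r
# F labels for sample1..sample10, copied from the example in the question
f.labels <- c("F1", "F1", "F1", "F2", "F2", "F3", "F4", "F4", "F4", "F5")

# which() returns the positions (here: sample numbers) carrying a given label
which(f.labels == "F4")

# expand.grid() builds one row per way of picking one position from each group
combos <- expand.grid(F1 = which(f.labels == "F1"),
                      F2 = which(f.labels == "F2"),
                      F3 = which(f.labels == "F3"),
                      F4 = which(f.labels == "F4"),
                      F5 = which(f.labels == "F5"))

nrow(combos)  # 3 * 2 * 1 * 3 * 1 = 18
```

Note that expand.grid varies its first argument fastest, so the rows come out in a different order from the listing in the question, but the set of 18 combinations is the same.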
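For part 2, once the data are arranged with one row per observation, it is a fairly easy cbind.
It looks like you are using 2 rows per observation (unless that is just a formatting thing), which makes this really hard (not impossible, it just needs more juggling), so I merged the data into a single row. That should be a fairly simple transformation: append the appropriate columns of each 'second' row to its 'first' row, then delete every second row.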
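A sketch of that transformation, assuming the genotype row and the F-label row really do alternate in the data.frame and that columns 6 onward are the sample columns. `raw.df`, `odd`, `even`, `labels` and `wide.df` are illustrative names, not objects from the question, and only three sample columns are shown.

```r
# Toy two-row layout: a genotype row followed by its F-label row
raw.df <- data.frame(
  Variant    = c("variant5", "."),
  Pos        = c("1234567", "."),
  ID         = c("A", "."),
  DB.0.count = c("5", "."),
  DB.1.count = c("5", "."),
  sample1    = c("1/0", "F1"),
  sample2    = c("1/1", "F2"),
  sample3    = c("0/0", "F3"),
  stringsAsFactors = FALSE
)

odd  <- seq(1, nrow(raw.df), by = 2)  # 'first' rows: variant info + genotypes
even <- seq(2, nrow(raw.df), by = 2)  # 'second' rows: F labels

# Take the label columns from the 'second' rows and rename them
sample.cols <- 6:ncol(raw.df)
labels <- raw.df[even, sample.cols, drop = FALSE]
names(labels) <- paste0(names(labels), ".F")

# Append the label columns to the 'first' rows; the 'second' rows are dropped
wide.df <- cbind(raw.df[odd, , drop = FALSE], labels)
rownames(wide.df) <- NULL
```

With the full data, each variant ends up on one row, with the genotype columns followed by the matching `.F` label columns.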
This could be done more efficiently and more neatly, but I think it works, and it should be fairly easy to extend to other cases.
Regards,
Josh
# provided demo data
# Variant Pos ID DB.0.count DB.1.count sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
# variant5 1234567 A 5 5 1/0 1/0 1/0 1/1 1/1 0/0 1/0 0/0 1/0 1/1
# . . . . . F1 F1 F1 F2 F2 F3 F4 F4 F4 F5
# create data frame with everything on a single row (genotypes first, then the F labels)
test.df <- as.data.frame(t(c("variant5",1234567,"A",5,5,"1/0","1/0","1/0","1/1","1/1","0/0","1/0","0/0","1/0","1/1","F1", "F1", "F1", "F2", "F2", "F3", "F4", "F4", "F4", "F5")))
# ensure as character format
test.df[] <- lapply(test.df, as.character)
# get positions (column indices) of the "F" data
F1.var <- which(test.df =="F1")
F2.var <- which(test.df =="F2")
F3.var <- which(test.df =="F3")
F4.var <- which(test.df =="F4")
F5.var <- which(test.df =="F5")
# get all combinations of the 5 F positions
Fcode.combinations <- expand.grid(F1.var,F2.var,F3.var,F4.var,F5.var)
# create results data frame
df.combo.1 <- as.data.frame(matrix(NA,ncol = 21, nrow = nrow(Fcode.combinations)))
# name variables
names(df.combo.1) <- c("Variant","Pos","ID","DB.0.count","DB.1.count",
"F1.sample.pos","F1.result",
"F2.sample.pos","F2.result",
"F3.sample.pos","F3.result",
"F4.sample.pos","F4.result",
"F5.sample.pos","F5.result",
"pop.0_0.count","pop.1_0.count","pop.1_1.count",
"pop.0.count","pop.1.count",
"fishers")
# copy in common data
df.combo.1[,1:5] <- test.df[,1:5]
# setup variables based on combination data
for(i in 1:nrow(Fcode.combinations)){
  df.combo.1[i,c(6,8,10,12,14)] <- Fcode.combinations[i,]
  # -10 to move from the position of the 'F type' data to the position of the matching result
  cycle.results <- as.numeric(Fcode.combinations[i,] -10)
  df.combo.1[i,c(7,9,11,13,15)] <- test.df[cycle.results]
}
# this is essentially your code with the column references changed
result.cols <- c(7,9,11,13,15)
df.combo.1$pop.0_0.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(grepl("0/0", u, fixed = TRUE)))
df.combo.1$pop.1_0.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(grepl("1/0", u, fixed = TRUE)))
df.combo.1$pop.1_1.count <- apply(df.combo.1[,result.cols], 1, function(u) sum(grepl("1/1", u, fixed = TRUE)))
# allele counts: two per homozygote plus one per heterozygote
# (heterozygotes are coded "1/0" in the data, so "0/1" would never match)
df.combo.1$pop.0.count <- 2*df.combo.1$pop.0_0.count + df.combo.1$pop.1_0.count
df.combo.1$pop.1.count <- 2*df.combo.1$pop.1_1.count + df.combo.1$pop.1_0.count
res <- NULL
for (i in 1:nrow(df.combo.1)){
  # 2x2 table: DB allele counts (columns 4, 5) vs. combination allele counts
  # pop.0.count and pop.1.count, which sit in columns 19 and 20 of this layout
  table <- matrix(as.numeric(c(df.combo.1[i, 4], df.combo.1[i, 5], df.combo.1[i, 19], df.combo.1[i, 20])), ncol = 2, byrow = TRUE)
  # if any NA occurs in the table store NA (keeps the p values numeric), else run the fisher test
  if(any(is.na(table))) p <- NA_real_ else p <- fisher.test(table)$p.value
  # save all p values in a vector
  res <- c(res,p)
}
df.combo.1$fishers <- res
# create results data: one row per combination with its Fisher p value
# (data.frame() keeps each column's own type, unlike cbind())
df.combo.1.results <- data.frame(combo = seq_len(nrow(df.combo.1)), fishers = df.combo.1$fishers)
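For comparison, here is a more compact sketch of the same calculation, in the spirit of the remark above that this could be done more neatly. It assumes genotypes are always written as "0/0", "1/0" or "1/1" (as in the example) and counts alleles by splitting on "/". The helper `fisher.for.combo` and the other object names are made up for this example.

```r
# Example row from the question
genotypes <- c(sample1 = "1/0", sample2 = "1/0", sample3 = "1/0",
               sample4 = "1/1", sample5 = "1/1", sample6 = "0/0",
               sample7 = "1/0", sample8 = "0/0", sample9 = "1/0",
               sample10 = "1/1")
f.labels  <- c("F1", "F1", "F1", "F2", "F2", "F3", "F4", "F4", "F4", "F5")
db.counts <- c(5, 5)  # DB.0.count, DB.1.count

# One row per combination; each column holds sample names from one F group
combos <- expand.grid(split(names(genotypes), f.labels),
                      stringsAsFactors = FALSE)

# Hypothetical helper: allele counts for one combination, then Fisher's test
fisher.for.combo <- function(samples) {
  alleles <- unlist(strsplit(genotypes[samples], "/", fixed = TRUE))
  pop.counts <- c(sum(alleles == "0"), sum(alleles == "1"))
  fisher.test(matrix(c(db.counts, pop.counts), ncol = 2, byrow = TRUE))$p.value
}

# Results table in the shape asked for in part 2
new.df <- data.frame(
  combo   = seq_len(nrow(combos)),
  fishers = apply(combos, 1, fisher.for.combo)
)
```

For the first combination (sample1, sample4, sample6, sample7, sample10) the table passed to fisher.test is 5, 5 over 4, 6, which matches the pop.0.count and pop.1.count values in the question's expected output.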