R 是否有办法找出哪些数字对总和有贡献? (去卷积?)
Does R have a way to figure out which numbers contributed to a sum? (deconvolution?)
我正在处理来自研究荟萃分析的数据,其中并非每项研究都有每个遗传变异的数据。我正在尝试解决以下问题:
- 我有一个数据框(“研究”),其中包含有关 10 项研究的信息,其中包括每项研究中的案例和对照的数量以及总研究样本量(N = 案例 + 对照)。
- 我还有一个包含大约 1000 万行的数据框(“原始结果”),每一行都是通过对 table 研究中样本量 (N) 的一些未知子集求和生成的。
- 我需要弄清楚哪些研究相加才能得出每个总数
以下是数据的示例:
> head(studies,2)
study cases controls N
1 A 3747 8024 11771
2 B 5367 5780 11147
> head(rawresults,2)
ID N
1 rs58241367 65280
2 rs85436064 107624
这是我想从中得到的一个例子(可选的 contributing_studies 列,如果这样做允许更好的解决方案,我可以去掉它):
> head(final,2)
ID contributing_studies cases controls N
1 rs10685984 B,C,D,F,G 26221 19987 46208
2 rs12123751 A,C,D,G,J 25631 23509 49140
到目前为止,我解决这个问题的最好办法是暴力破解。对于给定的总数,十项研究中的每一项都有两种可能的状态(有贡献与无贡献),因此有 2^10 = 1024 个可能的总和。有些总和可能不是唯一的(可能有不止一个研究 N 的组合可以创建该总和),我计划将那些不明确的排除在外。我在下面包含了代码和该解决方案的示例作为答案。
我想问的是:R有更好的方法吗?也许存在用于处理此类问题的某些库或函数?或者我可以做些什么来让它更快更高效?
下面是模拟场景数据的代码:
set.seed(1)
# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls
# Make "rawresults"
rawresults <- data.frame(character(length=50),numeric(length=50),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
for(i in seq(1,50)) {
numstudies <- sample(seq(5,10),1)
rawresults[i,'N'] <- sum(sample(studies$N,numstudies))
rawresults[i,'ID'] <- paste0('rs',sample(seq(1,99999999),1))
}
编辑:为场景模拟数据的更快代码,因此可以模拟数百万行原始结果。受以下 Allan Cameron 解决方案的启发,该解决方案也使用 combn.
set.seed(1)
# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls
# Make "rawresults"
num_results <- 50 # Number of results to simulate
possible_ns <- unlist(sapply(1:10,combn,x=studies$N,sum))
rawresults <- data.frame(paste0('rs',sample(1:99999999,num_results)),sample(possible_ns,num_results,rep=T),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
这是我使用蛮力方法得出的解决方案。
我就是问问题的人,我希望有人能有比我更好的解决方案,例如。 R 中的一些东西意味着可以应用于这个问题的总和的反卷积(我认为这是正确的词?)。
#### Bruce-force generate all possible combinations of studies ####
sumstudies <- function(whichstudies,whichcolumn) {
# Convert integer "whichstudies" to binary, then use the binary digits to decide which studies are included or excluded for this combination
in_or_out <- as.logical(intToBits(whichstudies)[1:10])
# Return appropriate combination of data from included studies (sum if numeric, paste otherwise)
if(is.numeric(studies[,whichcolumn])) {
return(sum(studies[in_or_out,whichcolumn]))
} else {
return(paste(studies[in_or_out,whichcolumn],collapse=','))
}
}
# Create a data frame with all 1024 possible combinations of studies
allcombos <- data.frame(matrix(nrow=1024,ncol=4))
colnames(allcombos) <- c('contributing_studies','cases','controls','N')
allcombos$contributing_studies <- sapply(seq(1,1024),sumstudies,'study')
allcombos$N <- sapply(seq(1,1024),sumstudies,'N')
allcombos$cases <- sapply(seq(1,1024),sumstudies,'cases')
allcombos$controls <- sapply(seq(1,1024),sumstudies,'controls')
# Get rid of Ns that can be made by summing more than one different combination of studies, since we wouldn't know which solution was correct
duplicates <- duplicated(allcombos$N) | duplicated(allcombos$N,fromLast=T)
allcombos[duplicates,] <- NA # Set all affected rows to NA
#### Match the data about all possible combinations to the Ns in rawresults ####
final <- merge(rawresults,allcombos,by='N',all.x=T,all.y=F)
final <- final[,c('ID','contributing_studies','cases','controls','N')]
我认为肯定有一种方法可以通过使用 combn
函数使代码更短一些:
getSum <- function(x) colSums(matrix(studies$N[combn(1:10, x)], nrow = x))
getInd <- function(x) apply(combn(1:10, x), 2, function(y) paste(y, collapse = ", "))
all_sums <- do.call(c, lapply(1:10, getSum))
all_inds <- do.call(c, lapply(1:10, getInd))
rawresults$studies <- all_inds[match(rawresults$N, all_sums)]
这产生:
rawresults
#> ID N studies
#> 1 rs58241367 65280 2, 4, 6, 7, 8, 9
#> 2 rs85436064 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 3 rs64407295 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 4 rs78593369 83683 1, 3, 4, 5, 6, 7, 8, 10
#> 5 rs18774630 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 6 rs99681114 49670 3, 6, 7, 9, 10
#> 7 rs8426283 69694 2, 3, 4, 5, 6, 7, 10
#> 8 rs81116968 75972 2, 4, 5, 6, 7, 8, 10
#> 9 rs54871836 55138 1, 3, 5, 9, 10
#> 10 rs21386862 87919 1, 2, 3, 5, 6, 8, 9, 10
#> 11 rs16179951 73195 1, 2, 3, 4, 6, 8, 9
#> 12 rs8492848 74843 1, 3, 4, 5, 7, 9, 10
#> 13 rs81342050 56555 2, 4, 5, 7, 9
#> 14 rs44945811 59794 1, 2, 3, 6, 7, 10
#> 15 rs43997413 94715 1, 2, 3, 4, 6, 7, 8, 9, 10
#> 16 rs97055078 94715 1, 2, 3, 4, 6, 7, 8, 9, 10
#> 17 rs19941161 51106 1, 3, 4, 5, 10
#> 18 rs30748262 94259 1, 2, 3, 4, 5, 6, 7, 9, 10
#> 19 rs10959349 64914 2, 4, 6, 8, 9, 10
#> 20 rs982457 99355 1, 2, 3, 4, 5, 7, 8, 9, 10
#> 21 rs2007022 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 22 rs21833759 100537 1, 2, 4, 5, 6, 7, 8, 9, 10
#> 23 rs74715222 87172 1, 2, 4, 5, 6, 7, 9, 10
#> 24 rs49182929 54248 4, 5, 6, 7, 8
#> 25 rs95501056 64548 1, 2, 3, 5, 6, 8
#> 26 rs57556390 53270 2, 3, 4, 5, 8
#> 27 rs98150573 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 28 rs12123751 49140 1, 3, 4, 7, 10
#> 29 rs97971902 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 30 rs89722202 50848 2, 3, 4, 5, 7
#> 31 rs98543720 65904 1, 4, 6, 7, 8, 9
#> 32 rs31961269 70318 1, 3, 4, 5, 6, 7, 10
#> 33 rs9764985 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 34 rs10685984 46208 2, 3, 4, 6, 7
#> 35 rs34912847 77631 1, 3, 4, 5, 7, 8, 9
#> 36 rs94227949 76514 2, 3, 5, 6, 7, 8, 9
#> 37 rs74751789 67467 1, 2, 5, 6, 9, 10
#> 38 rs47441925 66565 1, 2, 4, 7, 8, 10
#> 39 rs4502074 69920 2, 4, 5, 7, 8, 9
#> 40 rs2741222 57384 1, 4, 5, 8, 10
#> 41 rs11561555 79017 1, 2, 4, 5, 6, 8, 9
#> 42 rs96740802 85900 1, 3, 4, 5, 6, 7, 8, 9
#> 43 rs7081648 96681 1, 2, 3, 4, 5, 6, 8, 9, 10
#> 44 rs68842412 74957 1, 3, 4, 5, 6, 8, 9
#> 45 rs78557028 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 46 rs75435758 84706 3, 4, 5, 6, 7, 8, 9, 10
#> 47 rs58940608 58751 2, 3, 4, 5, 6, 10
#> 48 rs25060056 66639 2, 5, 6, 7, 9, 10
#> 49 rs98833089 99355 1, 2, 3, 4, 5, 7, 8, 9, 10
#> 50 rs70239332 94259 1, 2, 3, 4, 5, 6, 7, 9, 10
查找 N 的所有组合并生成索引字符串所需的时间不到 10 毫秒。它只执行一次,那么效率就取决于 match
函数的效率,这与 R 为这种算法提供的一样好。整个过程在我的普通机器上运行大约 11 毫秒。
由 reprex package (v0.3.0)
于 2020-06-29 创建
对于稍后通过 Google 搜索此内容的任何人,这是我编写的代码(基于 Allan Cameron 的解决方案),它添加了您在处理遗传荟萃分析数据时可能需要的所有字段。 200万行只用了16秒
allcomb <- function(studies,excl_ambig=T) {
n_studies <- nrow(studies)
N <- unlist(sapply(1:n_studies,combn,x=studies$N,sum))
cases <- unlist(sapply(1:n_studies,combn,x=studies$cases,sum))
controls <- unlist(sapply(1:n_studies,combn,x=studies$controls,sum))
contrib_studies <- unlist(sapply(1:n_studies,combn,x=studies$study,paste,collapse=','))
combined <- data.frame(contrib_studies,cases,controls,N,stringsAsFactors=F)
# Flag ambiguous lines where more than one possible combination of studies exists to produce that sum
combined$ambig <- duplicated(combined$N) | duplicated(combined$N,fromLast=T)
if(excl_ambig) {
combined <- combined[!combined$ambiguous,]
}
return(combined)
}
allcombos <- allcomb(studies)
rawresults[,c('contrib_studies','cases','controls','N')] <- allcombos[match(rawresults$N, allcombos$N),]
接受 Allan 的解决方案作为答案,因为这是他的想法,我只是以此为基础!
我正在处理来自研究荟萃分析的数据,其中并非每项研究都有每个遗传变异的数据。我正在尝试解决以下问题:
- 我有一个数据框(“研究”),其中包含有关 10 项研究的信息,其中包括每项研究中的案例和对照的数量以及总研究样本量(N = 案例 + 对照)。
- 我还有一个包含大约 1000 万行的数据框(“原始结果”),每一行都是通过对 table 研究中样本量 (N) 的一些未知子集求和生成的。
- 我需要弄清楚哪些研究相加才能得出每个总数
以下是数据的示例:
> head(studies,2)
study cases controls N
1 A 3747 8024 11771
2 B 5367 5780 11147
> head(rawresults,2)
ID N
1 rs58241367 65280
2 rs85436064 107624
这是我想从中得到的一个例子(可选的 contributing_studies 列,如果这样做允许更好的解决方案,我可以去掉它):
> head(final,2)
ID contributing_studies cases controls N
1 rs10685984 B,C,D,F,G 26221 19987 46208
2 rs12123751 A,C,D,G,J 25631 23509 49140
到目前为止,我解决这个问题的最好办法是暴力破解。对于给定的总数,十项研究中的每一项都有两种可能的状态(有贡献与无贡献),因此有 2^10 = 1024 个可能的总和。有些总和可能不是唯一的(可能有不止一个研究 N 的组合可以创建该总和),我计划将那些不明确的排除在外。我在下面包含了代码和该解决方案的示例作为答案。
我想问的是:R有更好的方法吗?也许存在用于处理此类问题的某些库或函数?或者我可以做些什么来让它更快更高效?
下面是模拟场景数据的代码:
set.seed(1)
# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls
# Make "rawresults"
rawresults <- data.frame(character(length=50),numeric(length=50),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
for(i in seq(1,50)) {
numstudies <- sample(seq(5,10),1)
rawresults[i,'N'] <- sum(sample(studies$N,numstudies))
rawresults[i,'ID'] <- paste0('rs',sample(seq(1,99999999),1))
}
编辑:为场景模拟数据的更快代码,因此可以模拟数百万行原始结果。受以下 Allan Cameron 解决方案的启发,该解决方案也使用 combn.
set.seed(1)
# Make "studies"
studies <- data.frame(toupper(letters[1:10]),round(rnorm(10,5000,2000)),round(rnorm(10,5000,2000)),stringsAsFactors=F)
colnames(studies) <- c('study','cases','controls')
studies$N <- studies$cases + studies$controls
# Make "rawresults"
num_results <- 50 # Number of results to simulate
possible_ns <- unlist(sapply(1:10,combn,x=studies$N,sum))
rawresults <- data.frame(paste0('rs',sample(1:99999999,num_results)),sample(possible_ns,num_results,rep=T),stringsAsFactors=F)
colnames(rawresults) <- c('ID','N')
这是我使用蛮力方法得出的解决方案。
我就是问问题的人,我希望有人能有比我更好的解决方案,例如。 R 中的一些东西意味着可以应用于这个问题的总和的反卷积(我认为这是正确的词?)。
#### Bruce-force generate all possible combinations of studies ####
sumstudies <- function(whichstudies,whichcolumn) {
# Convert integer "whichstudies" to binary, then use the binary digits to decide which studies are included or excluded for this combination
in_or_out <- as.logical(intToBits(whichstudies)[1:10])
# Return appropriate combination of data from included studies (sum if numeric, paste otherwise)
if(is.numeric(studies[,whichcolumn])) {
return(sum(studies[in_or_out,whichcolumn]))
} else {
return(paste(studies[in_or_out,whichcolumn],collapse=','))
}
}
# Create a data frame with all 1024 possible combinations of studies
allcombos <- data.frame(matrix(nrow=1024,ncol=4))
colnames(allcombos) <- c('contributing_studies','cases','controls','N')
allcombos$contributing_studies <- sapply(seq(1,1024),sumstudies,'study')
allcombos$N <- sapply(seq(1,1024),sumstudies,'N')
allcombos$cases <- sapply(seq(1,1024),sumstudies,'cases')
allcombos$controls <- sapply(seq(1,1024),sumstudies,'controls')
# Get rid of Ns that can be made by summing more than one different combination of studies, since we wouldn't know which solution was correct
duplicates <- duplicated(allcombos$N) | duplicated(allcombos$N,fromLast=T)
allcombos[duplicates,] <- NA # Set all affected rows to NA
#### Match the data about all possible combinations to the Ns in rawresults ####
final <- merge(rawresults,allcombos,by='N',all.x=T,all.y=F)
final <- final[,c('ID','contributing_studies','cases','controls','N')]
我认为肯定有一种方法可以通过使用 combn
函数使代码更短一些:
getSum <- function(x) colSums(matrix(studies$N[combn(1:10, x)], nrow = x))
getInd <- function(x) apply(combn(1:10, x), 2, function(y) paste(y, collapse = ", "))
all_sums <- do.call(c, lapply(1:10, getSum))
all_inds <- do.call(c, lapply(1:10, getInd))
rawresults$studies <- all_inds[match(rawresults$N, all_sums)]
这产生:
rawresults
#> ID N studies
#> 1 rs58241367 65280 2, 4, 6, 7, 8, 9
#> 2 rs85436064 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 3 rs64407295 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 4 rs78593369 83683 1, 3, 4, 5, 6, 7, 8, 10
#> 5 rs18774630 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 6 rs99681114 49670 3, 6, 7, 9, 10
#> 7 rs8426283 69694 2, 3, 4, 5, 6, 7, 10
#> 8 rs81116968 75972 2, 4, 5, 6, 7, 8, 10
#> 9 rs54871836 55138 1, 3, 5, 9, 10
#> 10 rs21386862 87919 1, 2, 3, 5, 6, 8, 9, 10
#> 11 rs16179951 73195 1, 2, 3, 4, 6, 8, 9
#> 12 rs8492848 74843 1, 3, 4, 5, 7, 9, 10
#> 13 rs81342050 56555 2, 4, 5, 7, 9
#> 14 rs44945811 59794 1, 2, 3, 6, 7, 10
#> 15 rs43997413 94715 1, 2, 3, 4, 6, 7, 8, 9, 10
#> 16 rs97055078 94715 1, 2, 3, 4, 6, 7, 8, 9, 10
#> 17 rs19941161 51106 1, 3, 4, 5, 10
#> 18 rs30748262 94259 1, 2, 3, 4, 5, 6, 7, 9, 10
#> 19 rs10959349 64914 2, 4, 6, 8, 9, 10
#> 20 rs982457 99355 1, 2, 3, 4, 5, 7, 8, 9, 10
#> 21 rs2007022 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 22 rs21833759 100537 1, 2, 4, 5, 6, 7, 8, 9, 10
#> 23 rs74715222 87172 1, 2, 4, 5, 6, 7, 9, 10
#> 24 rs49182929 54248 4, 5, 6, 7, 8
#> 25 rs95501056 64548 1, 2, 3, 5, 6, 8
#> 26 rs57556390 53270 2, 3, 4, 5, 8
#> 27 rs98150573 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 28 rs12123751 49140 1, 3, 4, 7, 10
#> 29 rs97971902 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 30 rs89722202 50848 2, 3, 4, 5, 7
#> 31 rs98543720 65904 1, 4, 6, 7, 8, 9
#> 32 rs31961269 70318 1, 3, 4, 5, 6, 7, 10
#> 33 rs9764985 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 34 rs10685984 46208 2, 3, 4, 6, 7
#> 35 rs34912847 77631 1, 3, 4, 5, 7, 8, 9
#> 36 rs94227949 76514 2, 3, 5, 6, 7, 8, 9
#> 37 rs74751789 67467 1, 2, 5, 6, 9, 10
#> 38 rs47441925 66565 1, 2, 4, 7, 8, 10
#> 39 rs4502074 69920 2, 4, 5, 7, 8, 9
#> 40 rs2741222 57384 1, 4, 5, 8, 10
#> 41 rs11561555 79017 1, 2, 4, 5, 6, 8, 9
#> 42 rs96740802 85900 1, 3, 4, 5, 6, 7, 8, 9
#> 43 rs7081648 96681 1, 2, 3, 4, 5, 6, 8, 9, 10
#> 44 rs68842412 74957 1, 3, 4, 5, 6, 8, 9
#> 45 rs78557028 107624 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> 46 rs75435758 84706 3, 4, 5, 6, 7, 8, 9, 10
#> 47 rs58940608 58751 2, 3, 4, 5, 6, 10
#> 48 rs25060056 66639 2, 5, 6, 7, 9, 10
#> 49 rs98833089 99355 1, 2, 3, 4, 5, 7, 8, 9, 10
#> 50 rs70239332 94259 1, 2, 3, 4, 5, 6, 7, 9, 10
查找 N 的所有组合并生成索引字符串所需的时间不到 10 毫秒。它只执行一次,那么效率就取决于 match
函数的效率,这与 R 为这种算法提供的一样好。整个过程在我的普通机器上运行大约 11 毫秒。
由 reprex package (v0.3.0)
于 2020-06-29 创建对于稍后通过 Google 搜索此内容的任何人,这是我编写的代码(基于 Allan Cameron 的解决方案),它添加了您在处理遗传荟萃分析数据时可能需要的所有字段。 200万行只用了16秒
allcomb <- function(studies,excl_ambig=T) {
n_studies <- nrow(studies)
N <- unlist(sapply(1:n_studies,combn,x=studies$N,sum))
cases <- unlist(sapply(1:n_studies,combn,x=studies$cases,sum))
controls <- unlist(sapply(1:n_studies,combn,x=studies$controls,sum))
contrib_studies <- unlist(sapply(1:n_studies,combn,x=studies$study,paste,collapse=','))
combined <- data.frame(contrib_studies,cases,controls,N,stringsAsFactors=F)
# Flag ambiguous lines where more than one possible combination of studies exists to produce that sum
combined$ambig <- duplicated(combined$N) | duplicated(combined$N,fromLast=T)
if(excl_ambig) {
combined <- combined[!combined$ambiguous,]
}
return(combined)
}
allcombos <- allcomb(studies)
rawresults[,c('contrib_studies','cases','controls','N')] <- allcombos[match(rawresults$N, allcombos$N),]
接受 Allan 的解决方案作为答案,因为这是他的想法,我只是以此为基础!