Get all possible permutations of a DNA sequence with an ambiguous base (in R)
Suppose I have a DNA sequence with an ambiguous base, N, where N can stand for any base (it is a flexible position).
dna.seq <- 'ATGCN'
I would like a vector containing every DNA sequence it could represent. It would look like:
c('ATGCA','ATGCT','ATGCG','ATGCC')
The solution also needs to handle DNA sequences with multiple N characters, which produce more potential DNA sequences.
CJ from data.table can help you here:
library(data.table)
dna.seq <- 'ATGCN'
# split into components
l = tstrsplit(dna.seq, '', fixed = TRUE)
# replace N with all possibilities
all_bases = c('A', 'T', 'C', 'G')
l = lapply(l, function(x) if (x == 'N') all_bases else x)
# use CJ and reduce to strings:
Reduce(paste0, do.call(CJ, l))
# [1] "ATGCA" "ATGCC" "ATGCG" "ATGCT"
The same approach handles multiple N positions:
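dna.seq <- 'ATNCN'
# re-split and re-expand the new sequence before combining
l = tstrsplit(dna.seq, '', fixed = TRUE)
l = lapply(l, function(x) if (x == 'N') all_bases else x)
Reduce(paste0, do.call(CJ, l))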
# [1] "ATACA" "ATACC" "ATACG" "ATACT" "ATCCA" "ATCCC" "ATCCG" "ATCCT"
# [9] "ATGCA" "ATGCC" "ATGCG" "ATGCT" "ATTCA" "ATTCC" "ATTCG" "ATTCT"
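As a quick sanity check on the output above: each N multiplies the number of candidate sequences by four, so a sequence with k N positions should expand to 4^k strings. A minimal sketch follows; the helper names `n_count` and `expected_n_sequences` are hypothetical and not part of data.table or base R.

```r
# Count how many 'N' characters a sequence contains (0 when there are none).
n_count <- function(dna_seq) {
  lengths(regmatches(dna_seq, gregexpr("N", dna_seq, fixed = TRUE)))
}

# Each N can be A, C, G or T, so the expansion has 4^k members.
expected_n_sequences <- function(dna_seq) 4^n_count(dna_seq)

expected_n_sequences("ATNCN")
# [1] 16
```

Comparing this number with `length()` of the expanded vector is a cheap way to catch a stale `l` left over from a previous sequence.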
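If you want to drop the data.table dependency, you can replace tstrsplit with strsplit (taking the first element of the list it returns) and CJ with expand.grid; you would only sacrifice computational speed.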
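A minimal base R sketch of that substitution, under the assumption that only N needs expanding and that the four standard bases are the only candidates. The function name `expand_ambiguous` is hypothetical, chosen here for illustration.

```r
# Expand every 'N' in a DNA string into A, C, G and T using base R only.
expand_ambiguous <- function(dna_seq, bases = c("A", "C", "G", "T")) {
  # split the sequence into single characters
  chars <- strsplit(dna_seq, "", fixed = TRUE)[[1]]
  # each position becomes the vector of bases it may take
  choices <- lapply(chars, function(x) if (x == "N") bases else x)
  # one row per combination; keep strings rather than factors
  grid <- expand.grid(choices, stringsAsFactors = FALSE)
  # paste the columns together row by row
  do.call(paste0, grid)
}

expand_ambiguous("ATGCN")
# [1] "ATGCA" "ATGCC" "ATGCG" "ATGCT"
```

Unlike CJ, expand.grid does not sort its rows: the first varying position changes fastest, so with several N positions the same set of sequences comes back in a different order. Wrap the result in `sort()` if the order matters.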
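# alternative without any packages: repeatedly substitute the first remaining N
dna.seq <- 'ATNGCN'
dna.seq.copy = dna.seq
# every element holds the same number of Ns, so checking the first one is enough
while(grepl("N", dna.seq.copy[1])){
  # sub() replaces only the first N; each pass multiplies the vector length by 4
  dna.seq.copy = as.vector(sapply(c("A", "C", "T", "G"), function(x) sub("N", x, dna.seq.copy)))
}
dna.seq.copy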
# [1] "ATAGCA" "ATCGCA" "ATTGCA" "ATGGCA" "ATAGCC" "ATCGCC" "ATTGCC" "ATGGCC" "ATAGCT" "ATCGCT" "ATTGCT" "ATGGCT" "ATAGCG" "ATCGCG" "ATTGCG"
#[16] "ATGGCG"