Find the sequence using R
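How can I write a function that accepts a DNA sequence (as a single string) and a number "n >= 2", and returns a vector with all DNA subsequences (as strings) that start with the triplet "AAA" or "GAA", end with the triplet "AGT", and have at least 2 and at most "n" other triplets between the start and the end?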
Q1:
For "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT" and n = 2,
the answer is c("GAACCCACTAGT", "AAATTTGGGAGT").
Q2:
For the same sequence and n = 10,
the answer is c("GAACCCACTAGTATAAAATTTGGGAGT", "AAACCCTTTGGGAGT").
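Here is one possible approach.
At its core is a regular expression that matches between 2 and n repetitions of a group of three [A-Z] characters, placed between the start and end triplets.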
library( stringr )
#sample data
dna <- c("GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT")
#set constants
start <- c("AAA", "GAA")
end <- "AGT"
n <- 10 # << set as desired
#build regex
regex <- paste0( "(", paste0( start, collapse = "|" ), ")", paste0( "([A-Z]{3}){2,", n, "}" ), end )
#for n = 10, this looks like: "(AAA|GAA)([A-Z]{3}){2,10}AGT"
stringr::str_extract_all( dna, regex )
# n = 2
# [[1]]
# [1] "GAACCCACTAGT" "AAATTTGGGAGT"
# n = 10
# [[1]]
# [1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
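Since the question asks for a function, the same regex can be wrapped in one. The following is only a minimal sketch: the name find_subsequences and the default arguments start and end are assumptions made for illustration, not part of the original answer.

```r
library(stringr)

# Hypothetical helper: the function name and the defaults are illustrative.
find_subsequences <- function(dna, n, start = c("AAA", "GAA"), end = "AGT") {
  # expect a single string and n >= 2, as stated in the question
  stopifnot(is.character(dna), length(dna) == 1, n >= 2)
  # build the regex, e.g. "(AAA|GAA)([A-Z]{3}){2,10}AGT" for n = 10;
  # as.integer() avoids scientific notation when n is pasted into the pattern
  regex <- paste0("(", paste0(start, collapse = "|"), ")",
                  "([A-Z]{3}){2,", as.integer(n), "}", end)
  # str_extract_all() returns a list with one element per input string
  str_extract_all(dna, regex)[[1]]
}

dna <- "GAACCCACTAGTATAAAATTTGGGAGTCCCAAACCCTTTGGGAGT"
find_subsequences(dna, 2)
# [1] "GAACCCACTAGT" "AAATTTGGGAGT"
find_subsequences(dna, 10)
# [1] "GAACCCACTAGTATAAAATTTGGGAGT" "AAACCCTTTGGGAGT"
```

Two things to keep in mind: str_extract_all() returns non-overlapping matches found from left to right, and the pattern does not force a match to begin on a triplet boundary of the original sequence.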