How to surround multiple characters in a string with brackets {} in R?
I have a dataset containing genetic information.
df <- structure(list(GenBank.Accession.version = structure(1:2, .Label = c("JH739893",
"JH751134"), class = "factor"), set = c(17L, 116L), snp.po.200.low = c(5480045,
-102), snp.po.200.up = c(5480464, 340), SNP.position = list(c(5480245L,
5480263L), c(98L, 139L)), seq2 = c("TTACATGGCAAGCACTCAATCTGGCTGCAGGGTGTCTGGCCACATACAAAACAAATGCCAAGTCACCTCTTGTCCCAAGGATCAAGACAAATTTGGACAACAAACCACACTGGCAGCCCCCTAGAAGCTTTCAGATATTTTAATGCCATTGAGATGTAGCATCCAGTGTAGACATTATTAGAAGCACAGCAGTTGCACTCGCACCTCCAGGGTGTCCAACATATGCTGGATTCTGGCATTGCTCATGGCAAGTGAGTTGGTGAATTCACAACTAGCCAGGTCATGTCTTCATTGCAGCAGAAAACTCATCAGCATGTCAGGATGAGAAAAGTCAATACAAAGGAAATGTGGGGATGGGATGGGATGGGATGGGATGGGATGGGATGGGATGGGATGGGATGGGATGGGATAGGGGGGTAA",
"AAAAAAAAAAAAAGAAAAGGGAATTTAAGGAGTCCCAGAGACAGGAGAATTCAGGACAATTTGCACCAATCACTTGCTCCTGGAAAGGAAGGTTGGGCTGATTTGGGGTTGGTAAGCACAGACCTTTCATCCGTTCGTAGAAAGAAGGAAAATTAAATCTCATGGCCTGTTTGTGAAAGGAAATTGCCCAGAATAGCTCTGACAGAATAAGCTATTCCACAATAGCTCCCCATGCGGACACTCCAGCCACTTTGTTCCAGGCTAATTAGTGTGCTTCCAAGCGCAGTAATTATCCTGGAAGGGAAATCTCTCCTCTCCCACAAAGAGTGTTTGCATGGAG"
), seq.length = c(16983252L, 753L), pos.list = list(5480045:5480464,
1:340), SNP.pos.in.subset = list(c(201L, 219L), c(98L, 139L
))), .Names = c("GenBank.Accession.version", "set", "snp.po.200.low",
"snp.po.200.up", "SNP.position", "seq2", "seq.length", "pos.list",
"SNP.pos.in.subset"), row.names = c(17L, 116L), class = "data.frame")
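The data is shown above as dput() output.
As you can see, there are 2 rows, and each row holds a gene sequence (a character string) that I want to modify. Each string was extracted from a longer DNA sequence (the first sequence originally had 16983252 characters).
SNP.position gives the positions of the characters in the original string. SNP.pos.in.subset gives the same positions relative to the subset (as if I started counting from 1 at the start of the subset). So for the first sequence, 5480245 and 5480263 correspond to positions 201 and 219 in the subset sequence.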
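The mapping between the two coordinate systems can be sketched as follows. This assumes the subset starts at snp.po.200.low and that a negative start is clamped to 1, which is what the second row of the data suggests; the variable names below are illustrative only.

```r
# Values copied from the SNP.position and snp.po.200.low columns of the data.
snp_position <- list(c(5480245L, 5480263L), c(98L, 139L))
window_low   <- c(5480045, -102)

# Position inside the subset = position in full sequence - window start + 1.
# Assumption: a window start below 1 means the subset begins at position 1.
pos_in_subset <- Map(function(pos, low) pos - max(low, 1) + 1,
                     snp_position, window_low)
pos_in_subset
```

For the data shown, this reproduces the values stored in SNP.pos.in.subset (201 and 219 for the first row, 98 and 139 for the second).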
I would like to put braces around positions 201 and 219 so that the characters at those positions are easy to locate.
I wrote a script to do this.
add.target.snp = function(sequences,
                          pos.start = 200,
                          pos.end.added = 3,
                          character.start = "{/",
                          character.end = "}") {
  old = as.character(sequences)
  for(i in 1:length(old)){
    up.else = SNP.position[i]+pos.end.added
    old[i] = gsub(paste0('^(.{',pos.start,'})(.*)$'), paste0('\\1',character.start,'\\2'), old[i])
    old[i] = gsub(paste0('^(.{',up.else, '})(.*)$'), paste0('\\1',character.end,'\\2'), old[i])
  }
  return(old)
}
output.target = add.target.snp(sequences = df$seq2,
                               pos.start = df$SNP.pos.in.subset,
                               pos.end.added = 3,
                               character.start = "{/",
                               character.end = "}")
But this script returns the following error:
Error in gsub(paste0("^(.{", pos.start, "})(.*)$"), paste0("\\1", character.start, :
  invalid regular expression '^(.{c(201, 219)})(.*)$', reason 'Invalid contents of {}'
In addition: Warning message:
In gsub(paste0("^(.{", pos.start, "})(.*)$"), paste0("\\1", character.start, :
  argument 'pattern' has length > 1 and only the first element will be used
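Is there a way to run my script with multiple values per sequence, so that the characters are wrapped as "{/my_value_at_position_201}" and "{/my_value_at_position_219}"?
The final result (for the second row of the data shown above) should be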
AAAAAAAAAAAAAGAAAAGGGAATTTAAGGAGTCCCAGAGACAGGAGAATTCAGGACAATTTGCACCAATCACTTGCTCCTGGAAAGGAAGGTTGGGC{/T}GATTTGGGGTTGGTAAGCACAGACCTTTCATCCGTTCGTA{/G}AAAGAAGGAAAATTAAATCTCATGGCCTGTTTGTGAAAGGAAATTGCCCAGAATAGCTCTGACAGAATAAGCTATTCCACAATAGCTCCCCATGCGGACACTCCAGCCACTTTGTTCCAGGCTAATTAGTGTGCTTCCAAGCGCAGTAATTATCCTGG
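Another problem with my script is that adding characters to the string (3 characters in my case: "{/}") shifts the second position (201, 219 + 3)... Is there a way to add all the braces in one go so that the positions do not change?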
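Regular expressions are the wrong tool here. What you want is substring replacement. Base R's substr does not let you replace a zero-length substring (that is, insert text), but something like this should work: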
library(stringi)
library(purrr)
add_bits <- function(sequences,
                     pos.start = 200,
                     pos.end.added = 3,
                     character.start = "{/",
                     character.end = "}"
) {
  # this line allows for the fact that your string is growing.
  pos.start <- pos.start + c(0, cumsum(rep(nchar(character.start) +
                 nchar(character.end), length(pos.start) - 1)))
  for (ps in pos.start) {
    stringi::stri_sub(sequences, ps, length = 0) <- character.start
    stringi::stri_sub(sequences, ps + pos.end.added, length = 0) <- character.end
  }
  sequences
}
tmp <- c("abcdefg", "1234567")
purrr::map2(tmp, list(c(2,5), 3), add_bits)
## [[1]]
## [1] "a{/b}cd{/e}fg"
##
## [[2]]
## [1] "12{/3}4567"
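The offset correction in add_bits is needed because every insertion lengthens the string. A minimal sketch of one way to avoid that bookkeeping, assuming the positions are processed from the largest to the smallest so that the positions still to be handled never shift. The helper name wrap_at and its arguments are hypothetical, and only base R is used.

```r
wrap_at <- function(s, pos, open = "{/", close = "}") {
  # Work from the rightmost position to the leftmost one: text inserted
  # to the right of a position does not change that position.
  for (p in sort(unique(pos), decreasing = TRUE)) {
    if (p < 1 || p > nchar(s)) next  # skip positions outside the string
    s <- paste0(substr(s, 1, p - 1),         # everything before the target
                open, substr(s, p, p), close, # the wrapped target character
                substr(s, p + 1, nchar(s)))   # everything after the target
  }
  s
}

wrap_at("abcdefg", c(2, 5))
wrap_at("1234567", 3)
```

On the two test strings this should give the same result as add_bits, without having to know the length of the opening and closing markers in advance.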
Here is my attempt using base R:
add.target.snp = function(sequences, pos.start = NA,
                          character.start = "{/", character.end = "}"){
  # check input: keep only positions that fall inside the sequence
  pos.start <- sort(pos.start[ pos.start <= nchar(sequences)])
  # split on SNP positions
  snps <- substring(
    sequences, c(1, pos.start), c(pos.start - 1, nchar(sequences)))
  # exclude "" SNP strings
  snps <- snps[ snps != "" ]
  # take 1st char and wrap, then paste the rest as is
  x0 <- ""
  if(!1 %in% pos.start){
    x0 <- snps[1]
    snps <- snps[2:length(snps)]}
  res <- sapply(snps, function(snp){
    x1 <- substr(snp, 1, 1)
    x2 <- substr(snp, 2, max(2, nchar(snp)))
    paste0(paste0(character.start, x1, character.end), x2)})
  # return
  paste(c(x0, res), collapse = "")
}
tmp <- c("abcde", "123456789")
purrr::map2(tmp, list(c(2,5), 3), add.target.snp)
# [[1]]
# [1] "a{/b}cd{/e}"
#
# [[2]]
# [1] "12{/3}456789"
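For the second part of the question (adding all the braces at once so the positions never shift), another possible sketch is to split the sequence into single characters, wrap the selected ones, and paste everything back together. The helper wrap_chars and the small data frame toy are hypothetical stand-ins; toy only mimics the seq2 and SNP.pos.in.subset columns of the real data so that the block runs on its own.

```r
wrap_chars <- function(s, pos, open = "{/", close = "}") {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]    # one element per character
  pos <- pos[pos >= 1 & pos <= length(chars)]    # drop out-of-range positions
  chars[pos] <- paste0(open, chars[pos], close)  # wrap all targets in one step
  paste(chars, collapse = "")
}

# Hypothetical stand-in for the question's data frame, with a list column
# of positions just like SNP.pos.in.subset.
toy <- data.frame(seq2 = c("abcdefg", "1234567"), stringsAsFactors = FALSE)
toy$SNP.pos.in.subset <- list(c(2L, 5L), 3L)

# Walk the sequence column and the position list in parallel.
toy$marked <- mapply(wrap_chars, toy$seq2, toy$SNP.pos.in.subset,
                     USE.NAMES = FALSE)
toy$marked
```

Because every character keeps its own slot in the vector until the final paste, the original positions stay valid no matter how many characters are wrapped. With the real data, the same mapply call would be applied to df$seq2 and df$SNP.pos.in.subset.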