Efficiently break up a string based on the nth occurrence of a substring using R
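Introduction
Given a string in R, is it possible to obtain a vectorized solution (i.e. without loops) that breaks the string into chunks, where each chunk is delimited by the nth occurrence of a substring in the string?
Work done with a reproducible example
Suppose we have several paragraphs of the famous Lorem Ipsum text.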
library(strex)
# devtools::install_github("aakosm/lipsum")
library(lipsum)
my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")
> my.string # (partial output)
# [1] "Lorem ipsum dolor ... id est laborum. "
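We want to break this text into segments at every 3rd occurrence of the word " in" (the leading space is included to distinguish it from words that merely contain "in", such as "min").
I have the following solution with a while loop: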
# We wish to break up the string at every
# 3rd occurrence of the word "in"
break.character = " in"
break.occurrence = 3
string.list = list()
i = 1
# Initialize the string to send into the loop
current.string = my.string
while(length(current.string) > 0){
  # Store the segment that occurs BEFORE the nth occurrence of the substring of interest
  string.list[[i]] = str_before_nth(current.string, break.character, break.occurrence)
  # Update the next string to examine:
  # it is the current string AFTER the nth occurrence of the substring of interest
  current.string = str_after_nth(current.string, break.character, break.occurrence)
  i = i + 1
}
We get the desired output in a list, along with warnings (warnings not shown):
> string.list (#partial output shown)
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit"
[[2]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
...
[[6]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
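Goal
Is it possible to improve this solution through vectorization (i.e. using apply(), lapply(), mapply(), etc.)? Also, my current solution cuts off the last occurrence of the substring in each chunk.
The current solution may not scale well to extremely long strings (e.g. DNA sequences, where we are looking for chunks delimited by the nth occurrence of a nucleotide substring).
Try this: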
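# Split on the literal delimiter; l pieces means l - 1 occurrences of " in ".
text_split = strsplit(my.string, " in ", fixed = TRUE)[[1]]
l = length(text_split)
# Number of complete groups that end at a 3rd occurrence.
n = floor((l - 1) / 3)
# First piece of each group: 1, 4, 7, ... (step 3 so that groups do not overlap).
Seq = seq(1, by = 3, length.out = n)
L = sapply(Seq, function(x){
  paste0(paste(text_split[x:(x + 2)], collapse = " in "), " in ")
})
# Append whatever follows the last complete group.
if (l > (n * 3)){
  L = c(L, paste(text_split[(n * 3 + 1):l], collapse = " in "))
}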
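The final condition appends whatever text is left over after the last complete group of three, for example when the number of in occurrences is not divisible by 3. Also, the trailing in pasted inside sapply() is there because I don't know what you want to do with the in that separates the chunks.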
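The split-and-regroup idea can be checked on a toy string before running it on the full text. The following is a minimal sketch in base R, not the answer's exact code: the names x, pieces, group_id and chunks are made up for illustration, and this variant drops the delimiter at each break instead of pasting it back.

```r
# Toy input: five occurrences of " in ", so six pieces after splitting.
x <- "a in b in c in d in e in f"
n <- 3  # break at every 3rd occurrence

# Split on the literal delimiter (fixed = TRUE avoids regex interpretation).
pieces <- strsplit(x, " in ", fixed = TRUE)[[1]]

# Give each piece a group id: pieces 1..n -> 0, pieces n+1..2n -> 1, and so on.
group_id <- (seq_along(pieces) - 1) %/% n

# Re-join the pieces within each group; the delimiter between groups is dropped.
chunks <- unname(vapply(split(pieces, group_id), paste, character(1), collapse = " in "))
chunks
# [1] "a in b in c" "d in e in f"
```

Grouping with integer division avoids computing start indices by hand, so an incomplete final group needs no special case.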
Let me know if this works. I'll try to make it faster. It keeps the third in inside the chunk. If it works, I'll also add more comments.
library(lipsum)
library(stringi)
my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")
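# End position of every " in " match (column 2 of the location matrix).
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
# Every 3rd match marks a boundary; the next chunk starts at that match's trailing space.
start_of_strings <- c(1, end_of_in[c(FALSE, FALSE, TRUE)])
# Each chunk ends one character earlier (the "n" of the 3rd " in "); the last chunk ends with the string.
end_of_strings <- c(end_of_in[c(FALSE, FALSE, TRUE)] - 1, nchar(my.string))
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
stri_sub(my.string, start_of_strings, end_of_strings)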
Edit: actually, use stri_sub from stringi. It will scale better than substring. See:
my.string <- paste(rep(my.string, 10000), collapse = " ")
nchar(my.string)
[1] 22349999
microbenchmark::microbenchmark(
sol1 = {
text_split=strsplit(my.string," in ")[[1]]
l=length(text_split)
n = floor(l/3)
Seq = seq(1,by=2,length.out = n)
L= list()
L=sapply(Seq, function(x){
paste0(paste(text_split[x:(x+2)],collapse=" in ")," in ")
})
if (l>(n*3)){
L = c(L,paste(text_split[(n*3+1):l],collapse=" in "))
}
},
sol2 = {
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
start_of_strings <- c(1, end_of_in[c(F, F, T)])
end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
stri_sub(my.string, start_of_strings, end_of_strings)
},
times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval
sol1 914.1268 927.45958 941.36117 939.80361 950.18099 980.86941 10
sol2 55.4163 56.40759 58.53444 56.86043 57.03707 71.02974 10
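For readers without stringi, the same locate-then-cut idea can be sketched in base R with gregexpr() and substring(). This swaps in base functions for stri_locate_all and stri_sub, so it is not the code that was benchmarked above. The helper name split_after_nth and its arguments are hypothetical, and it assumes a single input string and non-overlapping literal matches.

```r
# Hypothetical helper (not from the answers above): split `x` after every
# n-th occurrence of the literal substring `pattern`, keeping that substring
# at the end of the chunk it closes.
split_after_nth <- function(x, pattern, n) {
  stopifnot(is.character(x), length(x) == 1L, n >= 1)
  hits <- gregexpr(pattern, x, fixed = TRUE)[[1]]
  if (hits[1] == -1L) return(x)                 # no occurrence at all
  # Position of the last character of each (non-overlapping) match.
  ends <- as.integer(hits) + attr(hits, "match.length") - 1L
  cuts <- ends[seq_along(ends) %% n == 0L]      # keep every n-th match only
  cuts <- cuts[cuts < nchar(x)]                 # a cut at the very end adds nothing
  starts <- c(1L, cuts + 1L)
  stops <- c(cuts, nchar(x))
  substring(x, starts, stops)                   # vectorized over starts/stops
}

x <- "a in b in c in d in e in f"
chunks <- split_after_nth(x, " in", 3)
chunks
# [1] "a in b in c in" " d in e in f"
```

Because nothing is dropped, paste(chunks, collapse = "") reproduces the input, which is a quick sanity check on long inputs such as DNA sequences.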