How to use str_split with regex in R?
I have this string:
235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things
I want to split the string at each 6-digit number, i.e. I want this:
235072,testing,some2wg2f4,wf484-things
224072,and,other25wg4,14-thingies
223552,testing,some/2wr24,14084-things
How can I do this with a regex? The following does not work (using the stringr package):
> blahblah <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
> test <- str_split(blahblah, "([0-9]{6}.*)")
> test
[[1]]
[1] "" ""
What am I missing?
Here is a base R approach using a positive lookahead and a lookbehind, thanks to @thelatemail for the correction:
strsplit(blahblah, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
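To see why the original attempt returned only empty strings, it may help to check what the pattern actually matches. The sketch below uses only base R and assumes the same input string as the question: `[0-9]{6}.*` matches from the first 6-digit run to the end of the string, so the whole input is consumed as a single delimiter. The lookaround pattern matches a position rather than any characters, so nothing is consumed.

```r
x <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

# The greedy .* swallows everything after the first 6 digits,
# so the "delimiter" is the entire string and only empty pieces remain.
delimiter <- regmatches(x, regexpr("[0-9]{6}.*", x))
identical(delimiter, x)

# A zero-width split point: preceded by any character (so not at the very
# start) and followed by 6 digits. No characters are removed.
parts <- strsplit(x, "(?<=.)(?=[0-9]{6})", perl = TRUE)[[1]]
length(parts)
```

The `(?<=.)` part is what keeps the split point away from the start of the string.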
Another approach uses str_extract_all. Note that I've used .*? for non-greedy matching; otherwise .* would expand to grab everything:
> str_extract_all(blahblah, "[0-9]{6}.*?(?=[0-9]{6}|$)")[[1]]
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
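The difference between the greedy and the non-greedy quantifier can be checked side by side. This is a minimal sketch that swaps in base R's `gregexpr()` and `regmatches()` with `perl = TRUE` instead of stringr; the variable names are illustrative only.

```r
x <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"

# Non-greedy: stop as soon as the lookahead sees the next 6 digits (or the end).
lazy <- regmatches(x, gregexpr("[0-9]{6}.*?(?=[0-9]{6}|$)", x, perl = TRUE))[[1]]

# Greedy: .* runs to the end of the string, where the $ alternative succeeds.
greedy <- regmatches(x, gregexpr("[0-9]{6}.*(?=[0-9]{6}|$)", x, perl = TRUE))[[1]]

length(lazy)    # one element per record
length(greedy)  # a single element holding the whole string
```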
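An easy-to-understand approach is to insert a marker before each match and then split at those markers. The advantage is that you only need to look for the 6-digit sequences themselves, without relying on any other feature of the surrounding text, which may change as you add new and unvetted data.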
library(stringr)
library(magrittr)
str <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
# Backslashes must be doubled inside R string literals: "\\d" and "\\1"
out <-
  str_replace_all(str, "(\\d{6})", "#SPLIT_HERE#\\1") %>%
  str_split("#SPLIT_HERE#") %>%
  unlist
out
[1] "" "235072,testing,some252f4,14084-things"
[3] "224072,and,other2524,14084-thingies" "223552,testing,some/2wr24,14084-things"
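If a match occurs at the beginning or end of the string, str_split() inserts an empty string into the result vector to indicate this (as shown above). If you don't need that information, you can easily remove it with out[nchar(out) != 0]: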
[1] "235072,testing,some252f4,14084-things" "224072,and,other2524,14084-thingies"
[3] "223552,testing,some/2wr24,14084-things"
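The marker idea can also be wrapped in a small function. The sketch below is a base R variant (`gsub()` and `strsplit()` instead of stringr); the name `split_at_marker` and its arguments are hypothetical, and it assumes a single input string in which the marker text does not already occur.

```r
# Hypothetical helper: mark each match, split on the marker, drop empty pieces.
split_at_marker <- function(x, pattern = "([0-9]{6})", marker = "#SPLIT_HERE#") {
  # Guard: the marker must not already be present in the data.
  stopifnot(!grepl(marker, x, fixed = TRUE))
  marked <- gsub(pattern, paste0(marker, "\\1"), x)
  pieces <- strsplit(marked, marker, fixed = TRUE)[[1]]
  pieces[nzchar(pieces)]
}

x <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
pieces <- split_at_marker(x)
pieces
```

Splitting with `fixed = TRUE` treats the marker as literal text, so it is safe even if the marker contains regex metacharacters. Note that a run of 12 or more digits would be marked more than once with this pattern.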
With a less complicated regex, you could do the following:
s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
l <- str_locate_all(string = s, "[0-9]{6}")
str_sub(string = s, start = as.data.frame(l)$start,
end = c(tail(as.data.frame(l)$start, -1) - 1, nchar(s)) )
# [1] "235072,testing,some252f4,14084-things"
# [2] "224072,and,other2524,14084-thingies"
# [3] "223552,testing,some/2wr24,14084-things"
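The same locate-then-cut idea can be sketched without stringr. This version assumes base R's `gregexpr()` for the start positions and `substring()` for the cuts; the function name `split_at_starts` is made up for illustration. As with the code above, any text before the first match is dropped.

```r
# Hypothetical helper: cut the string at the start position of every match.
split_at_starts <- function(s, pattern = "[0-9]{6}") {
  starts <- as.integer(gregexpr(pattern, s)[[1]])
  if (starts[1] == -1L) return(s)  # gregexpr() reports -1 when nothing matches
  # Each piece ends one character before the next start; the last ends with s.
  ends <- c(starts[-1] - 1L, nchar(s))
  substring(s, starts, ends)
}

s <- "235072,testing,some252f4,14084-things224072,and,other2524,14084-thingies223552,testing,some/2wr24,14084-things"
res <- split_at_starts(s)
res
```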