R str_extract everything before and after ellipsis

Question

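I am trying to find a way to split a character column that has an ellipsis in the middle into two columns: everything before the ellipsis and everything after it.

For example, if I have: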

a <- "60.4 (b)(33) and (e)(1) revised....................................46111"

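How can I split it into "60.4 (b)(33) and (e)(1) revised" and "46111"?

I tried:

str_extract(a, ".*\\.{2,}")

for the first part, and for the second part:

str_extract(a, "\\.{2,}.*")

but both of these keep the ellipsis, which I want to remove.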

Answer 1

It seems you want to split, not extract, using a pattern that matches two or more consecutive dots:

a <- "60.4 (b)(33) and (e)(1) revised....................................46111"
unlist(stringr::str_split(a, "\\.{2,}"))
## => [1] "60.4 (b)(33) and (e)(1) revised" "46111"

## Base R strsplit:
unlist(strsplit(a, "\\.{2,}"))
## => [1] "60.4 (b)(33) and (e)(1) revised" "46111"
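Since the question is about a column rather than a single string, the same split can be sketched on a data frame. This is only an illustration: the data frame `df`, its second row, and the column names `entry`, `text`, and `number` are made up for the example. It uses `stringr::str_split_fixed()`, which returns a character matrix with a fixed number of columns.

```r
library(stringr)

# Hypothetical example data: a column of entries with dot leaders
df <- data.frame(
  entry = c(
    "60.4 (b)(33) and (e)(1) revised....................................46111",
    "60.17 amended..........46200"   # made-up second row for illustration
  ),
  stringsAsFactors = FALSE
)

# Split each entry into exactly two pieces at the first run of 2+ dots
parts <- str_split_fixed(df$entry, "\\.{2,}", n = 2)

# Column 1 holds the text before the dots, column 2 the text after them
df$text   <- parts[, 1]
df$number <- parts[, 2]

df[, c("text", "number")]
```

With `n = 2`, only the first run of dots is used as a separator, so any later run stays inside the second piece.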

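Here is another possible splitting regex: you can match one or more dots that are followed by one or more digits at the end of the string:

unlist(stringr::str_split(a, "\\.+(?=\\d+$)"))
unlist(strsplit(a, "\\.+(?=\\d+$)", perl=TRUE))

Both yield the same [1] "60.4 (b)(33) and (e)(1) revised" "46111" output. Here, \.+ matches one or more dots, and (?=\d+$) is a positive lookahead that matches a location immediately followed by one or more digits (\d+) and then the end of the string ($). Inside an R string literal, each backslash is doubled.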
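The difference between the two splitting patterns only shows up when the text part itself contains a run of dots. A minimal base R sketch, using a made-up string `b` that is not from the question:

```r
# Hypothetical entry whose text part itself contains a run of dots
b <- "see 60.4...revised........46111"

# Splitting on any run of 2+ dots breaks the text part as well
loose <- strsplit(b, "\\.{2,}")[[1]]
loose
# => [1] "see 60.4" "revised"  "46111"

# Requiring digits up to the end of the string after the dots
# splits only at the final run of dots
strict <- strsplit(b, "\\.+(?=\\d+$)", perl = TRUE)[[1]]
strict
# => [1] "see 60.4...revised" "46111"
```

The lookahead version is the safer choice if the trailing number is always the last thing in the string.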

Another approach is matching with str_match (to capture the bits you need):

res <- stringr::str_match(a, "^(.*?)\\.+(\\d+)$")
res[,-1]
# => [1] "60.4 (b)(33) and (e)(1) revised" "46111"

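Here,

^ - matches the start of the string
(.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
\.+ - one or more dots
(\d+) - Group 2: one or more digits
$ - end of the string.

res[,-1] is needed to drop the first column, which holds the full match.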
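When the same pattern is applied to a whole vector, str_match returns one row per input string, and rows that do not match come back as NA. A small sketch under that assumption; the second element of `x` and the column names `text` and `number` are invented for illustration:

```r
library(stringr)

x <- c(
  "60.4 (b)(33) and (e)(1) revised....................................46111",
  "Subpart A heading with no number"   # hypothetical non-matching entry
)

res <- str_match(x, "^(.*?)\\.+(\\d+)$")

# Column 1 is the full match; columns 2 and 3 are the two capture groups
out <- data.frame(
  text   = res[, 2],
  number = res[, 3],
  stringsAsFactors = FALSE
)

out$text
# => [1] "60.4 (b)(33) and (e)(1) revised" NA
out$number
# => [1] "46111" NA
```

Unlike a plain split, this makes entries that lack the expected dots-plus-number ending easy to spot, since both columns are NA for them.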

