Regular Expression For Consecutive Duplicate Bigrams

Question

My question is a direct extension of an earlier question about detecting consecutive duplicate words (unigrams) in a string.

In the previous question, the duplicate in

Not that that is related

could be detected with this regex: \b(\w+)\s+\1\b

Here, I want to detect consecutive duplicate bigrams (pairs of words):

are blue and then and then very bright

Ideally, I would also like to know how to replace each detected duplicate with a single occurrence, so that I end up with:

are blue and then very bright

(For this application, in case it matters, I am using gsub in R.)

Answer 1

Try the following regex:

(\b.+?\b)\1\b

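The regex matches a word boundary, then some text, then another word boundary, and captures that text in group 1. The \1 backreference then requires the same text to appear again immediately afterwards. The final \b checks for a word boundary at the end, which prevents partial repeats such as "a and" or "z zoo" from being matched.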

For the replacement, use \1. It holds the text from the 1st capture group (the first copy of the bigram), and that first copy replaces the whole match.
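
As a minimal sketch of the pattern and replacement described above, assuming R with perl = TRUE (the question mentions gsub), the call could look like the following. The helper name collapse_repeats and the second test string are made up for illustration.

```r
# Sketch only: `collapse_repeats` is a hypothetical helper, not a base R function.
collapse_repeats <- function(x) {
  # Backslashes are doubled inside R string literals; perl = TRUE selects PCRE.
  gsub("(\\b.+?\\b)\\1\\b", "\\1", x, perl = TRUE)
}

collapse_repeats("are blue and then and then very bright")
# [1] "are blue and then very bright"

# The trailing \b keeps " a" + " a(nd)" from counting as a repeat.
collapse_repeats("this is a and b")
# [1] "this is a and b"

# Same pattern without the trailing \b: the partial repeat is collapsed.
gsub("(\\b.+?\\b)\\1", "\\1", "this is a and b", perl = TRUE)
# [1] "this is and b"
```

The last call is there only to show why the final word boundary matters.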

Live Demo on Regex101

Answer 2

The point here is that in some cases a repeated substring contains shorter repeated substrings. So, to match the longer ones, you can use

(\b.+\b)\1\b

(see the regex demo). To find the shorter substrings instead, I would rely on lazy dot matching:

(\b.+?\b)\1\b

See this regex demo. The replacement string will be \1, a backreference to the part captured by the first grouping construct (...).

You need a PCRE regex for this to work, because there is a documented issue with gsub matching repeated word boundaries (so add the perl=TRUE argument):

POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).

Note that if your repeated substrings can span multiple lines, you can add the PCRE DOTALL modifier (?s) at the start of the pattern (so that . also matches newline characters).

So, the R code looks like

gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", s, perl=TRUE)

or

gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", s, perl=TRUE)

See the IDEONE demo:

text <- "are blue and then and then more and then and then more very bright"
gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", text, perl=TRUE) ## shorter repeated substrings
## [1] "are blue and then more and then more very bright"
gsub("(?s)(\\b.+\\b)\\1\\b", "\\1", text, perl=TRUE) ## longer repeated substrings
## [1] "are blue and then and then more very bright"
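
To illustrate the (?s) note above, here is a small sketch with a made-up input whose repeated chunk contains a line break. The variable names and the sample string are assumptions for illustration only. It assumes the usual PCRE default in R, where . does not match a newline unless (?s) is set.

```r
# Made-up input: the repeated chunk "one\ntwo " contains a line break.
s <- "one\ntwo one\ntwo three"

# Without (?s), "." cannot cross the "\n", so nothing is collapsed.
without_dotall <- gsub("(\\b.+?\\b)\\1\\b", "\\1", s, perl = TRUE)

# With (?s), "." also matches "\n", so the repeated chunk is found.
with_dotall <- gsub("(?s)(\\b.+?\\b)\\1\\b", "\\1", s, perl = TRUE)

identical(without_dotall, s)               # TRUE: input unchanged
identical(with_dotall, "one\ntwo three")   # TRUE: one copy removed
```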

regex

r

gsub