r ngram extraction with regex

Question

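Karl Broman's post (https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/) got me playing with regex and ngrams just for fun. I attempted to use regex to extract 2-grams. I know there are parsers that do this, but I am interested in the regex logic (i.e., it was a self-challenge that I failed to meet).

Below I give a minimal example and the desired output. The problem with my attempt is twofold:

1. The grams (words) get eaten up and are not available for the next pass. How can I make them available for the second pass? (E.g., I want "like" to be available for "like toast" after it has already been consumed in "I like".)
2. I couldn't keep the space between words from being captured (notice the trailing white space in my output, even though I used (?:\s*)). How can I avoid capturing trailing spaces after the nth (in this case second) word? I know this could be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \w, but I don't consider underscores and digits to be word parts, and I do consider ' to be a word part.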

MWE:

library(stringi)

x <- "I like toast and jam."

stringi::stri_extract_all_regex(
    x,
    pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)

## [[1]]
## [1] "I like "    "toast and "

Desired output:

## [[1]]
## [1] "I like"     "like toast" "toast and"  "and jam"

Answer 1

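Here is one way to do it with base R regex. It extends easily to arbitrary n-grams. The trick is to put the capture group inside a positive lookahead assertion, e.g. (?=(my_overlapping_pattern)). The lookahead itself matches zero characters, so no words are consumed and the next match can start at the very next word.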

x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)

# [[1]]
# [1] "I like"     "like toast" "toast and"  "and jam"
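
To illustrate the claim that this extends to arbitrary n-grams, here is a minimal sketch that builds the lookahead pattern for any n. The function name ngram_regex is hypothetical (it is not part of the answer above), and the sketch assumes words are separated by whitespace, using \\s+ where the answer uses a single space. It reads the capture group positions directly from the capture.start and capture.length attributes that gregexpr() returns when perl = TRUE.

```r
# Hypothetical helper: overlapping n-grams via a capture group inside a lookahead.
ngram_regex <- function(x, n = 2) {
  word <- "\\b[A-Za-z']+\\b"
  # Repeat the word pattern n times, joined by whitespace, inside (?=( ... ))
  pattern <- paste0("(?=(", paste(rep(word, n), collapse = "\\s+"), "))")
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  # gregexpr() signals "no match" with a single -1
  if (m[1] == -1) return(character(0))
  # The lookahead match is zero-length, so use the capture group's extent instead
  starts <- attr(m, "capture.start")[, 1]
  lens <- attr(m, "capture.length")[, 1]
  substring(x, starts, starts + lens - 1)
}

x <- "I like toast and jam."
ngram_regex(x, 2)
ngram_regex(x, 3)
```

One caveat of the \\b-based word pattern: an apostrophe is not a word character for \\b, so a contraction such as "don't" contains extra word boundaries and may yield additional partial matches.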

Answer 2

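As a matter of fact, there is an app for that: the quanteda package (for the Quantitative Analysis of Textual Data). My co-author Paul Nulty and I are hard at work improving it, but it already handles the use case you describe with ease.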

install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like"     "like_toast" "toast_and"  "and_jam"
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like"     "like toast" "toast and"  "and jam"
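
The ngrams() call above reflects the quanteda API at the time of this answer. In later quanteda releases, n-grams are built from a tokens object with tokens_ngrams(), and the toLower argument is no longer part of that call. The following is a sketch under that assumption, not the answer author's code, and it requires a recent quanteda installation.

```r
library(quanteda)

x <- "I like toast and jam."

# Tokenise first, dropping punctuation so that "jam." becomes "jam"
toks <- tokens(x, remove_punct = TRUE)

# Build 2-grams joined by a space instead of the default "_"
bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
as.list(bigrams)
```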

No painful regexes required!
