How can I extract 2-4 words on each side of a specific term in R?

Question

How can I extract 2-4 words on each side of a specific term from a string/corpus in R?

Here is an example:

I want to extract the 2 words on either side of 'converse'.

txt <- "Socially when people meet they should converse to present their
       views and listen to other people's opinions to enhance their perspective"

The output should look like this:

"they should converse to present"

Answer 1

I think this should solve your problem:

/((?:\S+\s){2}converse(?:\s\S+){2})/

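Demo: https://regex101.com/r/tS9kB0/1

If you need a different number of words on either side, I think you can see what to change: the two {2} quantifiers control how many words are captured before and after the term.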
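The pattern above is written in regex101 notation, so here is a minimal sketch of how it might be used from base R with the word count as a parameter. The helper name `words_around` is hypothetical, not part of any package, and the sketch assumes the term contains no regex metacharacters. It also relaxes `\s` to `\s+` so that line breaks followed by indentation (as in `txt`) still count as a single gap.

```r
# Hypothetical helper: extract `n` words on each side of `term`.
# Assumes `term` contains no regex metacharacters.
words_around <- function(text, term, n = 2) {
  # Same shape as the regex above, with the count and term filled in.
  # Backslashes are doubled because this is an R string literal.
  pattern <- sprintf("(?:\\S+\\s+){%d}%s(?:\\s+\\S+){%d}", n, term, n)
  # regexpr() finds the first match; regmatches() pulls out its text.
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

txt <- "Socially when people meet they should converse to present their
       views and listen to other people's opinions to enhance their perspective"

words_around(txt, "converse", 2)
# [1] "they should converse to present"

words_around(txt, "converse", 3)
# [1] "meet they should converse to present their"
```

If the term has fewer than `n` words on one side, nothing matches and the result is `character(0)`. Swapping `regexpr` for `gregexpr` would return every occurrence rather than only the first.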

Answer 2

sub('.*?(\\w+ \\w+) (converse) (\\w+ \\w+).*', '\\1 \\2 \\3', txt)
[1] "they should converse to present"

Answer 3

Another way might be to use strsplit:

sapply(strsplit(txt, ' '), function(x) 
paste(x[(which(x %in% 'converse')-2):(which(x %in% 'converse')+2)], collapse= ' '))

#[1] "they should converse to present"
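The snippet above appears to assume that the term occurs exactly once and has at least two words on each side. A slightly more defensive sketch of the same strsplit idea follows. The helper name `window_around` is hypothetical; it clamps the window at the ends of the text and returns one result per occurrence.

```r
# Hypothetical helper: a window of `n` tokens around each occurrence of `term`.
window_around <- function(text, term, n = 2) {
  tokens <- strsplit(text, "\\s+")[[1]]   # split on any run of whitespace
  hits <- which(tokens == term)           # positions of exact matches
  vapply(hits, function(i) {
    lo <- max(1, i - n)                   # clamp at the first token
    hi <- min(length(tokens), i + n)      # clamp at the last token
    paste(tokens[lo:hi], collapse = " ")
  }, character(1))
}

txt <- "Socially when people meet they should converse to present their
       views and listen to other people's opinions to enhance their perspective"

window_around(txt, "converse", 2)
# [1] "they should converse to present"

window_around("converse freely and converse often", "converse", 2)
# [1] "converse freely and"       "freely and converse often"
```

Matching is on whole tokens, so a term with punctuation attached (for example "converse,") would not be found unless the punctuation is stripped first.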

Answer 4

The qdapRegex package (which I maintain) has a canned regex for grabbing words before/after a word, which can be used as follows:

library(qdapRegex)

grab2 <- rm_(pattern=S("@around_", 2, "converse", 2), extract=TRUE)
grab2(txt)

## [[1]]
## [1] "they should converse to present"

To view the regex used:

S("@around_", 2, "converse", 2)
## [1] "(?:[^[:punct:]|\\s]+\\s+){0,2}(converse)(?:\\s+[^[:punct:]|\\s]+){0,2}"


regex

r

text-mining

sentiment-analysis