How can I extract 2-4 words on each side of a specific term in R?
How can I extract 2-4 words on each side of a specific term from a string/corpus in R?
Here is an example:
I want to extract the 2 words on either side of 'converse'.
txt <- "Socially when people meet they should converse to present their
views and listen to other people's opinions to enhance their perspective"
The output should look like this:
"they should converse to present"
I think this should solve your problem:
/((?:\S+\s){2}converse(?:\s\S+){2})/
Demo: https://regex101.com/r/tS9kB0/1
If you need a different number of words on either side, I think you can see what to change: the two {2} quantifiers.
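To try that pattern from R itself, here is a minimal sketch using base R's regexpr and regmatches. The helper name context_pattern is made up for illustration, and it assumes the term contains no regex metacharacters. Backslashes are doubled because the pattern lives inside an R string, and \s+ is used instead of \s so that runs of whitespace (including the line break in txt) are tolerated.

```r
# Hypothetical helper (not from any package): build a pattern that requires
# exactly n whitespace-separated words on each side of `term`.
# Assumes `term` contains no regex metacharacters.
context_pattern <- function(term, n = 2) {
  sprintf("(?:\\S+\\s+){%d}%s(?:\\s+\\S+){%d}", n, term, n)
}

txt <- "Socially when people meet they should converse to present their
views and listen to other people's opinions to enhance their perspective"

# perl = TRUE enables PCRE, which supports \S, \s and (?: ... ) groups
m <- regexpr(context_pattern("converse", 2), txt, perl = TRUE)
result <- regmatches(txt, m)
result
# [1] "they should converse to present"

# Three words on each side instead of two
result3 <- regmatches(txt, regexpr(context_pattern("converse", 3), txt, perl = TRUE))
result3
# [1] "meet they should converse to present their"
```

Because the quantifier is exact, this returns character(0) when fewer than n words are available on either side of the term.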
sub('.*?(\\w+ \\w+) (converse) (\\w+ \\w+).*', '\\1 \\2 \\3', txt)
[1] "they should converse to present"
Another way might be to use strsplit:
sapply(strsplit(txt, ' '), function(x)
paste(x[(which(x %in% 'converse')-2):(which(x %in% 'converse')+2)], collapse= ' '))
#[1] "they should converse to present"
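The strsplit approach above assumes the term occurs exactly once and sits at least two words away from either end of the text; otherwise the index range can run out of bounds or become ambiguous. A minimal sketch of a more defensive variant follows. The function name words_around is hypothetical, and splitting on "\\s+" rather than a single space is my own choice so that the line break in txt does not glue two words together.

```r
# Hypothetical helper (illustrative only): return up to n words on each side
# of every exact occurrence of `term`.
words_around <- function(text, term, n = 2) {
  words <- strsplit(text, "\\s+")[[1]]   # split on any run of whitespace
  hits <- which(words == term)           # positions of every exact match
  vapply(hits, function(i) {
    lo <- max(1, i - n)                  # clamp at the start of the text
    hi <- min(length(words), i + n)      # clamp at the end of the text
    paste(words[lo:hi], collapse = " ")
  }, character(1))
}

txt <- "Socially when people meet they should converse to present their
views and listen to other people's opinions to enhance their perspective"

words_around(txt, "converse", 2)
# [1] "they should converse to present"

words_around(txt, "perspective", 2)   # term is the last word
# [1] "enhance their perspective"

words_around(txt, "their", 1)         # term occurs twice
# [1] "present their views"       "enhance their perspective"
```

A term that never occurs yields character(0) rather than an error.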
The qdapRegex package (which I maintain) has a canned regex for grabbing the words before/after a word, which can be used as follows:
library(qdapRegex)
grab2 <- rm_(pattern=S("@around_", 2, "converse", 2), extract=TRUE)
grab2(txt)
## [[1]]
## [1] "they should converse to present"
To see the regex used:
S("@around_", 2, "converse", 2)
[1] "(?:[^[:punct:]|\\s]+\\s+){0,2}(converse)(?:\\s+[^[:punct:]|\\s]+){0,2}"
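Note the {0,2} quantifiers in that pattern: they accept up to two words per side instead of requiring exactly two, so a term near the start or end of the text still matches. The sketch below rebuilds the same pattern shape in base R to show the difference. It does not call qdapRegex; around_pattern is a hypothetical helper, and its separate before/after arguments are my own assumption for illustration.

```r
# Hypothetical helper (not qdapRegex itself): build a pattern with the same
# shape as the one printed above, allowing 0..before / 0..after words.
around_pattern <- function(term, before = 2, after = 2) {
  word <- "[^[:punct:]|\\s]+"   # a run of non-punctuation, non-space characters
  sprintf("(?:%s\\s+){0,%d}(%s)(?:\\s+%s){0,%d}", word, before, term, word, after)
}

txt <- "Socially when people meet they should converse to present their
views and listen to other people's opinions to enhance their perspective"
edge <- "converse to present their views"   # term is the very first word

full <- regmatches(txt, regexpr(around_pattern("converse"), txt, perl = TRUE))
full
# [1] "they should converse to present"

# {0,2} still matches when no words precede the term
lenient <- regmatches(edge, regexpr(around_pattern("converse"), edge, perl = TRUE))
lenient
# [1] "converse to present"

# An exact {2} quantifier finds nothing in the same text
strict <- regmatches(edge, regexpr("(?:\\S+\\s+){2}converse(?:\\s+\\S+){2}", edge, perl = TRUE))
length(strict)
# [1] 0

# Asymmetric window: 1 word before, 3 after
asym <- regmatches(txt, regexpr(around_pattern("converse", 1, 3), txt, perl = TRUE))
asym
# [1] "should converse to present their"
```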