Extract variations of a string from R column

Question

I have a list of keywords:

keywords=c("Minister", "President","Secretary")

I have a column with different text in each row:

column=c("he is general Secretary of Ozon group",
         "He is vice president of our college",
         "He is health minister",
         "He is education minister")

Is there a way to extract the variations present in the column, based on the keywords?

The output I am looking for is:

output=c("general Secretary","vice president", "education minister", "health minister")

Answer 1

If you want to extract each keyword plus the single word preceding it, you can do this:

# Backslashes must be doubled inside an R string literal
pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")
regmatches(column, gregexpr(pat, column, ignore.case = TRUE))
#[[1]]
#[1] "general Secretary"
#
#[[2]]
#[1] "vice president"
#
#[[3]]
#[1] "health minister"
#
#[[4]]
#[1] "education minister"
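
The result above is a list with one element per row, while the question asks for a plain character vector. A minimal sketch of one way to get there, assuming each row contains at most one match: swap `gregexpr()` for `regexpr()`, which reports only the first match per element. Note that `regmatches()` then silently drops rows without a match, so the result can be shorter than `column`.

```r
keywords <- c("Minister", "President", "Secretary")
column <- c("he is general Secretary of Ozon group",
            "He is vice president of our college",
            "He is health minister",
            "He is education minister")

# One word, one whitespace character, then any of the keywords
pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")

# regexpr() locates the first match in each element;
# regmatches() returns the matched text and skips non-matching elements
output <- regmatches(column, regexpr(pat, column, ignore.case = TRUE))
output
#[1] "general Secretary"  "vice president"     "health minister"
#[4] "education minister"
```

If a row may contain several matches, `unlist()` applied to the `gregexpr()` result is an alternative that keeps all of them.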

Or using stringr:

library(stringr)
pat <- paste0("\\w+\\s(", paste(tolower(keywords), collapse = "|"), ")")
str_extract_all(tolower(column), pat)
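
Lowercasing the input also lowercases the extracted text, so "general Secretary" comes back as "general secretary". A possible variant, sketched here under the assumption that at most one match per row is needed, keeps the original casing by wrapping the pattern in stringr's `regex()` with `ignore_case = TRUE` and uses `str_extract()` to return a character vector (with `NA` for rows that do not match).

```r
library(stringr)

keywords <- c("Minister", "President", "Secretary")
column <- c("he is general Secretary of Ozon group",
            "He is vice president of our college",
            "He is health minister",
            "He is education minister")

pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")

# regex(ignore_case = TRUE) matches case-insensitively without changing the text;
# str_extract() returns the first match per element, or NA when there is none
output <- str_extract(column, regex(pat, ignore_case = TRUE))
output
#[1] "general Secretary"  "vice president"     "health minister"
#[4] "education minister"
```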

nlp

r

text-mining