Extract variations of a string from an R column
I have a list of keywords:
keywords=c("Minister", "President","Secretary")
I have a column with different text in each row:
column=c("he is general Secretary of Ozon group", "He is vice president of our college",
         "He is health minister", "He is education minister")
Is there a way to extract the variations present in the column, based on the keywords?
The output I am looking for is:
output=c("general Secretary","vice president", "education minister", "health minister")
If you want to extract each keyword plus the word that precedes it, you can do this:
# Backslashes must be doubled inside an R string literal
pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")
regmatches(column, gregexpr(pat, column, ignore.case = TRUE))
#[[1]]
#[1] "general Secretary"
#
#[[2]]
#[1] "vice president"
#
#[[3]]
#[1] "health minister"
#
#[[4]]
#[1] "education minister"
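The result above is a list with one element per row, while the question asks for a plain character vector. A minimal sketch of how to flatten it, assuming each row contains at most one match you care about. The trailing `\\b` is an addition that is not in the original pattern; it is there so that a keyword is not matched inside a longer word (for example "ministerial"). The name `result` is only illustrative.

```r
keywords <- c("Minister", "President", "Secretary")
column <- c("he is general Secretary of Ozon group",
            "He is vice president of our college",
            "He is health minister",
            "He is education minister")

# One preceding word, one whitespace character, then any keyword.
# The trailing \\b (an assumption, not in the original) ends the
# match at a word boundary.
pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")\\b")

# gregexpr() finds every match per row; regmatches() extracts them as a list
matches <- regmatches(column, gregexpr(pat, column, ignore.case = TRUE))

# Flatten the list into a single character vector
result <- unlist(matches)
result
```

Rows without a match contribute nothing to `result`, so its length can differ from the length of `column`. If you need to keep the rows aligned, work with the list instead.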
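Or, using stringr (note that this version lowercases both the keywords and the text, so the matches come back in lower case):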
library(stringr)
pat <- paste0("\\w+\\s(", paste(tolower(keywords), collapse = "|"), ")")
str_extract_all(tolower(column), pat)
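If the original capitalization should be preserved, one option is to skip `tolower()` and make the pattern itself case-insensitive. A sketch, assuming stringr is installed; `regex(..., ignore_case = TRUE)` is stringr's pattern modifier, and `result_cs` is just an illustrative name.

```r
library(stringr)

keywords <- c("Minister", "President", "Secretary")
column <- c("he is general Secretary of Ozon group",
            "He is vice president of our college",
            "He is health minister",
            "He is education minister")

pat <- paste0("\\w+\\s(", paste(keywords, collapse = "|"), ")")

# regex(ignore_case = TRUE) matches regardless of case while leaving
# the input text untouched; unlist() flattens the per-row list
result_cs <- unlist(str_extract_all(column, regex(pat, ignore_case = TRUE)))
result_cs
```

This keeps "general Secretary" with its capital S, which is the form shown in the desired output.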