Substring extraction from vector in R

Question

I am trying to extract substrings from unstructured text. For example, assume a vector of country names:

countries <- c("United States", "Israel", "Canada")

How can I pass this vector of character values to extract exact matches from unstructured text?

text.df <- data.frame(ID = c(1:5), 
text = c("United States is a match", "Not a match", "Not a match",
         "Israel is a match", "Canada is a match"))

In this example, the desired output is:

ID     text
1      United States
4      Israel
5      Canada

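So far I have been using gsub to remove all non-matches and then dropping the rows with empty values. I have also been trying str_extract from the stringr package, but have not managed to get the regex argument right. Any assistance would be greatly appreciated!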

Answer 1

1. stringr

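We can first subset 'text.df' using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern in 'grep', then use 'str_extract' to get the matching elements from the 'text' column and assign them to the 'text' column of the subset dataset ('text.df1').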

library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
#  ID          text
#1  1 United States
#4  4        Israel
#5  5        Canada
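
The question asks for exact matches, and a plain alternation such as 'indx' would also match a country name inside a longer word. The following is a minimal sketch of one way to tighten the pattern, assuming "exact" means whole-word matches. The helper `escape_regex` and the name `indx_exact` are hypothetical and not part of the original answer. The sketch uses base R so it runs without extra packages.

```r
countries <- c("United States", "Israel", "Canada")

# Hypothetical helper: backslash-escape regex metacharacters so that each
# name is matched literally (relevant for names such as "U.S.").
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# Wrap the alternation in word boundaries so that "Israel" is not matched
# inside "Israeli".
indx_exact <- paste0("\\b(", paste(escape_regex(countries), collapse = "|"), ")\\b")

x <- c("Israel is a match", "Israeli food is not", "Canada is a match")
grepl(indx_exact, x, perl = TRUE)
# [1]  TRUE FALSE  TRUE
```

The same pattern string should also be usable in place of 'indx' in the 'grep' and 'str_extract' calls above.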

2. base R

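Without using any external packages, we can remove the characters other than those matched by 'indx' (here 'text.df1' is the subset created in the stringr section above).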

text.df1$text <- unlist(regmatches(text.df1$text, 
                           gregexpr(indx, text.df1$text)))
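
The snippet above works on 'text.df1', which comes from the stringr section. As a rough, self-contained sketch of the same idea that starts from the original data, one could use 'regexpr' (first match per row) together with 'regmatches'. The names `m`, `hit` and `result` are illustrative only, and the sketch assumes at most one country per row is wanted.

```r
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match", "Not a match",
                               "Israel is a match", "Canada is a match"),
                      stringsAsFactors = FALSE)

indx <- paste(countries, collapse = "|")

# regexpr() gives the position of the first match in each element, -1 if none.
m <- regexpr(indx, text.df$text)
hit <- m != -1

# regmatches() returns the matched substrings only for elements that matched,
# so it lines up with the IDs of the matching rows.
result <- data.frame(ID = text.df$ID[hit],
                     text = regmatches(text.df$text, m),
                     stringsAsFactors = FALSE)
result
```

This leaves 'text.df' unmodified and needs no packages. Rows mentioning several countries would keep only the first one found.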

3. stringi

We could also use the faster stri_extract from stringi:

library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
#  ID         text1
#1  1 United States
#4  4        Israel
#5  5        Canada

Answer 2

Here is an approach with data.table:

library(data.table)
##
R>  data.table(text.df)[
    sapply(countries, function(x) grep(x,text),USE.NAMES=F),
    list(ID, text = countries)]
   ID          text
1:  1 United States
2:  4        Israel
3:  5        Canada

Answer 3

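Create the pattern p and use strapply to extract the match from each component of text, returning NA for each component with no match. Finally, remove the NA values with na.omit. This is non-destructive (i.e. text.df is not modified):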

library(gsubfn)

p <- paste(countries, collapse = "|")
na.omit(transform(text.df, text = strapply(paste(text), p, empty = NA, simplify = TRUE)))

giving:

  ID          text
1  1 United States
4  4        Israel
5  5        Canada

Using dplyr, it could also be written as follows (using p from above):

library(dplyr)
library(gsubfn)

text.df %>% 
  mutate(text = strapply(paste(text), p, empty = NA, simplify = TRUE)) %>%
  na.omit
