How to mine multiwords from a given text in R?

library(tm)
library(stringr)
txt <- "Netherland Belgium UK Sweden France Russia Government and People"
words <- c("land", "Sweden", "Government and People", "Government", "People")
pattern <- str_c(words,collapse ="|")
cntry <- str_extract_all(txt, pattern)

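Although land does not occur as a standalone word in my text, the code extracts it from the end of Netherland. How can I force the code to match only the whole words listed in words? Current output of cntry: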

 "land"  "Sweden"  "Government and People"

Desired output of cntry:

 "Sweden"  "Government and People"

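If the task really is just to keep land from being extracted from Netherland, add the word-boundary anchor \b in front of land in the vector words. Inside an R string literal the backslash must itself be escaped, so the anchor is written "\\b" (a bare "\b" is a backspace character):

words <- c("\\bland", "Sweden", "Government and People", "Government", "People")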
pattern <- str_c(words,collapse = "|")
str_extract_all(txt, pattern)
[[1]]
[1] "Sweden"                "Government and People"
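
If the goal is to match every entry in words only as a whole word, one option is to wrap the entire alternation in \b anchors instead of anchoring land alone. The following is a minimal sketch, assuming stringr is available and that none of the entries contain regex metacharacters. The names sorted_words, whole_word_pattern and cntry_strict are illustrative and not part of the original post. Sorting the entries longest-first is a precaution so that Government and People is tried before Government.

```r
library(stringr)

txt <- "Netherland Belgium UK Sweden France Russia Government and People"
words <- c("land", "Sweden", "Government and People", "Government", "People")

# Try longer entries first so multiword phrases win over their sub-words
sorted_words <- words[order(nchar(words), decreasing = TRUE)]

# \\b on both sides forces each alternative to match as a whole word;
# (?: ... ) groups the alternation without creating a capture group
whole_word_pattern <- str_c("\\b(?:", str_c(sorted_words, collapse = "|"), ")\\b")

cntry_strict <- unlist(str_extract_all(txt, whole_word_pattern))
print(cntry_strict)
# [1] "Sweden"                "Government and People"
```

With this pattern no entry needs its own anchor, so the words vector can stay as plain text.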

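This is a rather inelegant workaround and too long for a comment, so I am posting it as an answer. land seems to be the problematic string, whereas the other strings can be extracted with str_extract_all as posted in the comments.

In this answer I focus on extracting Netherland given the pattern land. A similar example is extracting Sweden given the pattern den.

Here is a function that does this using regmatches and regexec:

Function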

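return_partials <- function(txt, problem_patterns){
  # For each problem pattern, capture the text that ends in that pattern
  ret_vec <- sapply(problem_patterns, function(z){
    list_output <- regmatches(x = txt, 
                    m = regexec(pattern = paste('[[:space:]]{0,1}?(.*', z, ')', sep = ''), 
                                text = txt))
    # Element 1 is the full match; element 2 is the first capture group
    return(list_output[[1]][2])
  })
  # Drop the names that sapply() attaches from problem_patterns
  return(unname(ret_vec))
}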

Output

> return_partials(txt = txt, problem_patterns = c('land', 'den'))
[1] "Netherland" "Sweden"
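
The same kind of lookup can also be sketched with stringr alone. This is only an illustration, assuming that a "word" is a run of non-whitespace characters and that the fragments contain no regex metacharacters. The helper words_ending_in and the variable partials are hypothetical names, not part of the answer above.

```r
library(stringr)

txt <- "Netherland Belgium UK Sweden France Russia Government and People"

# Hypothetical helper: return every whole word that ends in one of `suffixes`
words_ending_in <- function(txt, suffixes) {
  # \\S* consumes the leading part of the word; \\b requires the suffix
  # to sit at the end of the word
  pattern <- str_c("\\S*(?:", str_c(suffixes, collapse = "|"), ")\\b")
  unlist(str_extract_all(txt, pattern))
}

partials <- words_ending_in(txt, c("land", "den"))
print(partials)
# [1] "Netherland" "Sweden"
```

Unlike return_partials, this returns matches in the order they appear in the text rather than in the order of the supplied fragments.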

You can combine this with Chris Ruehlemann's answer:

words <- c("\\bland", "Sweden", "Government and People", "Government", "People")
pattern <- str_c(words,collapse = "|")
sol1 <- unlist(str_extract_all(txt, pattern))

sol2 <- return_partials(txt = txt, problem_patterns = c('land', 'den'))

> unique(c(sol1, sol2))
[1] "Sweden"                "Government and People" "Netherland"