How to mine multiwords from a given text in R?
library(tm)
library(stringr)
txt <- "Netherland Belgium UK Sweden France Russia Government and People"
words <- c("land", "Sweden", "Government and People", "Government", "People")
pattern <- str_c(words,collapse ="|")
cntry <- str_extract_all(txt, pattern)
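Although land does not occur as a separate word in my text, the code extracts it from the last part of Netherland. How can I force the code to match strictly the words contained in words?
Current output of cntry:
"land" "Sweden" "Government and People"
Desired output of cntry:
"Sweden" "Government and People"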
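If the task really is just to keep land from being extracted from Netherland, you can add the word-boundary anchor \b in front of land in the vector words. Inside an R string the backslash has to be doubled, so the anchor is written "\\b":
words <- c("\\bland", "Sweden", "Government and People", "Government", "People")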
pattern <- str_c(words,collapse = "|")
str_extract_all(txt, pattern)
[[1]]
[1] "Sweden" "Government and People"
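The same idea can be applied to every entry at once instead of only to land. The following is a minimal sketch, assuming none of the entries contain regex metacharacters; the name whole_words is made up for illustration. It wraps the whole alternation in a non-capturing group with a word boundary on each side.

```r
library(stringr)

txt <- "Netherland Belgium UK Sweden France Russia Government and People"
words <- c("land", "Sweden", "Government and People", "Government", "People")

# In an R string literal "\b" is a backspace character; the regex word
# boundary has to be written with a doubled backslash, "\\b".
# Wrapping the alternation in a non-capturing group between two boundaries
# means every entry, not just "land", must match as a whole word.
pattern <- str_c("\\b(?:", str_c(words, collapse = "|"), ")\\b")

# str_extract_all returns a list with one element per input string.
whole_words <- str_extract_all(txt, pattern)[[1]]
print(whole_words)
```

The alternatives are tried from left to right, so the longer entry "Government and People" should stay ahead of "Government" and "People" in words, as it does here.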
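This is a rather inelegant workaround, and too long for a comment, so I am posting it as an answer. land seems to be the problem string, while the other strings can be extracted with str_extract_all as posted in the comments.
In this answer I focus on extracting Netherland given the pattern land. A similar example is extracting Sweden given the pattern den.
Here is a function that does this using regmatches and regexec:
Function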
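return_partials <- function(txt, problem_patterns){
  # For each pattern z, build a regex with one capture group: any text ending in z
  ret_vec <- sapply(problem_patterns, function(z){
    list_output <- regmatches(x = txt,
                              m = regexec(pattern = paste('[[:space:]]{0,1}?(.*', z, ')', sep = ''),
                                          text = txt))
    # Element 1 is the full match; element 2 is the capture group
    return(list_output[[1]][2])
  })
  # Drop the names that sapply attaches to the result
  return(unname(ret_vec))
}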
Output
> return_partials(txt = txt, problem_patterns = c('land', 'den'))
[1] "Netherland" "Sweden"
You can combine this with Chris Ruehlemann's answer:
words <- c("\\bland", "Sweden", "Government and People", "Government", "People")
pattern <- str_c(words,collapse = "|")
sol1 <- unlist(str_extract_all(txt, pattern))
sol2 <- return_partials(txt = txt, problem_patterns = c('land', 'den'))
> unique(c(sol1, sol2))
[1] "Sweden" "Government and People" "Netherland"
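As a possible alternative for recovering the full word around a fragment, a single stringr pattern may be enough. This is only a sketch, assuming each fragment sits at the end of its word (as land and den do in the sample text); the names fragments and full_words are made up for illustration.

```r
library(stringr)

txt <- "Netherland Belgium UK Sweden France Russia Government and People"
fragments <- c("land", "den")

# \\w* absorbs the word characters in front of the fragment; the two \\b
# anchors keep the match inside a single word that ends with the fragment.
pattern <- str_c("\\b\\w*(?:", str_c(fragments, collapse = "|"), ")\\b")

# One input string, so take the first (and only) list element.
full_words <- str_extract_all(txt, pattern)[[1]]
print(full_words)
```

For the sample text this should return Netherland and Sweden, which can be merged with the exact matches through unique(c(...)) in the same way as above.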