Find words in a text divided into sentences in R

Hello, I have a piece of text and I want to retrieve only the sentences that contain specific words. Here is an example.

my_text<- tolower(c("Pro is a molecule that can be found in the air. This molecule spreads glitter and allows bees to fly over the rainbow. For flying, bees need another molecule that is Sub. Sub is activated and so Sub is a substrate. After eating that molecule bees become very speed and they can fly highly. Pro activate Sub. This means that Sub is catalyzed by Pro."))


my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru", 
         "Sab", "Seb", "Sib", "Sob", "Sub"))

sent <- unlist(strsplit(my_text, "\\."))

sent <- sent[grep(pattern = my_words, sent, ignore.case = T)] 

With this code I get the following warning message:

Warning message:
In grep(pattern = my_words, sent, ignore.case = T) :
  argument 'pattern' has length > 1 and only the first element will be used

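How can I avoid this? I want to check all the words in my vector. I looked at the stringr package but could not find a solution.

The code can be changed in any way; I am only showing what I tried.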

You can build a regular-expression pattern from my_words and use it in grep.

my_words <- tolower(c("Pro", "Pra", "Pri", "Pre", "Pru", 
                      "Sab", "Seb", "Sib", "Sob", "Sub"))
sent <- unlist(strsplit(my_text, "\\."))
# Join the words into one alternation pattern: \\bpro\\b|\\bpra\\b|...
grep(paste0('\\b', my_words, '\\b', collapse = '|'), sent, ignore.case = TRUE, value = TRUE)

#[1] "pro is a molecule that can be found in the air"     
#[2] " for flying, bees need another molecule that is sub"
#[3] " sub is activated and so sub is a substrate"        
#[4] " pro activate sub"                                  
#[5] " this means that sub is catalyzed by pro"    

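I have included word boundaries (\b, written as "\\b" in an R string) so that only whole words match. For example, 'pre' will not match inside 'spreads'.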
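To make the effect of the word boundaries concrete, here is a minimal sketch that tests one sentence from the example text with and without \b. The variable names x, no_boundary and with_boundary are only illustrative and are not part of the original post.

```r
# One sentence from the example text (already lower-cased)
x <- "this molecule spreads glitter and allows bees to fly over the rainbow"

# Without boundaries, "pre" is found inside "spreads"
no_boundary <- grepl("pre", x)

# With \\b on both sides, "pre" has to stand alone as a word
with_boundary <- grepl("\\bpre\\b", x)

print(c(no_boundary = no_boundary, with_boundary = with_boundary))
```

The first test succeeds and the second does not, which is why the sentence about glitter is absent from the output above.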

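You can define the words you are looking for as an alternation pattern, wrapping them in \b so that they match only when they occur as whole words (and not as part of other words, e.g. pro in professional), and feed that pattern into the subsetting approach you used in your post. I would also suggest using trimws to trim the whitespace: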

sent <- trimws(unlist(strsplit(my_text, "\\.")))
pattern <- paste0("\\b", my_words, "\\b", collapse = "|")
sent[grepl(pattern, sent)]

You mentioned the stringr package. A solution based on str_detect is:

library(stringr)
sent[str_detect(sent, pattern)]
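
If you also want to know which of the words were found in each sentence, one possible extension in base R is sketched below. It is not part of the answers above: the shortened text, the reduced word list and the name matched are assumptions made for illustration, and each word is tested separately with the same \b pattern.

```r
# Shortened, hypothetical versions of the inputs from the question
my_text <- tolower("Pro is a molecule that can be found in the air. This molecule spreads glitter. Pro activate Sub.")
my_words <- tolower(c("Pro", "Pre", "Sub"))

# Split on full stops and trim the leading whitespace
sent <- trimws(unlist(strsplit(my_text, "\\.")))

# For each sentence, keep the words that occur in it as whole words
matched <- lapply(sent, function(s) {
  hits <- vapply(my_words,
                 function(w) grepl(paste0("\\b", w, "\\b"), s),
                 logical(1))
  my_words[hits]
})
names(matched) <- sent

# Drop the sentences in which none of the words was found
matched <- matched[lengths(matched) > 0]
print(matched)
```

Testing the words one at a time is slower than a single alternation pattern, but it keeps track of which word produced each match.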