Logical string matching

What is the best way to match a set of words against my sentences? Here is a small example:

words <- c("apple", "pear", "grape")
sentences <- c("I have an apple and a pear", "Grape is my favorite", "I don't like pear")

Ideally, the output would look like this:

count  sentence 
2      "I have an apple and a pear"
1      "Grape is my favorite"
1      "I don't like pear"

I tried using str_count, but to no avail. Thanks for your help!

library(stringr)
str_count(sentences, paste0("(?i)\\b(", paste0(words, collapse = "|"), ")\\b"))
[1] 2 1 1

How this works:

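  • (?i): makes the pattern match case-insensitively, so "Grape" matches "grape".
  • \\b ... \\b: word-boundary anchors, so each word only matches as a whole word. Without them you could also match longer words that merely contain one of your words, such as "grapple", which contains "apple". The backslash is doubled because the pattern is written as an R string.
  • ( ... ): a group (strictly speaking a capturing group; (?: ... ) would be non-capturing) whose content is the words joined, or collapsed if you prefer, by the vertical bar |, the alternation metacharacter meaning 'OR'.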
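
To see what the word boundaries buy you, here is a minimal sketch that counts matches with and without them. The sentence in test_sentence is a made-up example, not part of the question's data, and the variable names are only for illustration.

```r
library(stringr)

words <- c("apple", "pear", "grape")
alternation <- paste0(words, collapse = "|")  # "apple|pear|grape"

# Hypothetical test sentence: "grapple" contains "apple" but is a different word
test_sentence <- "I grapple with an apple"

# Without word boundaries, the "apple" inside "grapple" is counted too
loose_count <- str_count(test_sentence, paste0("(?i)(", alternation, ")"))

# With \\b on both sides, only the standalone word "apple" is counted
strict_count <- str_count(test_sentence, paste0("(?i)\\b(", alternation, ")\\b"))

print(c(loose = loose_count, strict = strict_count))
```

Here loose_count comes out as 2 and strict_count as 1, because the bounded pattern skips the "apple" embedded in "grapple".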

If you want to put it in a data frame:

df <- data.frame(
  sentences = sentences,
  count = str_count(sentences, paste0("(?i)\\b(", paste0(words, collapse = "|"), ")\\b")))

Result:

  df
                     sentences count
  1 I have an apple and a pear     2
  2       Grape is my favorite     1
  3          I don't like pear     1