Matching keywords to a series of text comments

Question

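I have two sets of data:

A csv file with a comment in each row, e.g.:

a. I love football
b. Rugby is a tough game
c. Hello World

Another csv file that lists sports-related words, e.g.:

a. tennis
b. football
c. rugby

What I want to do is:

a. Check whether any word from the second file appears at least once in each row of the first file.
b. If a word appears at least once, the comment should be categorised as sports; otherwise as others.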

The output file should look like this:

Comments                          category
  a. I love football               sports
  b. Rugby is a tough game         sports
  c. Hello World                   others

I want to do this exercise in R. I explored the str_detect and grepl functions in R, but could not get the desired output.

I would appreciate your help.

Thanks

Answer 1

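Here is an approach that iterates over the keywords and matches each one against the sentences with grepl. Depending on how clean your sentence data is, you may want to consider agrepl, which allows fuzzy matching (but can also produce false positives).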

df <- data.frame(sentences=c("I love football", "Rugby is a tough game", "Hello World"))
keywords <- c("tennis", "football", "rugby")

cbind(df, sapply(keywords, function(x) grepl(x, df$sentences, ignore.case = TRUE)))

               sentences tennis football rugby
 1       I love football  FALSE     TRUE FALSE
 2 Rugby is a tough game  FALSE    FALSE  TRUE
 3           Hello World  FALSE    FALSE FALSE
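
For the fuzzy variant mentioned above, a minimal sketch could look like the following. The misspelled sentence ("fotball") is made-up data for illustration, and max.distance = 1 (at most one edit) is an assumed setting that would need tuning on real comments.

```r
# Hypothetical data: the first sentence contains a typo ("fotball").
sentences <- c("I love fotball", "Rugby is a tough game", "Hello World")
keywords <- c("tennis", "football", "rugby")

# Exact (substring) matching misses the misspelled keyword.
exact <- sapply(keywords, function(k)
  grepl(k, sentences, ignore.case = TRUE))

# Approximate matching; max.distance = 1 allows at most one
# insertion, deletion, or substitution (an assumed tolerance).
fuzzy <- sapply(keywords, function(k)
  agrepl(k, sentences, max.distance = 1, ignore.case = TRUE))

exact[1, "football"]
fuzzy[1, "football"]
```

A looser tolerance catches more typos but also raises the chance of false positives, so it is worth checking the matches on a sample of the data.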

Rereading your post, if you only want to flag any sport rather than individual sports, you can do this:

cbind(df, sports = rowSums(sapply(keywords, function(x) grepl(x, df$sentences, ignore.case = TRUE))) > 0)
              sentences sports
1       I love football   TRUE
2 Rugby is a tough game   TRUE
3           Hello World  FALSE
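
To get the sports/others labels from the question instead of TRUE/FALSE, one possible sketch is to collapse the keywords into a single regular expression and map the result with ifelse. This assumes the keywords contain no regex metacharacters; the column name category and the \\b word boundaries are choices made here, not part of the original answer.

```r
df <- data.frame(sentences = c("I love football",
                               "Rugby is a tough game",
                               "Hello World"))
keywords <- c("tennis", "football", "rugby")

# Build one alternation pattern: \b(tennis|football|rugby)\b
# Assumes the keywords are plain words without regex metacharacters.
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# Label each comment: "sports" if any keyword matches, else "others".
df$category <- ifelse(
  grepl(pattern, df$sentences, ignore.case = TRUE, perl = TRUE),
  "sports", "others"
)

df
```

With real files, the comments and keywords could be loaded with read.csv and the result saved with write.csv(df, "output.csv", row.names = FALSE), where the file name is only a placeholder.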
