Number of matches for keywords in specified categories

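For a large-scale text analysis problem, I have one data frame containing words that belong to different categories, and another data frame containing a column of strings plus an (empty) count column for each category. I now want to take each individual string, check which of the defined words occur in it, and count them under the appropriate category.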

As a simplified example, given the two data frames below, I want to count how many animals of each type appear in each text cell.

library(tibble)

df_texts <- tibble(
  text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
  grasshopper"),
  mammals=NA,
  reptiles=NA,
  birds=NA,
  insects=NA
)

df_animals <- tibble(animals=c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
           type=c("mammal", "mammal", "reptile", "mammal", "bird", "insect"))

So the result I want is:

df_result <- tibble(
  text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
  grasshopper"),
  mammals=c(2,1,0),
  reptiles=c(0,1,0),
  birds=c(0,0,1),
  insects=c(0,0,1)
)

Is there a straightforward way to implement this keyword matching and counting that also works for larger datasets?

Thanks in advance!

Here is one approach using the tidyverse. First check whether each string in df_texts$text contains each animal, then sum the matches by text and type.

library(tidyverse)

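# Logical matrix: one row per text, one column per animal (TRUE if it appears)
cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>% 
  pivot_longer(-text, names_to = "animals") %>% 
  left_join(df_animals, by = "animals") %>%              # attach each animal's type
  group_by(text, type) %>% 
  summarise(sum = sum(value), .groups = "drop") %>%      # matches per text and type
  pivot_wider(id_cols = text, names_from = type, values_from = sum)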

  text                                   bird insect mammal reptile
  <chr>                                 <int>  <int>  <int>   <int>
1 "the ape and the fox"                     0      0      2       0
2 "the owl and the the \n  grasshopper"     1      0      0       0
3 "the tortoise and the hare"               0      0      1       1
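One limitation of the check above is that grepl only reports whether a word is present, so a word that appears twice in the same text still contributes 1. A minimal sketch of the difference, using a made-up sentence that is not part of the question's data:

```r
library(stringr)

# Hypothetical example text in which "fox" occurs twice
text <- "the fox and the other fox"

has_fox <- grepl("fox", text)      # presence only: TRUE
n_fox   <- str_count(text, "fox")  # number of occurrences: 2
```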

To account for multiple occurrences within each text:

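# Count matrix: one row per text, one column per animal (number of occurrences)
cbind(df_texts[, 1], t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = FALSE))) %>% 
  setNames(c("text", df_animals$animals)) %>% 
  pivot_longer(-text, names_to = "animals") %>% 
  left_join(df_animals, by = "animals") %>%              # attach each animal's type
  group_by(text, type) %>% 
  summarise(sum = sum(value), .groups = "drop") %>%      # occurrences per text and type
  pivot_wider(id_cols = text, names_from = type, values_from = sum)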
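
For larger datasets, a possible variation is to skip the long/wide reshaping and build one regular expression per category instead, so that each category needs only a single vectorised str_count call over all texts. The sketch below is an assumption on my part rather than something benchmarked. It repeats the example data (with the third string simplified to one line) so that it runs on its own, and the names words_by_type, counts, and df_counts are made up for illustration. It also assumes the keywords contain no regex metacharacters, and it adds word boundaries (\\b) so that "ape" would not match inside a longer word such as "grape".

```r
library(tibble)
library(dplyr)
library(stringr)

df_texts <- tibble(
  text = c("the ape and the fox",
           "the tortoise and the hare",
           "the owl and the grasshopper")
)

df_animals <- tibble(
  animals = c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
  type    = c("mammal", "mammal", "reptile", "mammal", "bird", "insect")
)

# One character vector of keywords per category, named by `type`
words_by_type <- split(df_animals$animals, df_animals$type)

# For each category, build a pattern such as "\\b(ape|fox|hare)\\b"
# and count its matches in every text in one vectorised call
counts <- lapply(words_by_type, function(words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  str_count(df_texts$text, pattern)
})

# One count column per category, placed next to the original text
df_counts <- bind_cols(df_texts, as_tibble(counts))
```

The resulting columns are named after the values in type (bird, insect, mammal, reptile), so they would still need renaming if the plural column names from the question are required.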