How to count occurrences of a word/token in a one-token-per-document-per-row tibble

Question

Hello, I have a small issue downstream of a pipeline that uses tidytext::unnest_tokens() and count(category, word, name = "count"). The result looks like this example:

library(dplyr)  # provides tibble(), %>%, mutate() and filter()

owl <- tibble(category = c(0, 1, 2, -1, 0, 1, 2),
              word = c(rep("hello", 3), rep("world", 4)),
              count = sample(1:100, 7))

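I would like to get this tibble with an additional column that gives the number of categories in which the word occurs, i.e. the same number on every row where that word appears.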

I tried the following code, which works in principle. The result is what I want.

owl %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()}))

However, since my data has several thousand rows, this takes quite a long time. Is there a more efficient way to achieve this?

Answer 1

We can use add_count:

library(dplyr)

 owl %>% 
   add_count(word)

Output:

  category word  count     n
     <dbl> <chr> <int> <int>
1        0 hello    98     3
2        1 hello    30     3
3        2 hello    37     3
4       -1 world    22     4
5        0 world    80     4
6        1 world    18     4
7        2 world    19     4
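As a hedged illustration of what add_count(word) does: it behaves like grouping by word, adding n(), and ungrouping. The sketch below uses a fixed toy tibble (owl_demo, with made-up counts) instead of the random one from the question, and assumes the new column should be called sum_t as in the question's own attempt. The last variant, which counts distinct categories, is only needed if (category, word) pairs could repeat; after count(category, word) they should not.

```r
library(dplyr)

# Toy data shaped like the question's tibble; the counts are made up for illustration
owl_demo <- tibble(
  category = c(0, 1, 2, -1, 0, 1, 2),
  word     = c(rep("hello", 3), rep("world", 4)),
  count    = c(5L, 3L, 8L, 2L, 7L, 1L, 4L)
)

# add_count() with a custom name for the new column
res_add <- owl_demo %>% add_count(word, name = "sum_t")

# The longer equivalent: group, count rows per group, ungroup
res_grp <- owl_demo %>%
  group_by(word) %>%
  mutate(sum_t = n()) %>%
  ungroup()

# If (category, word) pairs could repeat, count distinct categories instead
res_distinct <- owl_demo %>%
  group_by(word) %>%
  mutate(sum_t = n_distinct(category)) %>%
  ungroup()

print(res_add)
```

On data that has already been through count(category, word), all three variants should give the same column, so add_count is simply the shortest way to write it.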

Answer 2

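I tried a few solutions and ran a microbenchmark on them. I added TarJae's proposal to the benchmark. I also wanted to use the amazing ave function to see how it compares with the dplyr solutions.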

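library(dplyr)
library(ggplot2)         # needed for autoplot() and theme_bw()
library(microbenchmark)

n <- 500

# Synthetic data: 5 random 10-character words spread over n rows
owl2 <- tibble(
  category = sample(-10:10, n, replace = TRUE),
  word = sample(stringi::stri_rand_strings(5, 10), n, replace = TRUE),
  count = sample(1:100, n, replace = TRUE))

mb <- microbenchmark(
  # original row-by-row approach from the question
  op = owl2 %>% mutate(sum_t = sapply(1:nrow(.), function(x) {filter(., word == .$word[[x]]) %>% nrow()})),
  group_by = owl2 %>% group_by(word) %>% mutate(n = n()),
  add_count = owl2 %>% add_count(word),
  # numeric first argument so that ave() returns numeric counts rather than character
  ave = cbind(owl2, n = ave(seq_along(owl2$word), owl2$word, FUN = length)),
  times = 50L)

autoplot(mb) + theme_bw()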

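The conclusion is that the elegant add_count solution speeds things up considerably compared with the original approach and will save you a lot of time.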
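Before comparing timings it can be worth confirming that the candidates return the same counts. The following is a small sketch under a few assumptions: a fixed seed, base R's letters in place of the stringi-generated words, and a vapply() loop as a simplified stand-in for the question's row-by-row filter() approach. The names owl_chk, n_op, n_add, n_grp and n_ave are made up for this check.

```r
library(dplyr)

set.seed(42)  # assumed seed, only to make the sketch reproducible
n <- 200
owl_chk <- tibble(
  category = sample(-10:10, n, replace = TRUE),
  word     = sample(letters[1:5], n, replace = TRUE),
  count    = sample(1:100, n, replace = TRUE)
)

# Row-by-row counting (stand-in for the question's sapply/filter approach)
n_op <- vapply(owl_chk$word,
               function(w) sum(owl_chk$word == w),
               integer(1), USE.NAMES = FALSE)

# dplyr approaches
n_add <- add_count(owl_chk, word)$n
n_grp <- (owl_chk %>% group_by(word) %>% mutate(n = n()) %>% ungroup())$n

# Base R: ave() with a numeric first argument so the result stays numeric
n_ave <- ave(seq_len(n), owl_chk$word, FUN = length)

# TRUE only if every approach agrees on every row
all_equal <- all(n_op == n_add, n_add == n_grp, n_grp == n_ave)
print(all_equal)
```

If the approaches agree, the choice between them comes down to speed and readability, which is what the benchmark above measures.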

r

text-mining