Calculate `tf-idf` for a data frame of documents

The following code

library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words

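is taken from Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles and estimates the tf-idf in Jane Austen's works. However, this code appears to be specific to Jane Austen's books. I would like to derive the tf-idf for the following data frame: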

sentences<-c("The color blue neutralizes orange yellow reflections.", 
             "Zod stabbed me with blue Kryptonite.", 
             "Because blue is your favourite colour.",
             "Red is wrong, blue is right.",
             "You and I are going to yellowstone.",
             "Van Gogh looked for some yellow at sunset.",
             "You ruined my beautiful green dress.",
             "You do not agree.",
             "There's nothing wrong with green.")

df <- data.frame(text = sentences,
                 class = c("A","B","A","C","A","B","A","C","D"),
                 weight = c(1,1,3,4,1,2,3,4,5))

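You need to change two things:

  1. Because you did not set stringsAsFactors = FALSE when constructing the data.frame, text is stored as a factor (the default before R 4.0.0), so you need to convert it to character first.

  2. You do not have a column named book, which means you have to select some other column as the document. Since your example includes a column named class, I assume you want to compute tf-idf over this column.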
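The first point can be checked with base R alone. This is a minimal sketch, not part of the original answer: it passes stringsAsFactors explicitly so that it behaves the same on any R version, and the names df_fct, df_chr and df_fix are illustrative only.

```r
sentences <- c("Zod stabbed me with blue Kryptonite.",
               "You do not agree.")

# Emulate the pre-4.0.0 default explicitly: text becomes a factor
df_fct <- data.frame(text = sentences, stringsAsFactors = TRUE)
class(df_fct$text)

# Option 1: keep strings as character when building the data frame
df_chr <- data.frame(text = sentences, stringsAsFactors = FALSE)

# Option 2: convert an existing factor column afterwards
df_fix <- df_fct
df_fix$text <- as.character(df_fix$text)

# Both options end up with the same character vector
identical(df_chr$text, df_fix$text)
```

Either option should work here; converting inside the pipeline with mutate() simply avoids rebuilding the data frame.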

The full code is as follows:

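library(dplyr)
library(tidytext)

# Split each sentence into words, then count words per class
book_words <- df %>%
  mutate(text = as.character(text)) %>%  # factor -> character before tokenizing
  unnest_tokens(output = word, input = text) %>%
  count(class, word, sort = TRUE)

# Treat each class as one document when computing tf-idf
book_words <- book_words %>%
  bind_tf_idf(term = word, document = class, n = n)
book_words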
#> # A tibble: 52 x 6
#>    class word          n     tf   idf tf_idf
#>    <fct> <chr>     <int>  <dbl> <dbl>  <dbl>
#>  1 A     blue          2 0.0769 0.288 0.0221
#>  2 A     you           2 0.0769 0.693 0.0533
#>  3 C     is            2 0.2    0.693 0.139 
#>  4 A     and           1 0.0385 1.39  0.0533
#>  5 A     are           1 0.0385 1.39  0.0533
#>  6 A     beautiful     1 0.0385 1.39  0.0533
#>  7 A     because       1 0.0385 1.39  0.0533
#>  8 A     color         1 0.0385 1.39  0.0533
#>  9 A     colour        1 0.0385 1.39  0.0533
#> 10 A     dress         1 0.0385 1.39  0.0533
#> # ... with 42 more rows
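The first rows of this output can be checked by hand. The following worked sketch assumes the definitions that bind_tf_idf() uses: term frequency as a share of the document's word count, and inverse document frequency with the natural logarithm. The symbols n_{t,d}, N and N_t are notation introduced here, not taken from the answer. Class A contains 26 tokens in total, there are N = 4 classes, "blue" occurs in three of them (A, B, C), and "you" occurs in two (A, C).

```latex
\begin{align*}
\mathrm{tf}(t,d) &= \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{N_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t) \\[4pt]
\text{blue in A:}\quad \mathrm{tf} &= \tfrac{2}{26} \approx 0.0769, \quad
\mathrm{idf} = \ln\tfrac{4}{3} \approx 0.288, \quad
\text{tf-idf} \approx 0.0221 \\
\text{you in A:}\quad \mathrm{tf} &= \tfrac{2}{26} \approx 0.0769, \quad
\mathrm{idf} = \ln\tfrac{4}{2} \approx 0.693, \quad
\text{tf-idf} \approx 0.0533
\end{align*}
```

Words that occur in only one class, such as "beautiful", get idf = ln 4 ≈ 1.39, which matches the remaining rows.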

The documentation has helpful notes on this; check out ?count and ?bind_tf_idf.