Calculate `tf-idf` for a data frame of documents
The following code
library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)
book_words <- book_words %>%
  bind_tf_idf(word, book, n)
book_words
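is taken from Term Frequency and Inverse Document Frequency (tf-idf) Using Tidy Data Principles and estimates the tf-idf in Jane Austen's works. However, this code appears to be specific to Jane Austen's books. I would like to compute the tf-idf for the following data frame: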
sentences <- c("The color blue neutralizes orange yellow reflections.",
               "Zod stabbed me with blue Kryptonite.",
               "Because blue is your favourite colour.",
               "Red is wrong, blue is right.",
               "You and I are going to yellowstone.",
               "Van Gogh looked for some yellow at sunset.",
               "You ruined my beautiful green dress.",
               "You do not agree.",
               "There's nothing wrong with green.")
df <- data.frame(text = sentences,
                 class = c("A","B","A","C","A","B","A","C","D"),
                 weight = c(1,1,3,4,1,2,3,4,5))
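You need to change two things:
1. Because `stringsAsFactors = FALSE` was not set when constructing the `data.frame`, `text` has to be converted to character first. (This applies to R versions before 4.0.0, where `stringsAsFactors` defaulted to `TRUE`.)
2. You do not have a column named `book`, so you have to select some other column as the `document`. Since your example includes a column named `class`, I assume you want to compute tf-idf over that column.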
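As a side note on the first point, a minimal sketch of the alternative: setting `stringsAsFactors = FALSE` at construction time keeps `text` as character, so no later conversion is needed. The name `df_chr` and the two-sentence sample are illustrative only and are not part of the original question.

```r
# Illustrative two-row sample; df_chr is a hypothetical name.
df_chr <- data.frame(
  text = c("Because blue is your favourite colour.",
           "You do not agree."),
  class = c("A", "C"),
  stringsAsFactors = FALSE  # keep strings as character in every R version
)

# text is character here, so unnest_tokens() can consume it directly.
is.character(df_chr$text)
```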
The code is as follows:
library(dplyr)
library(tidytext)
book_words <- df %>%
  mutate(text = as.character(text)) %>%  # factor -> character
  unnest_tokens(output = word, input = text) %>%
  count(class, word, sort = TRUE)
book_words <- book_words %>%
  bind_tf_idf(term = word, document = class, n = n)
book_words
#> # A tibble: 52 x 6
#>    class word          n     tf   idf tf_idf
#>    <fct> <chr>     <int>  <dbl> <dbl>  <dbl>
#>  1 A     blue          2 0.0769 0.288 0.0221
#>  2 A     you           2 0.0769 0.693 0.0533
#>  3 C     is            2 0.2    0.693 0.139
#>  4 A     and           1 0.0385 1.39  0.0533
#>  5 A     are           1 0.0385 1.39  0.0533
#>  6 A     beautiful     1 0.0385 1.39  0.0533
#>  7 A     because       1 0.0385 1.39  0.0533
#>  8 A     color         1 0.0385 1.39  0.0533
#>  9 A     colour        1 0.0385 1.39  0.0533
#> 10 A     dress         1 0.0385 1.39  0.0533
#> # ... with 42 more rows
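To see where these numbers come from, the first row can be checked by hand. The sketch below uses base R only and assumes the usual definitions (term frequency as count over total tokens in the document, idf as the natural log of documents over documents containing the term), which is consistent with the values printed above. The variable names are made up for this illustration.

```r
# Hand check of the "blue" row for class A.
# Class A combines sentences 1, 3, 5 and 7, which tokenize to 7 + 6 + 7 + 6 words.
n_blue_in_A   <- 2              # "blue" occurs in sentences 1 and 3
total_words_A <- 7 + 6 + 7 + 6  # 26 tokens in document A

n_docs           <- 4           # classes A, B, C and D
n_docs_with_blue <- 3           # "blue" appears in A, B and C

tf     <- n_blue_in_A / total_words_A
idf    <- log(n_docs / n_docs_with_blue)  # natural logarithm
tf_idf <- tf * idf

round(c(tf = tf, idf = idf, tf_idf = tf_idf), 4)
```

Rounded to three significant digits, this agrees with the `tf`, `idf` and `tf_idf` values shown for `blue` in class `A`.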
The documentation has helpful notes on this; check out `?count` and `?bind_tf_idf`.
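If each sentence, rather than each class, should count as a document, one possible variation is to add a row identifier and pass it as `document`. This is only a sketch under that assumption: the column name `doc_id` and the three-sentence sample `df_small` are hypothetical and not part of the original question.

```r
library(dplyr)
library(tidytext)

# Small stand-in for the question's data frame (first three sentences only).
df_small <- data.frame(
  text = c("The color blue neutralizes orange yellow reflections.",
           "Zod stabbed me with blue Kryptonite.",
           "Because blue is your favourite colour."),
  stringsAsFactors = FALSE
)

sentence_words <- df_small %>%
  mutate(doc_id = row_number()) %>%               # hypothetical per-row document id
  unnest_tokens(output = word, input = text) %>%  # one lowercase token per row
  count(doc_id, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = doc_id, n = n)

sentence_words
```

In this small sample `blue` occurs in every sentence, so its idf, and therefore its tf-idf, is zero.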