Getting tf-idf when documents are defined by two columns
I am using tidytext for text analysis, and I am trying to compute tf-idf for a corpus. The standard way to do this is:
book_words <- book_words %>%
bind_tf_idf(word, book, n)
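However, in my case a 'document' is not defined by a single column (such as book). Is it possible to call bind_tf_idf when a document is defined by two columns (for example, book and chapter)?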
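Why not concatenate the two columns? For example:
library(tidyverse)
library(tidytext)
library(janeaustenr)

# Word counts per book
book_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  ungroup()

# Assign a random chapter (1 to 10) to each row, only so the example has a second column
book_words$chapter <- sample(1:10, nrow(book_words), replace = TRUE)

book_words %>%
  # Combine book and chapter into a single document identifier
  unite("book_chapter", book, chapter) %>%
  bind_tf_idf(word, book_chapter, n) %>%
  print() %>%
  # Split the identifier back into its two columns (chapter comes back as character)
  separate(book_chapter, c("book", "chapter"), sep = "_") %>%
  arrange(desc(tf_idf))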
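If you would rather not build and split a combined key, the same quantities can be computed by hand with dplyr grouping. The sketch below assumes tf is a term's share of all counts in its document and idf is the natural log of (number of documents / number of documents containing the term), which is my understanding of what bind_tf_idf computes. The toy data and the names counts, n_docs and manual_tf_idf are hypothetical, and the code assumes exactly one row per (book, chapter, word).

```r
library(dplyr)

# Hypothetical toy counts: one row per (book, chapter, word).
counts <- tibble::tribble(
  ~book, ~chapter, ~word,  ~n,
  "A",   1,        "sea",  3,
  "A",   1,        "ship", 1,
  "A",   2,        "sea",  2,
  "B",   1,        "sea",  1,
  "B",   1,        "moor", 4
)

# A document is a (book, chapter) pair, so count the distinct pairs.
n_docs <- nrow(distinct(counts, book, chapter))

manual_tf_idf <- counts %>%
  # Term frequency: share of the document's total word count
  group_by(book, chapter) %>%
  mutate(tf = n / sum(n)) %>%
  # Inverse document frequency: n() is the number of documents containing
  # the word, given one row per (document, word)
  group_by(word) %>%
  mutate(idf = log(n_docs / n())) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf)

print(manual_tf_idf)
```

This keeps book and chapter as separate columns with their original types throughout. The unite approach above is shorter, so this is mainly useful when the separator could appear in one of the key columns.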