Why does tf-idf truncate words?

Question

I have a data frame x:

> str(x)
'data.frame':   117654 obs. of  2 variables:
$ text  : chr  "more about " ...
$ doc_id: chr  "Text 1" "Text 2" "Text 3" "Text 4" ...

I can't post the dput output here because it is too large. I am trying to compute TF-IDF and wrote this code:

library(dplyr)
library(janeaustenr)
library(tidytext)
book_words <- x %>%
  mutate(text = as.character(text)) %>% 
  unnest_tokens(output = word, input = text) %>%
  count(doc_id, word, sort = TRUE)

book_words <- book_words %>%
  bind_tf_idf(term = word, document = doc_id, n)

book_words<-book_words[order(book_words$tf_idf,decreasing=FALSE),]
book_words = book_words[!duplicated(book_words$word),]

Anyway, I noticed that some words appear to be truncated in book_words. For example:

             doc_id          word n  tf      idf    tf_idf
 792727  Text 33268     disposabl 1 1.0 11.67321 11.673214

I'm sure this is a truncated term, because if I run:

x[grepl("^disposabl$",x$text),]

I don't get any rows.

Has anyone run into this before?

Answer 1

From your output, it looks like the word has leading white-space. If it were just "disposabl" with no leading/trailing blanks, I would expect

            doc_id      word n tf      idf   tf_idf
 792727 Text 33268 disposabl 1  1 11.67321 11.67321
 ###              ^         ^   one space each

But your output shows

             doc_id          word n  tf      idf    tf_idf
 792727  Text 33268     disposabl 1 1.0 11.67321 11.673214
                    ^^^^  four extra blanks

This means your "^disposabl$" is too strict. Try filtering (here) with:

x[grepl("disposabl$",x$text),]

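Dropping the leading ^ allows anything to come before the d. Alternatives:

"\\bdisposabl$" adds a word boundary, so "adisposabl" will not match but "a disposabl" still will;
"^\\s*disposabl$" requires that anything before the word is white-space;
trim the white-space first with x[grepl("^disposabl$", trimws(x$text)), ], and your original pattern will work here.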
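
To see how these patterns differ, here is a minimal sketch in base R. The data frame below is a made-up stand-in for x (the five strings are invented for illustration, not taken from the real data), and the backslashes are doubled because the patterns are written as R string literals.

```r
# Hypothetical stand-in for the asker's data frame; all values are invented.
x <- data.frame(
  text   = c("disposabl", "    disposabl", "a disposabl", "adisposabl", "disposable"),
  doc_id = paste("Text", 1:5),
  stringsAsFactors = FALSE
)

# One logical column per pattern, so the differences can be read row by row.
checks <- data.frame(
  text     = x$text,
  strict   = grepl("^disposabl$", x$text),          # whole string must be exactly the word
  suffix   = grepl("disposabl$", x$text),           # anything may precede the word
  boundary = grepl("\\bdisposabl$", x$text),        # a word boundary must precede the d
  padded   = grepl("^\\s*disposabl$", x$text),      # only white-space may precede the word
  trimmed  = grepl("^disposabl$", trimws(x$text)),  # strip white-space, then match strictly
  stringsAsFactors = FALSE
)
print(checks)

# Selecting rows of a data frame needs the trailing comma inside the brackets.
hits <- x[grepl("^disposabl$", trimws(x$text)), ]
print(hits)
```

On this toy input the strict pattern matches only the first string, while the padded and trimmed variants also accept the one with leading blanks. If none of the variants return rows on the real data, leading white-space is probably not the whole explanation.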
