How to calculate tf-idf for a single term after getting the tf-idf matrix?

A while back I got help building a tf-idf matrix for my documents, and got the output I wanted (see below).

TagSet <- data.frame(emoticon = c("\U0001f914", "\U0001f4aa", "\U0001f603", "\U0001f953", "\U0001f37a"),
                     stringsAsFactors = FALSE)

TextSet <- data.frame(tweet = c("Sharp, adversarial⚔️~pro choice~ban Pit Bulls☠️~BSL️~aberant psychology~common sense~the Piper will lead us to reason~sealskin woman",
                                "Blocked by Owen, Adonis. Abbott & many #FBPE Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement ",
                                " #healthy #vegetarian #beatchronicillness fix infrastructure",
                                "LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
                                "I #BackTheBlue for my son! Facts Over Feelings. Border Security saves lives! #ThankYouICE",
                                " I play Pedal Steel @CooderGraw & #CharlieShafter #GoStars #LiberalismIsAMentalDisorder",
                                "#Englishman  #Londoner  @Chelseafc  ️‍♂️   ",
                                "F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
                                "❄️Do not dwell in tbaconhe past, do not dream of the future, concentrate the mind on the present moment.️❄️",
                                "Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro  | Hello intro on the Minds Link |"),
                      stringsAsFactors = FALSE)


library(dplyr)
library(quanteda)

tweets_dfm <- dfm(tokens(TextSet$tweet))  # convert to document-feature matrix

tweets_dfm %>% 
  dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
  dfm_tfidf() %>%                 # weight with tfidf
  convert("data.frame")           # turn into data.frame to display more easily

#     document   \U0001f914   \U0001f4aa   \U0001f603   \U0001f953   \U0001f37a
# 1     text1      1.39794            1            0            0            0
# 2     text2      0.00000            0            1            0            0
# 3     text3      0.00000            0            0            0            0
# 4     text4      0.00000            0            0            0            0
# 5     text5      0.00000            0            0            0            0
# 6     text6      0.69897            0            0            0            0
# 7     text7      0.00000            0            0            1            1
# 8     text8      0.00000            0            0            0            0
# 9     text9      0.00000            0            0            0            0
# 10   text10      0.00000            0            0            0            0

But I need some help calculating the tf-idf of each individual term. That is, how do I correctly get a single tf-idf value per term out of the matrix?

# terms        tfidf
# \U0001f914   # its tf-idf, calculated the correct way
# \U0001f4aa   # its tf-idf, calculated the correct way
# \U0001f603   # its tf-idf, calculated the correct way
# \U0001f953   # its tf-idf, calculated the correct way
# \U0001f37a   # its tf-idf, calculated the correct way

I'm fairly sure it isn't as simple as summing all the tf-idf values in a term's matrix column and dividing by the number of documents it appears in, with the result being that term's value.

I have looked at a few sources, for example https://stats.stackexchange.com/questions/422750/how-to-calculate-tf-idf-for-a-single-term, but the discussion there goes beyond what I have read so far.

My grasp of text-mining/analysis terminology is still weak.

In short, you cannot compute a tf-idf value for a feature in isolation from its document context, because each feature's tf-idf value is document-specific.

More specifically:

  • The (inverse) document frequency is one value per feature, so it is indexed by $j$.
  • The term frequency is one value per term per document, so it is indexed by $i,j$.
  • tf-idf is therefore indexed by $i,j$.
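If the single per-feature component is what you are after, that is the (inverse) document frequency, which quanteda exposes directly through docfreq(). A minimal sketch, assuming the tweets_dfm and TagSet objects defined in the question:

```r
library(quanteda)

# document frequency: the number of documents each feature occurs in (indexed by j)
docfreq(dfm_select(tweets_dfm, TagSet$emoticon))

# inverse document frequency: log10(N / docfreq), i.e. the per-feature
# part of tf-idf, with no document index
docfreq(dfm_select(tweets_dfm, TagSet$emoticon), scheme = "inverse")
```

These are one value per feature, but they are not tf-idf; multiplying them by the document-specific term frequencies is what produces the $i,j$-indexed matrix above.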

You can see this in the example:

> tweets_dfm %>% 
+   dfm_tfidf() %>%
+   dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
+   as.matrix()
        features
docs     \U0001f914 \U0001f4aa \U0001f603 \U0001f953 \U0001f37a
  text1     1.39794          1          0          0          0
  text2     0.00000          0          1          0          0
  text3     0.00000          0          0          0          0
  text4     0.00000          0          0          0          0
  text5     0.00000          0          0          0          0
  text6     0.69897          0          0          0          0
  text7     0.00000          0          0          1          1
  text8     0.00000          0          0          0          0
  text9     0.00000          0          0          0          0
  text10    0.00000          0          0          0          0

Two further points:

  1. Averaging by feature is not really a meaningful operation, given that the inverse document frequency is already a kind of average, or at least an inverse proportion of the documents in which the term occurs. Moreover, it is usually logged, so some transformation would be needed before computing an average.


  2. Above, I computed the tf-idf before removing the other features, since the order matters if you use relative ("normalized") term frequencies. dfm_tfidf() uses term counts by default, so the results here are unaffected.
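To see why the order would matter with relative term frequencies, compare the two orderings with scheme_tf = "prop" (a real dfm_tfidf() argument). A sketch, again assuming the objects from the question:

```r
library(dplyr)
library(quanteda)

# tf-idf from relative term frequencies, computed BEFORE selecting:
# proportions are taken over all features in each document
before_select <- tweets_dfm %>%
  dfm_tfidf(scheme_tf = "prop") %>%
  dfm_select(TagSet$emoticon)

# same weighting AFTER selecting: proportions are now taken over the
# emoticon features only, so the row totals (and hence values) differ
after_select <- tweets_dfm %>%
  dfm_select(TagSet$emoticon) %>%
  dfm_tfidf(scheme_tf = "prop")
```

With the default scheme_tf = "count", feature selection does not change the counts of the remaining features, which is why the results in the answer are unaffected.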