How to calculate tf-idf for a single term after getting the tf-idf matrix?
A while back I got help constructing tf-idf weights for my documents and got the output I wanted (see below).
TagSet <- data.frame(emoticon = c("\U0001f914", "\U0001f4aa", "\U0001f603", "\U0001f953", "\U0001f37a"), # emoticon features, written as Unicode escapes
                     stringsAsFactors = FALSE)
TextSet <- data.frame(tweet = c("Sharp, adversarial⚔️~pro choice~ban Pit Bulls☠️~BSL️~aberant psychology~common sense~the Piper will lead us to reason~sealskin woman",
"Blocked by Owen, Adonis. Abbott & many #FBPE Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement ",
" #healthy #vegetarian #beatchronicillness fix infrastructure",
"LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
"I #BackTheBlue for my son! Facts Over Feelings. Border Security saves lives! #ThankYouICE",
" I play Pedal Steel @CooderGraw & #CharlieShafter #GoStars #LiberalismIsAMentalDisorder",
"#Englishman #Londoner @Chelseafc ️♂️ ",
"F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
"❄️Do not dwell in tbaconhe past, do not dream of the future, concentrate the mind on the present moment.️❄️",
"Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro | Hello intro on the Minds Link |"),
stringsAsFactors = FALSE)
library(dplyr)
library(quanteda)
tweets_dfm <- dfm(tokens(TextSet$tweet)) # convert to document-feature matrix
tweets_dfm %>%
dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
dfm_tfidf() %>% # weight with tfidf
convert("data.frame") # turn into data.frame to display more easily
#    document \U0001f914 \U0001f4aa \U0001f603 \U0001f953 \U0001f37a
# 1 text1 1.39794 1 0 0 0
# 2 text2 0.00000 0 1 0 0
# 3 text3 0.00000 0 0 0 0
# 4 text4 0.00000 0 0 0 0
# 5 text5 0.00000 0 0 0 0
# 6 text6 0.69897 0 0 0 0
# 7 text7 0.00000 0 0 1 1
# 8 text8 0.00000 0 0 0 0
# 9 text9 0.00000 0 0 0 0
# 10 text10 0.00000 0 0 0 0
But I need some help calculating the tf-idf of each individual term. That is, how do I get the correct tf-idf value for each term out of the matrix? Something like:
# terms        tfidf
# \U0001f914   its tf-idf, calculated the correct way
# \U0001f4aa   its tf-idf, calculated the correct way
# \U0001f603   its tf-idf, calculated the correct way
# \U0001f953   its tf-idf, calculated the correct way
# \U0001f37a   its tf-idf, calculated the correct way
I'm fairly sure it is not as simple as summing all of a term's tf-idf values down its matrix column and then dividing by the number of documents it appears in, and calling that the term's value.
I have looked at a few sources, e.g. https://stats.stackexchange.com/questions/422750/how-to-calculate-tf-idf-for-a-single-term, but the question asked there went over my head given what I have read so far.
I'm currently weak on text-mining/analysis terminology.
In short, you cannot compute a tf-idf value for each feature in isolation from its document context, because each feature's tf-idf value is specific to a document.
More specifically:
- (inverse) document frequency is one value per feature, so it is indexed by $j$
- term frequency is one value per term per document, so it is indexed by $i,j$
- tf-idf is therefore also indexed by $i,j$ (see the formula sketch below)
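To make the indexing concrete, here is a sketch of the weighting under dfm_tfidf()'s defaults (raw term counts and a base-10 logarithm for the inverse document frequency): the weight of feature $j$ in document $i$ is

$$\text{tf-idf}_{i,j} = \text{tf}_{i,j} \times \log_{10}\frac{N}{\text{df}_j}$$

where $N$ is the number of documents and $\text{df}_j$ is the number of documents containing feature $j$. Only the $\log_{10}(N/\text{df}_j)$ factor is a single per-feature number; the full weight always depends on the document as well.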
You can see this in your example:
> tweets_dfm %>%
+ dfm_tfidf() %>%
+ dfm_select(TagSet$emoticon) %>% # only leave emoticons in the dfm
+ as.matrix()
features
docs \U0001f914 \U0001f4aa \U0001f603 \U0001f953 \U0001f37a
text1 1.39794 1 0 0 0
text2 0.00000 0 1 0 0
text3 0.00000 0 0 0 0
text4 0.00000 0 0 0 0
text5 0.00000 0 0 0 0
text6 0.69897 0 0 0 0
text7 0.00000 0 0 1 1
text8 0.00000 0 0 0 0
text9 0.00000 0 0 0 0
text10 0.00000 0 0 0 0
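As a minimal check by hand, assuming those defaults: \U0001f914 has a non-zero weight only in text1 and text6, so it occurs in 2 of the 10 documents and its idf is log10(10/2) = 0.69897; text6 contains it once (1 * 0.69897 = 0.69897) and text1 must contain it twice (2 * 0.69897 = 1.39794). The same arithmetic in code:
idf <- log10(ndoc(tweets_dfm) / docfreq(tweets_dfm)["\U0001f914"]) # per-feature idf = 0.69897
tf  <- as.matrix(tweets_dfm[c("text1", "text6"), "\U0001f914"])    # per-document counts: 2 and 1
tf * idf                                                           # 1.39794 and 0.69897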
Two further points:
Averaging by feature is not really a meaningful operation, given that a feature's inverse document frequency is already a kind of average, or at least the inverse of the proportion of documents in which the term occurs. In addition, this is usually logged, so it would need some transformation before an average could be computed.
Above, I computed the tf-idf weights before removing the other features, because if you used relative ("normalized") term frequencies, selecting features first would change the per-document token totals that those proportions are computed from. dfm_tfidf() uses term counts by default, so the results here are unaffected. A sketch of the difference follows.
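To make that last point concrete, here is a sketch using the scheme_tf = "prop" option of dfm_tfidf(): with proportional term frequencies, dfm_select() changes the per-document token totals that the proportions are divided by, so the two orders of operations no longer give the same result.
# weight first, then select: proportions are computed over all features
dfm_tfidf(tweets_dfm, scheme_tf = "prop") %>%
  dfm_select(TagSet$emoticon) %>%
  as.matrix()
# select first, then weight: proportions are computed over the emoticons only
dfm_select(tweets_dfm, TagSet$emoticon) %>%
  dfm_tfidf(scheme_tf = "prop") %>%
  as.matrix()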