How do I resolve data loss and an error with TermDocumentMatrix() and DocumentTermMatrix(), respectively?
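I have Twitter data with 1,000 samples, and I am trying to run tf and tf-idf analyses on it to measure how important each emoticon is to the tweets. There are 437 unique emoticons and 810 tweets in total.
My current problem with TermDocumentMatrix is that not all of the terms are displayed. With DocumentTermMatrix, on the other hand, I get an error that I cannot resolve. Here is a working code snippet: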
library(dplyr)
library(tidytext)
library(tm)
library(tidyr)
# These are NOT from my data; they are random fake bios I found online, just to make this code snippet
tweets_data <- c("Sharp, adversarial⚔️~pro choice~ban Pit Bulls☠️~BSL️~aberant psychology~common sense~the Piper will lead us to reason~sealskin woman",
"Blocked by Owen, Adonis. Abbott & many #FBPE Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement ",
" #healthy #vegetarian #beatchronicillness fix infrastructure",
"LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
"I #BackTheBlue for my son! Facts Over Feelings. Border Security saves lives! #ThankYouICE",
" I play Pedal Steel @CooderGraw & #CharlieShafter #GoStars #LiberalismIsAMentalDisorder",
"#Englishman #Londoner @Chelseafc ️♂️ ",
"F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
"❄️Do not dwell in the past, do not dream of the future, concentrate the mind on the present moment.️❄️",
"Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro | Hello intro on the Minds Link |")
emoticons_data <- c("","","","","")
TagSet <- data.frame(emoticons_data)
colnames(TagSet) <- "emoticon"
TextSet <- data.frame(tweets_data)
colnames(TextSet) <- "tweet"
myCorpus <- tm::Corpus(tm::VectorSource(TextSet$tweet))
tdm <- tm::TermDocumentMatrix(myCorpus, control= list(stopwords=T))
tdm_onlytags <- tdm[rownames(tdm)%in%TagSet$emoticon, ]
tm::inspect(tdm_onlytags) # Only shows 1 term, not 5
#View(as.matrix(tdm_onlytags[1:tdm_onlytags$nrow, 1:tdm_onlytags$ncol])) #just to see in new window
Also, when I try to compute tf-idf, all I get is an error message. I have looked around, but I cannot work out where my mistake is.
tdm <- tm::as.DocumentTermMatrix(myCorpus, control= list(weighting= weightTfIdf))
tdm #Original= Error in dim(data) <- dim : dims [product 810] do not match the length of object [3]
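This is my first time using the tm package.
I changed your original data slightly, because each of your emoticons appeared only once in the text, which turns every tf-idf value into 1 (see below; I just added a few at random). I am using quanteda instead of tm, since it is faster and has fewer encoding problems.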
library(dplyr)
library(quanteda)
# Recent quanteda releases require tokens() before dfm()
tweets_dfm <- dfm(tokens(TextSet$tweet)) # tokenize, then build the document-feature matrix
tweets_dfm %>%
  dfm_select(TagSet$emoticon) %>% # keep only the emoticons in the dfm
  dfm_tfidf() %>%                 # weight with tf-idf
  convert(to = "data.frame")      # turn into a data.frame for easier display
#>    document <U+0001F914> <U+0001F4AA> <U+0001F603> <U+0001F953> <U+0001F37A>
#> 1     text1      1.39794            1            0            0            0
#> 2     text2      0.00000            0            1            0            0
#> 3     text3      0.00000            0            0            0            0
#> 4     text4      0.00000            0            0            0            0
#> 5     text5      0.00000            0            0            0            0
#> 6     text6      0.69897            0            0            0            0
#> 7     text7      0.00000            0            0            1            1
#> 8     text8      0.00000            0            0            0            0
#> 9     text9      0.00000            0            0            0            0
#> 10   text10      0.00000            0            0            0            0
The column names (i.e., the emoticons) display correctly in my viewer, and it should be possible to export the resulting data.frame.
Data
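The numbers in the output above are consistent with tf-idf computed as the raw term count multiplied by log10(N / document frequency), which, as far as I know, is the default scheme of dfm_tfidf(). Here is a minimal base R sketch of that arithmetic. The matrix `counts` and the names `emoji_a` and `emoji_b` are hypothetical placeholders, not the real data.

```r
# Hypothetical toy counts: 10 documents (rows) and 2 emoticons (columns).
counts <- matrix(0, nrow = 10, ncol = 2,
                 dimnames = list(paste0("text", 1:10), c("emoji_a", "emoji_b")))
counts["text1", "emoji_a"] <- 2  # used twice in text1
counts["text6", "emoji_a"] <- 1  # used once in text6
counts["text1", "emoji_b"] <- 1  # used once, in a single document

n_docs   <- nrow(counts)               # N = 10
doc_freq <- colSums(counts > 0)        # number of documents containing each term
idf      <- log10(n_docs / doc_freq)   # inverse document frequency, base 10
tfidf    <- sweep(counts, 2, idf, `*`) # multiply every column by its idf

round(tfidf[c("text1", "text6"), ], 5)
```

An emoticon that occurs once in exactly one of 10 documents gets 1 * log10(10 / 1) = 1, which is why every value was 1 before I added a few extra emoticons.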
TagSet <- data.frame(emoticon = c("\U0001F914", "\U0001F4AA", "\U0001F603", "\U0001F953", "\U0001F37A"),
                     stringsAsFactors = FALSE)
TextSet <- data.frame(tweet = c("Sharp, adversarial⚔️~pro choice~ban Pit Bulls☠️~BSL️~aberant psychology~common sense~the Piper will lead us to reason~sealskin woman",
"Blocked by Owen, Adonis. Abbott & many #FBPE Love seaside, historic houses & gardens, family & pets. RTs & likes/ Follows may=interest not agreement ",
" #healthy #vegetarian #beatchronicillness fix infrastructure",
"LIBERTY-IDENTITARIAN. My bio, photo at Site Info. And kindly add my site to your Daily Favorites bar. Thank you, Eric",
"I #BackTheBlue for my son! Facts Over Feelings. Border Security saves lives! #ThankYouICE",
" I play Pedal Steel @CooderGraw & #CharlieShafter #GoStars #LiberalismIsAMentalDisorder",
"#Englishman #Londoner @Chelseafc ️♂️ ",
"F*** the Anti-White Agenda #Christian #Traditional #TradThot #TradGirl #European #MAGA #AltRight #Folk #Family #WhitePride",
"❄️Do not dwell in the past, do not dream of the future, concentrate the mind on the present moment.️❄️",
"Ordinary girl in a messed up World | Christian | Anti-War | Anti-Zionist | Pro-Life | Pro | Hello intro on the Minds Link |"),
stringsAsFactors = FALSE)
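If you would rather stay with tm, the dimension error in the question is consistent with passing a corpus to as.DocumentTermMatrix(). That function converts existing matrix-like objects, whereas DocumentTermMatrix() builds the matrix from a corpus. A minimal sketch, assuming the tm package is installed; the toy `docs` vector is hypothetical and deliberately contains no emoticons, so it only illustrates the call and not the emoticon tokenization:

```r
library(tm)

# Hypothetical toy documents, not the Twitter data
docs   <- c("cats chase mice", "dogs chase cats", "mice fear cats")
corpus <- Corpus(VectorSource(docs))

# Build the matrix from the corpus and apply tf-idf weighting in one step
dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

inspect(dtm_tfidf)
```

Note that weightTfIdf() normalizes term frequency by document length and uses log2 by default, so its values will not match the quanteda output above.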