R Text Mining with quanteda
I have a dataset of Facebook posts (exported via Netvizz) and I am working with the quanteda package in R. Here is my R code:
# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read file
# Facebook posts can be exported with FB Netvizz
# https://apps.facebook.com/netvizz
# Load FB posts as a .csv file (extracted from the .zip file)
fbpost <- read.csv("D:/FB-com.csv", sep = ";")
# Select the relevant column (one column with 2,700 entries)
fb_test <- as.character(fbpost$comment_message)
# Build the corpus
fb_corp <- corpus(fb_test)
class(fb_corp)
# Apply the LIWC dictionary
fb_liwc <- dfm(fb_corp, dictionary = liwcdict)
View(fb_liwc)
Everything works fine until:
> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
... indexing 2,760 documents
... tokenizing texts, found 77,923 total tokens
... cleaning the tokens, 1584 removed entirely
... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", :
invalid 'dimnames' given for data frame
How would you interpret the error message? Any suggestions for solving the problem?
There was a bug in quanteda version 0.7.2 that caused dfm() to fail when applying a dictionary if one of the documents contains no features. Your example fails because, during the cleaning stage, some of the Facebook post "documents" end up having all of their features removed by the cleaning steps.

This is not only fixed in 0.8.0; we also changed the underlying implementation of dictionaries in dfm(), resulting in a significant speed improvement. (LIWC is still a large and complicated dictionary, and its regular expressions still make it much slower to apply than simply indexing tokens; we will work on optimising this further.)
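If upgrading right away is not an option, one workaround (a sketch, assuming your texts are in the `fb_test` character vector from the question) is to drop texts that contain no word characters before building the corpus, so no document can end up featureless after cleaning:

```r
# Keep only texts containing at least one alphanumeric character;
# empty or punctuation-only posts would otherwise lose all of their
# features during cleaning and trigger the dimnames error in 0.7.2.
keep <- grepl("[[:alnum:]]", fb_test)
fb_corp <- corpus(fb_test[keep])
```

Note that this changes the document indexing, so keep `keep` around if you need to map results back to the original rows of your .csv file.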
devtools::install_github("kbenoit/quanteda")
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
## ... indexing 57 documents
## ... lowercasing
## ... tokenizing
## ... shaping tokens into data.table, found 134,024 total tokens
## ... applying a dictionary consisting of 68 key entries
## ... summing dictionary-matched features by document
## ... indexing 68 feature types
## ... building sparse matrix
## ... created a 57 x 68 sparse dfm
## ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
##  Fillers    Nonfl    Swear       TV   Eating    Sleep    Groom    Death   Sports   Sexual
##        0        0        0       42       47       49       53       76       81      100
It will also work if a document contains zero features after tokenization and cleaning, which is probably what broke the older dfm() on your Facebook texts:
mytexts <- inaugTexts
mytexts[3] <- ""
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm)==0)
## 1797-Adams
## 3
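The `rowSums()` check above also suggests a cleanup step: once the dfm exists, featureless documents can be dropped by row-indexing, since a dfm is a sparse matrix. A sketch, using a toy base-R matrix to show the same indexing logic:

```r
# Toy stand-in for a dfm: three "documents" by two "features",
# where text2 matched nothing in the dictionary
m <- matrix(c(1, 2,
              0, 0,
              3, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(docs = c("text1", "text2", "text3"),
                            features = c("Sleep", "Death")))
m_nonempty <- m[rowSums(m) > 0, ]   # drops "text2"
rownames(m_nonempty)                # "text1" "text3"
```

On the real object, `mydfm <- mydfm[rowSums(mydfm) > 0, ]` should behave the same way, because dfm objects inherit from the Matrix sparse-matrix classes.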