How to remove a column of words in a document term matrix?

Question

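I trained my machine learning model on a training dataset represented as a document term matrix. I am now trying to predict on my test dataset, but unfortunately it contains words that the training dataset does not have.

My question is how to actually remove the words in my test dataset that are not found in the training dataset.

I am using the tm package and have created a DocumentTermMatrix.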

Answer 1

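An easy way is to use the quanteda text analysis package. After constructing a document-feature matrix, you can select its features from a second "dfm". This lets you construct the dfm for the training set and then easily select the features in the test set that are identical to those in the training set.

Here is the illustration from the ?selectFeatures help page: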

require(quanteda)
textVec1 <- c("This is text one.", "This, the second text.", "Here: the third text.")
textVec2 <- c("Here are new words.", "New words in this text.")
features(dfm1 <- dfm(textVec1))
#
#   ... lowercasing
#   ... tokenizing
#   ... indexing documents: 3 documents
#   ... indexing features: 8 feature types
#   ... created a 3 x 8 sparse dfm
#   ... complete. 
# Elapsed time: 0.077 seconds.
# [1] "this"   "is"     "text"   "one"    "the"    "second" "here"   "third" 

features(dfm2a <- dfm(textVec2))
#
#   ... lowercasing
#   ... tokenizing
#   ... indexing documents: 2 documents
#   ... indexing features: 7 feature types
#   ... created a 2 x 7 sparse dfm
#   ... complete. 
# Elapsed time: 0.006 seconds.
# [1] "here"  "are"   "new"   "words" "in"    "this"  "text" 

(dfm2b <- selectFeatures(dfm2a, dfm1))
# found 3 features from 8 supplied types in a dfm, padding 0s for another 5 
# Document-feature matrix of: 2 documents, 8 features.
# 2 x 8 sparse Matrix of class "dfmSparse"
#       this is text one the second here third
# text1    0  0    0   0   0      0    1     0
# text2    1  0    1   0   0      0    0     0
identical(features(dfm1), features(dfm2b))
# [1] TRUE
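
The help-page example above uses features() and selectFeatures(), which newer quanteda releases no longer provide under those names. As a minimal sketch, assuming a recent quanteda version where featnames() and dfm_match() are available, the same idea might look like this (the variable names train_txt, test_txt, dfm_train, dfm_test and dfm_test_matched are made up for illustration):

```r
library(quanteda)

train_txt <- c("This is text one.", "This, the second text.", "Here: the third text.")
test_txt  <- c("Here are new words.", "New words in this text.")

# Build one dfm per set; newer quanteda versions expect tokens() to be called first.
dfm_train <- dfm(tokens(train_txt, remove_punct = TRUE))
dfm_test  <- dfm(tokens(test_txt, remove_punct = TRUE))

# Conform the test dfm to the training features: test-only features are
# dropped, and training features absent from the test set become zero columns.
dfm_test_matched <- dfm_match(dfm_test, features = featnames(dfm_train))

# Check that both matrices now share the same feature set, in the same order.
identical(featnames(dfm_train), featnames(dfm_test_matched))
```

The matched dfm has one column per training feature, so it can be passed to a model fitted on dfm_train without a column mismatch.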

Answer 2

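There are a few ways to do this. One is that when you create the DTM from your training data you get a list of terms, so you can write a small function that joins the two term lists for you. Here is an example; it may not be efficient, but it should work:

dtm is the document term matrix you built from the training data. Document is the new document (as a data frame) that you want to evaluate the model on: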

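    library(tm)    # Corpus, DataframeSource, DocumentTermMatrix, nDocs
    library(slam)  # col_sums

    DocumentVectortfidf <- function(dtm, Document) {

      corpus <- Corpus(DataframeSource(Document))

      dtm1 <- DocumentTermMatrix(corpus)  # built from the new Document

      # Build two data frames and then merge them.
      # Terms and counts found in the new document:
      Data <- data.frame(dtm1$dimnames$Terms[dtm1$j], dtm1$v)
      colnames(Data)[1:2] <- c("Words", "Frequency")

      # Every term of the training dtm:
      Matrixwords <- data.frame(dtm$dimnames$Terms, 0)
      colnames(Matrixwords)[1] <- "Words"

      # all.x = TRUE keeps every training term and drops words that appear
      # only in the new Document.
      Joint <- merge(Matrixwords, Data, by = "Words", all.x = TRUE, sort = TRUE)
      Joint$Frequency <- ifelse(is.na(Joint$Frequency), 0, Joint$Frequency)
      # Restore the column order of dtm so the weights below line up.
      Joint <- Joint[match(dtm$dimnames$Terms, Joint$Words), ]

      # Important: tf uses only values from the Document, but tfidf uses counts
      # across the entire list of documents, so dtm is used here. For plain tf,
      # drop these two lines and the multiplication by lnrs below.
      cs <- col_sums(dtm > 0)
      lnrs <- log2(nDocs(dtm) / cs)

      DocumentVector <- data.frame(t(Joint$Frequency * lnrs))

      DocumentVector
    }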

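There are several other ways to do this. It can also be done with a dictionary: extract the list of words from dtm (the one built from the training data), then use that list as the dictionary when creating dtm1 for the new document. Hope this helps.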
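
The dictionary approach described above could be sketched as follows. This is only an illustration under assumptions: the sample texts and the names train_txt, test_txt, dtm_train and dtm_test are invented here, and it relies on the dictionary entry of the control list in tm's DocumentTermMatrix.

```r
library(tm)

train_txt <- c("this is text one", "this the second text", "here the third text")
test_txt  <- c("here are new words", "new words in this text")

# DTM for the training data; its terms define the vocabulary the model knows.
dtm_train <- DocumentTermMatrix(VCorpus(VectorSource(train_txt)))

# DTM for the test data, restricted to the training vocabulary: words that
# are not in the dictionary are left out of the resulting matrix.
dtm_test <- DocumentTermMatrix(
  VCorpus(VectorSource(test_txt)),
  control = list(dictionary = Terms(dtm_train))
)

# Check that both matrices have the same terms in the same order.
identical(Terms(dtm_train), Terms(dtm_test))
```

Compared with the merge-based function, this avoids building intermediate data frames, since the filtering happens while the test DTM is created.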

text-mining

tm