Importing a Term Document Matrix in CSV format into R

Question

I already have a TDM, but it is in Excel, so I saved it as a CSV. Now I want to run some analysis on it, but I can't load it as a TDM with the tm package. My CSV looks like this:

          item01  item02  item03  item04
red         0       1       1       0
circle      1       0       0       1
fame        1       0       0       0
yellow      0       0       1       1
square      1       0       1       0

I can't load this file as a TDM. The closest I have come so far is:

myDTM <- as.DocumentTermMatrix(df, weighting = weightBin)

But it fills every cell with 1:

<<DocumentTermMatrix (documents: 2529, terms: 1952)>>
Non-/sparse entries: 4936608/0
Sparsity           : 0%
Maximal term length: 27
Weighting          : binary (bin)
Sample             :

             Terms
Docs            item01 item02 item03 item04
      Red        1        1     1       1                
      Circle     1        1     1       1          
      fame       1        1     1       1

I tried converting to a corpus first, among other things, but any function I then call, such as inspect(tdm), returns an error like this one or similar:

Error in `[.simple_triplet_matrix`(x, docs, terms) :

I can't believe there is no way to import it in the correct format. Any suggestions? Thanks in advance.

Answer 1

First, try converting the CSV to a sparse matrix. My CSV is not the same as yours because I typed it in myself, but the idea is the same.

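> library(tm)
> library(Matrix)
> # Use the first column as row names; read the four count columns as integers
> myDF <- read.csv("my.csv", row.names = 1, colClasses = c('character', rep('integer', 4)))
> # Convert to a sparse matrix so that only the non-zero cells are stored
> mySM <- Matrix(as.matrix(myDF), sparse = TRUE)
> myDTM <- as.DocumentTermMatrix(mySM, weighting = weightBin)
> inspect(myDTM)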

<<DocumentTermMatrix (documents: 5, terms: 4)>>
Non-/sparse entries: 7/13
Sparsity           : 65%
Maximal term length: 6
Weighting          : binary (bin)
Sample             :
        Terms
Docs     item01 item02 item03 item04
  circle      1      1      0      0
  fame        1      0      0      0
  red         0      0      0      0
  square      1      0      1      0
  yellow      0      0      1      1 
>
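
The table in the question has terms in rows and documents in columns, so the same idea can also be sketched with as.TermDocumentMatrix instead of as.DocumentTermMatrix. This is only a minimal sketch under assumptions: the temporary CSV, the "term" header for the first column, and the names csv_path, term_df, term_sm and my_tdm are made up for illustration and are not from the original post. A likely reason the original attempt showed 1 everywhere is that every cell, zeros included, ended up stored explicitly (the output reports 4936608 non-sparse entries and 0 sparse ones), and binary weighting then marks every stored cell as 1. Going through a sparse matrix avoids storing the zeros.

```r
library(tm)
library(Matrix)

# Recreate the question's small table as a CSV so the sketch is self-contained.
# The file and the "term" column header are assumptions for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "term,item01,item02,item03,item04",
  "red,0,1,1,0",
  "circle,1,0,0,1",
  "fame,1,0,0,0",
  "yellow,0,0,1,1",
  "square,1,0,1,0"
), csv_path)

# The first column becomes the row names, so only numeric counts remain as data.
term_df <- read.csv(csv_path, row.names = 1)

# Sparse matrix: zero cells are not stored.
term_sm <- Matrix(as.matrix(term_df), sparse = TRUE)

# Rows are terms and columns are documents, so build a TermDocumentMatrix.
my_tdm <- as.TermDocumentMatrix(term_sm, weighting = weightBin)

# Sanity check: the dense view matches the CSV cell by cell (matched by name).
dense <- as.matrix(my_tdm)[rownames(term_df), colnames(term_df)]
stopifnot(all(dense == as.matrix(term_df)))

inspect(my_tdm)
```

If a DocumentTermMatrix is needed afterwards, transposing the result with t(my_tdm) should give one.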

