The 'dictionary' parameter of TermDocumentMatrix does not work in R
Even when I add keywords to 'dictionary' as in the code below, they are not extracted from the sentences.
Sample code
library(tm)
data = c('a', 'a b', 'c')
keyword = c('a', 'b')
data = VectorSource(data)
corpus = VCorpus(data)
tdm = TermDocumentMatrix(corpus, control = list(dictionary = keyword))
Result of the code above
inspect(tdm)
<<TermDocumentMatrix (terms: 2, documents: 3)>>
Non-/sparse entries: 0/6
Sparsity           : 100%
Maximal term length: 1
Weighting          : term frequency (tf)
Sample             :
     Docs
Terms 1 2 3
    a 0 0 0
    b 0 0 0
The expected result is:
Terms 1 2 3
    a 1 1 0
    b 0 1 0
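You have to pass the minimum word length in the control list, which is handed on to termFreq. By default wordLengths is c(3, Inf), so single-character terms such as 'a' and 'b' are never counted, even when they are listed in 'dictionary'.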
tdm = TermDocumentMatrix(corpus, control = list(dictionary = keyword, wordLengths = c(1, Inf)))
as.matrix(tdm)
     Docs
Terms 1 2 3
    a 1 1 0
    b 0 1 0
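To see the effect of the word-length filter directly, the two calls can be compared side by side. This is only a minimal sketch that reuses the data from the question; the variable names (docs, tdm_default, tdm_short, counts_default, counts_short) are made up for illustration, and it assumes the tm package is installed.

```r
library(tm)

docs    <- c("a", "a b", "c")
keyword <- c("a", "b")
corpus  <- VCorpus(VectorSource(docs))

# Default control: wordLengths = c(3, Inf), so the one-character
# tokens "a" and "b" are not counted.
tdm_default <- TermDocumentMatrix(corpus,
                                  control = list(dictionary = keyword))

# Lower the minimum word length to 1 so short dictionary terms are kept.
tdm_short <- TermDocumentMatrix(corpus,
                                control = list(dictionary = keyword,
                                               wordLengths = c(1, Inf)))

counts_default <- as.matrix(tdm_default)
counts_short   <- as.matrix(tdm_short)

sum(counts_default)  # total count with the default filter
sum(counts_short)    # total count once one-character words are allowed
```

Going by the outputs shown above, the first total should be 0 and the second 3. The same wordLengths setting is needed for any dictionary containing terms shorter than three characters.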