Accessing certain elements from the unique words/terms per document

Question

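This code gives the output as a matrix. However, repeated words such as is, am, and i should be left out. I only want a matrix containing cool, mark, and neo4j. I tried grep("cool", tdm), but it does not work here. Is there another way?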

output: tdm
       Docs
Terms   1 2
  am    2 0
  cool  0 2
  i     2 0
  is    0 2
  mark  2 0
  neo4j 0 2

Answer 1

A small code example based on your example.

library(tm)
text <- c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus <- VCorpus(VectorSource(text))

# wordLengths set to a minimum of 3 (basically the default), which removes all words of length 1 and 2
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf)))
as.matrix(tdm)

# only words cool and mark
# create a dictionary
my_dict <- c("cool", "mark")
tdm <- TermDocumentMatrix(corpus, control = list(dictionary = my_dict))
as.matrix(tdm)
      Docs
Terms  1 2
  cool 0 2
  mark 2 0

Be careful when converting a term-document matrix into a normal matrix. If you have a lot of text, this can use a lot of memory.
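
One way to limit that memory cost, sketched here on the assumption that only a handful of terms are needed, is to subset the sparse term-document matrix first and convert only the small result. The names `tdm_all`, `wanted`, `idx`, `tdm_small`, and `m_small`, and the `wordLengths = c(1, Inf)` setting, are illustrative choices rather than part of the code above.

```r
library(tm)

text <- c("I am Mark I am Mark", "Neo4j is cool Neo4j is cool")
corpus <- VCorpus(VectorSource(text))

# Keep words of every length so the matrix resembles the one in the question
tdm_all <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Terms to keep (illustrative choice)
wanted <- c("cool", "mark", "neo4j")

# Terms() returns the term names; find the row positions of the wanted terms
idx <- which(Terms(tdm_all) %in% wanted)

# Subset while the matrix is still sparse, then convert only the small result
tdm_small <- tdm_all[idx, ]
m_small <- as.matrix(tdm_small)
m_small
```

The same idea works with pattern matching: grep("cool", Terms(tdm_all)) searches the term names rather than the matrix object and returns row positions that can be used in place of idx.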

But looking at your question, you need to read up on text mining.

Start here with tidy text-mining.

Here is information on text mining with quanteda.

And read the vignette of tm.

And of course search SO for examples. A lot of questions have already been answered in some form.

r

tm