R - tm package: Reduce the number of terms in a term matrix for the creation of a term-term adjacency visualization

I'm having trouble producing a reproducible term-term adjacency visualization of my corpus, which contains roughly 800K words.

I'm following a tutorial whose term matrix comprises only 20 terms, so the resulting visualization works nicely:

I think my problem is that I can't reduce my term matrix to, say, the 50 most relevant terms in my corpus. I found a comment on a site outside SO that might help, but I haven't been able to adapt it to my needs. In that comment, someone suggested playing with the bounds when creating the term matrix, so I ended up with this code:

dtm2 <- DocumentTermMatrix(ds4.1g, control=list(wordLengths=c(1, Inf),
        bounds=list(global=c(floor(length(ds4.1g)*0.3), floor(length(ds4.1g)*0.6)))))


tdm92.1g2 <- removeSparseTerms(dtm2, 0.99)

tdm2.1g2 <- tdm92.1g2

# Create a Boolean matrix (1 if a term occurs in a doc at all,
# so the product below counts shared documents, not raw term counts)
tdm3.1g <- as.matrix(tdm2.1g2)  # inspect() only prints a preview; as.matrix() returns the full matrix
tdm3.1g[tdm3.1g >= 1] <- 1

# Transform into a term-term adjacency matrix
# (dtm2 has documents as rows and terms as columns, hence the crossproduct)
termMatrix.1gram <- t(tdm3.1g) %*% tdm3.1g

So, if I understand correctly, I can make the term matrix pick up only those terms that appear in at least 30% of my documents, but in no more than 60% of them.

However, no matter how I define these bounds, my term matrix termMatrix.1gram always ends up with 115K elements, which makes the visualization I'm after impossible. Is there a way to limit it to just 50 elements?
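(As a sanity check on the bounds, one can at least confirm how many terms survive before the adjacency step; a minimal sketch, using the dtm2 defined above:)

nTerms(dtm2)   # number of terms kept by the bounds
dim(dtm2)      # documents x terms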

How do I get the corpus?

For clarity's sake, here is the code I use to generate the corpus with the tm package:

# specify the directory of the files.
folderdir <- paste0(dirname(myFile),"/", project, "/")

#load the corpus.
corpus <- Corpus(DirSource(folderdir, encoding = "UTF-8"), readerControl=list(reader=readPlain,language="de"))
#cleanse the corpus.
ds0.1g <- tm_map(corpus, content_transformer(tolower))
ds1.1g <- tm_map(ds0.1g, content_transformer(removeWords), stopwords("german"))
ds2.1g <- tm_map(ds1.1g, stripWhitespace)
ds3.1g <- tm_map(ds2.1g, removePunctuation)
ds4.1g <- tm_map(ds3.1g, stemDocument)
ds4.1g <- tm_map(ds4.1g, removeNumbers)
ds5.1g   <- tm_map(ds4.1g, content_transformer(removeWords), c("a", "b", "c", "d", "e", "f","g","h","i","j","k","l",
                                                               "m","n","o","p","q","r","s","t","u","v","w","x","y","z"))
# create matrices (note: these use ds4.1g; the ds5.1g created above is not used).
tdm.1g <- TermDocumentMatrix(ds4.1g)
dtm.1g <- DocumentTermMatrix(ds4.1g)
# reduce the sparsity.
tdm89.1g <- removeSparseTerms(tdm.1g, 0.89)
tdm9.1g  <- removeSparseTerms(tdm.1g, 0.9)
tdm91.1g <- removeSparseTerms(tdm.1g, 0.91)
tdm92.1g <- removeSparseTerms(tdm.1g, 0.92)

tdm2.1g <- tdm92.1g

As you can see, this is the conventional way of obtaining the corpus with the tm package. The texts were originally stored in separate txt files inside a folder on my computer.

"my problem is that I am not able to reduce my term matrix to, let's say, the 50 most relevant terms"

如果"relevancy"表示频率,你可以这样做:

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
dtm <- DocumentTermMatrix(crude)
head(as.matrix(tdm))
# keep the 50 terms with the largest total frequency (row sums = overall counts)
tdm <- tdm[names(tail(sort(rowSums(as.matrix(tdm))), 50)), ]
tdm
# <<TermDocumentMatrix (terms: 50, documents: 20)>>
# ...
# same idea for the document-term matrix, using column sums
dtm <- dtm[, names(tail(sort(colSums(as.matrix(dtm))), 50))]
inspect(dtm)
# <<DocumentTermMatrix (documents: 20, terms: 50)>>
# ...
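From there, the term-term adjacency matrix the question asks for is one step away. A minimal sketch, building on the 50-term tdm from the block above (variable names are illustrative):

# binarize: 1 if the term occurs in the document at all, 0 otherwise
m <- as.matrix(tdm)
m[m >= 1] <- 1

# terms are rows in a TermDocumentMatrix, so term-term co-occurrence is m %*% t(m)
termMatrix <- m %*% t(m)
dim(termMatrix)   # 50 x 50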

@agustin: If "relevance" means a set of specific, predetermined terms (named entities, organizations, or phrases, say), you can subset the matrix with a short list of those terms. For example, in the crude dataset you might want to check "oil prices are expected to rise", "oil prices are expected to fall", "nigerian conflict", "iran oil" and "severe US winter":

tdm <- TermDocumentMatrix(crude)
short.list <- c("oil prices are expected to rise", "oil prices are expected to fall",
                "nigerian conflict", "iran oil", "severe US winter")
tdm.short.list <- tdm[rownames(tdm) %in% short.list, ]

HTH

To reduce the number of terms, I prefer the quanteda package, because you can pick the exact number of terms to keep and then convert the document-feature matrix to other object types if you need functions from other packages.

topfeatures() returns the counts of the top n terms. You access the terms themselves by taking the labels() of that vector.

You can then subset your quanteda dfm by drilling down into the feature names.

Here is an example from one of my projects, where I went from over 120K terms down to just 16K:

library(quanteda)
length(char_vec)

# [1] 758917

train.tokens <- tokens(char_vec, what = "word", ngrams = 1)
train.tokens <- tokens_select(train.tokens, stopwords(), selection = "remove")
train.tokens.dfm <- dfm(train.tokens)
dim(train.tokens.dfm)

# [1] 758917 128560

a <- topfeatures(train.tokens.dfm, n = 16000, decreasing = TRUE, scheme = "count")
# Be sure to take the labels, because those are your terms you will use to search later
b <- labels(a)
length(b)

# [1] 16000

head(b)

# [1] "say"  "can"  "much" "will" "good" "get" 

# featnames() is the supported accessor (avoids reaching into the S4 slot)
train.tokens.dfm <- train.tokens.dfm[, which(featnames(train.tokens.dfm) %in% b)]
dim(train.tokens.dfm)

# [1] 758917  16000
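As a shorter equivalent, quanteda can do this subsetting directly with dfm_select; a sketch, assuming the vector b of top-feature names from above:

# keep only the features whose names match b exactly
train.tokens.dfm <- dfm_select(train.tokens.dfm, pattern = b, selection = "keep", valuetype = "fixed")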

This is not the shortest answer, but it works very well.

Now you can convert the dfm into the dtm used by the tm package.

dtm <- convert(train.tokens.dfm, to = "tm", docvars = NULL)
class(dtm)

# [1] "DocumentTermMatrix"    "simple_triplet_matrix"

From here you can use the tm package to convert the dtm into a term-document matrix, which seems to be what you need.

tdm <- as.TermDocumentMatrix(dtm)
class(tdm)

# [1] "TermDocumentMatrix"    "simple_triplet_matrix"

dim(tdm)

# [1]  16000 758917

From this point on, you should be able to build the adjacency visualization.
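To close the loop, a minimal sketch of that last step, assuming the tm-style tdm created above. It first cuts the matrix down to the 50 most frequent terms (a 16K-term adjacency matrix is far too large to draw) and then plots the graph with igraph; the variable names are illustrative:

library(slam)     # helpers for tm's sparse simple_triplet_matrix
library(igraph)

# keep only the 50 most frequent terms before building the adjacency matrix
top50 <- names(tail(sort(row_sums(tdm)), 50))
sub   <- tdm[top50, ]

# binarize in sparse form: 1 if the term occurs in a document, regardless of count
sub$v <- rep(1, length(sub$v))

# term-term co-occurrence: in how many documents does each pair of terms appear together?
adj <- crossprod_simple_triplet_matrix(t(sub))

# build and plot the term-term adjacency graph
g <- graph_from_adjacency_matrix(adj, weighted = TRUE, mode = "undirected", diag = FALSE)
plot(g, vertex.size = 5, vertex.label.cex = 0.7)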