Receive the word cluster based on words and not for every row

Question

I tried this approach

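library(quanteda)
# as of quanteda v3, the textstat_*() functions live in quanteda.textstats
library(quanteda.textstats)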

dataset1 <- data.frame( anumber = c(1,2,3), text = c("Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.","It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum", "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source."))

myDfm <- dataset1 %>%
  corpus() %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  dfm() %>%
  dfm_trim(min_termfreq = 1)
        
tstat_dist <- textstat_simil(myDfm, method = "cosine")

# hierarchical clustering of the distance object
pres_cluster <- hclust(as.dist(tstat_dist))
# label with document names
pres_cluster$labels <- docnames(myDfm)
# plot as a dendrogram
plot(pres_cluster, xlab = "", sub = "", main = "Cosine Distance on Token Frequency")

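to extract clusters of words, but in the final plot I get the document names, that is, one label for each row I have. Is there any change I can make so that the clusters show the words of the text instead of the document names?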

I would like to see words like these:

textstat_frequency(myDfm, n = 5)

  feature frequency rank docfreq group
1     the        10    1       3   all
2      of         7    2       3   all
3   lorem         6    3       3   all
4   ipsum         6    3       3   all
5       a         5    5       2   all

Answer 1

Yes: you need the margin = "features" argument when computing the distance. (And you can drop the label assignment.) So the last part of your code should be:

# compute the distance on features, not documents
tstat_dist <- textstat_simil(myDfm, method = "cosine", margin = "features")
# hierarchical clustering of the distance object
pres_cluster <- hclust(as.dist(tstat_dist))
# plot as a dendrogram
plot(pres_cluster, xlab = "", sub = "", main = "Cosine Distance on Token Frequency")

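However, for hierarchical clustering you should compute a distance measure rather than cosine similarity. hclust() treats its input as dissimilarities and merges the pairs with the smallest values first, so feeding it similarities groups the least similar words together.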
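
To make the similarity-versus-distance point concrete, here is a minimal base R sketch. It assumes a made-up 3 x 4 count matrix in place of the real dfm, and the names m, cos_sim, cos_dist and word_cluster are purely illustrative. It converts cosine similarity between words into a dissimilarity with 1 - similarity before clustering; this is one common conversion, not necessarily the one the answer had in mind.

```r
# Toy document-feature matrix: 3 documents (rows) x 4 words (columns).
# The counts are invented for illustration only.
m <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 1,
    0, 0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("lorem", "ipsum", "latin", "text"))
)

# Cosine similarity between columns (words), not rows (documents):
# dot products of the columns divided by the product of their norms.
col_norms <- sqrt(colSums(m^2))
cos_sim <- crossprod(m) / outer(col_norms, col_norms)

# Turn similarity into a dissimilarity so that similar words are "close".
cos_dist <- as.dist(1 - cos_sim)

# Cluster the words; the labels are taken from the column names.
word_cluster <- hclust(cos_dist)
plot(word_cluster, xlab = "", sub = "", main = "1 - Cosine Similarity on Token Frequency")
```

With quanteda itself, an alternative along the same lines would be to call textstat_dist() with margin = "features" and pass the result through as.dist(), which yields a distance (Euclidean by default) directly.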

r

quanteda