Multiple co-occurrence clusters around a single term
I have a corpus in which a key term occurs at least once. From it I built an fcm that looks much like this:
txts <- c("a a a b b c", "a a c e", "a c b e f g", "e d j b", "b g k l", "b a a g l", "e c b j k l", "b g w m")
total <- fcm(txts, context = "document", count = "frequency")
Feature co-occurrence matrix of: 12 by 12 features.
12 x 12 sparse Matrix of class "fcm"
features
features a b c e f g d j k l w m
a 5 9 6 3 1 3 0 0 0 2 0 0
b 0 1 4 3 1 4 1 2 2 3 1 1
c 0 0 0 3 1 1 0 1 1 1 0 0
e 0 0 0 0 1 1 1 2 1 1 0 0
f 0 0 0 0 0 1 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 1 2 1 1
d 0 0 0 0 0 0 0 1 0 0 0 0
j 0 0 0 0 0 0 0 0 1 1 0 0
k 0 0 0 0 0 0 0 0 0 2 0 0
l 0 0 0 0 0 0 0 0 0 0 0 0
w 0 0 0 0 0 0 0 0 0 0 0 1
m 0 0 0 0 0 0 0 0 0 0 0 0
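From this, I want to find the distinct clusters around 'b'.
With an eye on scaling: my actual fcm has 239,104,369 elements and is 1.2 GB in size.
The matrix for the first 10 features looks like this: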
Feature co-occurrence matrix of: 10 by 10 features.
10 x 10 sparse Matrix of class "fcm"
features
features international monetary fund development association bolivia assessment interim poverty reduction
international 2885797 1345055 3340282 12013377 857864 199985 605036 202117 3996710 1319199
monetary 0 227329 973979 2326677 234565 39802 93927 65773 884341 330250
fund 0 0 1766657 6530594 621315 99900 355415 204229 2534382 927737
development 0 0 0 20054398 1683896 485906 2235294 406575 13674085 4091506
association 0 0 0 0 122947 25954 87756 47038 580721 204144
bolivia 0 0 0 0 0 26062 35164 5336 254924 71428
assessment 0 0 0 0 0 0 203933 24196 1420850 377398
interim 0 0 0 0 0 0 0 20595 172870 67705
poverty 0 0 0 0 0 0 0 0 9131869 4026961
reduction 0 0 0 0 0 0 0 0 0 642944
My goal is to visualize the clusters around the key term (https://bost.ocks.org/mike/miserables/) and to create term lists from them.
https://www.r-bloggers.com/turning-keywords-into-a-co-occurrence-network/
https://www.r-bloggers.com/collapsing-a-bipartite-co-occurrence-network/
Co occurrence plot in R
In my search I also stumbled upon the cooccurNet package, but I could not figure out how to use it. https://cran.r-project.org/web/packages/cooccurNet/index.html
quanteda has textstat_simil(), which returns a dist object for hierarchical clustering. This function only accepts a DFM, but you can convert an FCM to one with as.dfm().
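library(quanteda)
# Note: in quanteda 3 and later, textstat_simil() is provided by the quanteda.textstats package.
txt <- c("a a a b b c", "a a c e", "a c b e f g", "e d j b", "b g k l", "b a a g l", "e c b j k l", "b g w m")
dmt <- dfm(tokens(txt)) # tokenize first; calling dfm() on a raw character vector is deprecated in quanteda 3
# dmt <- dfm_trim(dmt, min_termfreq = 10) # you might need this to reduce the size of fcm
fmt <- fcm(dmt, context = "document")
# Feature-by-feature similarity, computed on the FCM treated as a DFM.
sim <- textstat_simil(as.dfm(fmt), margin = "features")
# hclust() expects a dist object; if your version returns a similarity object instead, convert it to a distance first.
tree <- hclust(sim)
cutree(tree, 2) # cluster membership of each feature when the tree is cut into 2 groups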
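To get from cluster memberships to the term lists around 'b' that the question asks for, the idea can be sketched in base R alone. This is only an illustration under stated assumptions, not the quanteda workflow itself: the count matrix and co-occurrence matrix are hand-built stand-ins for dfm() and fcm(), the names `key`, `neighbours`, `profiles` and `term_lists` are made up for this example, and cosine distance, average linkage and `k = 3` are arbitrary choices you would want to tune.

```r
# Toy documents from the question.
txts <- c("a a a b b c", "a a c e", "a c b e f g", "e d j b",
          "b g k l", "b a a g l", "e c b j k l", "b g w m")

# Document-term count matrix built with base R (stand-in for dfm()).
toks  <- strsplit(txts, " ", fixed = TRUE)
terms <- unique(unlist(toks))
dtm <- t(vapply(toks,
                function(x) tabulate(match(x, terms), nbins = length(terms)),
                numeric(length(terms))))
colnames(dtm) <- terms

# Symmetric document-level co-occurrence counts (stand-in for fcm()).
cooc <- crossprod(dtm)
diag(cooc) <- 0

# Keep only the terms that co-occur with the key term at least once.
key <- "b"  # hypothetical name for the key term
neighbours <- setdiff(names(which(cooc[key, ] > 0)), key)

# Describe each neighbour by its co-occurrence with everything except the key term,
# so that clusters reflect structure around 'b' rather than 'b' itself.
profiles <- cooc[neighbours, setdiff(terms, key), drop = FALSE]

# Cosine similarity between profiles, turned into a distance for hclust().
norms  <- sqrt(rowSums(profiles^2))
cosine <- (profiles %*% t(profiles)) / (norms %o% norms)
d <- as.dist(1 - cosine)

# Cluster the neighbours and split them into one term list per cluster.
tree <- hclust(d, method = "average")
clusters <- cutree(tree, k = 3)
term_lists <- split(names(clusters), clusters)
print(term_lists)
```

Note that this builds dense matrices, which is fine for the toy data but would not fit an fcm with hundreds of millions of elements; at that scale you would subset to the neighbours of the key term (and trim rare features) before computing any similarities. A neighbour whose profile is all zeros would also produce NaN in the cosine step and should be dropped first.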