Why are results different for the column vs. the row of a Quanteda frequency co-occurrence matrix?
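I'm trying to use Quanteda to count how many times different terms co-occur with a specific term (for example Vietnam, or "越南") within a quarter.
But when I select the column or the row for that term from the frequency co-occurrence matrix, the counts are different.
Can anyone tell me why this happens, or what I'm doing wrong? I'm worried that my analysis based on these results is incorrect.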
##Producing the FCM
> corp <- corpus(data_SCS14q4)
> toks <- tokens(corp, remove_punct = TRUE) %>% tokens_remove(ch_stop) %>% tokens_compound(phrase("东 盟"), concatenator = "")
> fcm_14q4 <- fcm(toks, context = "window")
##taking the row for Vietnam or "越南":
mt <- fcm_14q4["越南",]
> head(mt)
Feature co-occurrence matrix of: 1 by 6 features.
features
features 印 司令 中国 2050 收复 台湾
越南 0 0 0 0 0 0
##Taking the column for Vietnam or "越南":
> mt2 <- fcm_14q4[,"越南"]
> head(mt2)
Feature co-occurrence matrix of: 6 by 1 feature.
features
features 越南
印 0
司令 0
中国 68
2050 0
收复 8
台湾 4
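This is because, by default, fcm() returns only the upper triangle of the symmetric co-occurrence matrix (it is symmetric when ordered = FALSE). In your output, the features shown come before "越南" in the feature order, so their counts appear in the "越南" column, while the matching cells in the "越南" row fall in the lower triangle and are left as zero. To make the two index slices equal, you need to specify tri = FALSE.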
library("quanteda")
## Package version: 3.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
toks <- tokens(c("a a a b b c", "a a c e", "a c e f g"))
# default is only upper triangle
fcm(toks, context = "window", window = 2, tri = TRUE)
## Feature co-occurrence matrix of: 6 by 6 features.
## features
## features a b c e f g
## a 8 3 3 2 0 0
## b 0 2 2 0 0 0
## c 0 0 0 2 1 0
## e 0 0 0 0 1 1
## f 0 0 0 0 0 1
## g 0 0 0 0 0 0
Setting tri = FALSE makes the matrix symmetric, in which case the two index slices are identical:
fcmat2 <- fcm(toks, context = "window", window = 2, tri = FALSE)
fcmat2
## Feature co-occurrence matrix of: 6 by 6 features.
## features
## features a b c e f g
## a 8 3 3 2 0 0
## b 3 2 2 0 0 0
## c 3 2 0 2 1 0
## e 2 0 2 0 1 1
## f 0 0 1 1 0 1
## g 0 0 0 1 1 0
fcmat2[, "a"]
## Feature co-occurrence matrix of: 6 by 1 features.
## features
## features a
## a 8
## b 3
## c 3
## e 2
## f 0
## g 0
t(fcmat2["a", ])
## Feature co-occurrence matrix of: 6 by 1 features.
## features
## features a
## a 8
## b 3
## c 3
## e 2
## f 0
## g 0
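To see how the two outputs above relate, here is a minimal sketch in base R that does not call quanteda at all. It assumes the upper-triangular counts are copied by hand from the tri = TRUE output into a plain matrix; the names `feats`, `upper` and `full` are made up for this illustration. Mirroring the upper triangle across the diagonal reproduces the tri = FALSE matrix.

```r
# Upper-triangular counts copied by hand from the tri = TRUE output above
# (a plain base R matrix, not an actual fcm object).
feats <- c("a", "b", "c", "e", "f", "g")
upper <- matrix(
  c(8, 3, 3, 2, 0, 0,
    0, 2, 2, 0, 0, 0,
    0, 0, 0, 2, 1, 0,
    0, 0, 0, 0, 1, 1,
    0, 0, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(feats, feats)
)

# Mirror the upper triangle into the lower one; subtract the diagonal once
# so that the diagonal counts are not doubled.
full <- upper + t(upper) - diag(diag(upper))

# The row slice and the column slice for "a" now hold the same counts.
print(full)
print(identical(full["a", ], full[, "a"]))
```

This is only meant to show why the row and column slices differ under the default. For real data, rebuilding the matrix with tri = FALSE as shown above is the simpler route.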