How is chi-squared association / keyness calculated in quanteda?

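I am trying to understand the chi-squared calculation behind the association (or keyness) of keywords in a target group versus a reference group.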

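# textstat_keyness() is part of quanteda itself before version 3.0;
# from 3.0 onward it lives in the separate quanteda.textstats package.
library(quanteda)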
pres_corpus <- corpus_subset(data_corpus_inaugural, President %in% c("Obama", "Trump"))

# Remove Punctuation and Numbers
tokensAll <- tokens(pres_corpus, remove_punct = TRUE, remove_numbers= TRUE)

# Removing stopwords before constructing bigrams
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))

# Bigram
tokensNgramsNoStopwords <- tokens_ngrams(tokensNoStopwords,  n=2, concatenator = "_")
dtm = dfm(tokensNgramsNoStopwords, tolower = TRUE, groups = "President")

# Calculate keyness with Obama as the target group; show the first feature
(result_keyness <- textstat_keyness(dtm, target = "Obama"))[1, ]

The manual calculation of what textstat_keyness() returns is as follows:

# Number of words
sums <- rowSums(dtm)

# frequency of target
a = as.numeric(dtm[1,1])

# frequency of reference
b = as.numeric(dtm[2,1])

# total of all target words minus freq. of target
c = sums[1] - a

# total of all reference words minus freq. of reference
d = sums[2] - b

N = (a+b+c+d)
E = (a+b)*(a+c) / N
(N * abs(a*d - b*c)^2) / ((a+b)*(c+d)*(a+c)*(b+d)) * ifelse(a > E, 1, -1)
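
The manual calculation above can be wrapped in a small function so it can be checked without quanteda. This is a minimal sketch: the name signed_chi2 and the counts passed to it are hypothetical, chosen only for illustration, and it applies no continuity correction.

```r
# Signed chi-squared for a 2x2 table, with no continuity correction.
# a: count of the feature in the target group
# b: count of the feature in the reference group
# c: all other counts in the target group
# d: all other counts in the reference group
signed_chi2 <- function(a, b, c, d) {
  N <- a + b + c + d
  E <- (a + b) * (a + c) / N   # expected count for cell a under independence
  chi2 <- N * (a * d - b * c)^2 / ((a + b) * (c + d) * (a + c) * (b + d))
  # Positive when the feature is over-represented in the target group
  if (a > E) chi2 else -chi2
}

# Hypothetical counts, not taken from the inaugural corpus
signed_chi2(a = 20, b = 5, c = 980, d = 995)
```

Swapping the target and reference counts flips the sign but leaves the magnitude unchanged.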

This matches the score returned by textstat_keyness(). However, it does not match if I use chisq.test():

(tt = as.table(rbind(c(a, b), c(c, d))))
suppressWarnings(chi <- stats::chisq.test(tt))
(t_exp <- chi$expected[1,1])
(chi2 = unname(chi$statistic) * ifelse(tt[1, 1] > t_exp, 1, -1))

The difference is Yates's correction for the 2x2 chi-squared test. chisq.test() applies the correction by default, whereas your manual calculation does not apply it.
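
In the question's notation (cells a, b, c, d and N = a + b + c + d), the two statistics can be sketched as follows. The corrected form is written the way chisq.test() computes it for a 2x2 table, where the adjustment is capped so that it never exceeds |ad - bc|.

```latex
\chi^2 = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
\qquad
\chi^2_{\text{Yates}} = \frac{N\,\bigl(|ad - bc| - \min(N/2,\ |ad - bc|)\bigr)^2}{(a+b)(c+d)(a+c)(b+d)}
```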

So:

textstat_keyness(dtm, target = "Obama")[1, ]
##           feature     chi2        p n_target n_reference
## 1 fellow_citizens 0.647129 0.421141        2           0

And without the correction:

chisq.test(tt, correct = FALSE)

##  Pearson's Chi-squared test
## 
## data:  tt
## X-squared = 0.64713, df = 1, p-value = 0.4211
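
To see the size of the correction in isolation, the comparison can be reproduced in base R on a small table. This is a sketch under stated assumptions: the counts are hypothetical (not from the inaugural corpus), and the manual Yates step mirrors what chisq.test() does for a 2x2 table, namely shrinking each |observed - expected| by at most 0.5.

```r
# Hypothetical 2x2 table, with rows (a, b) and (c, d) as in the question
tt <- matrix(c(20, 5, 980, 995), nrow = 2, byrow = TRUE)

uncorrected <- chisq.test(tt, correct = FALSE)$statistic
corrected   <- chisq.test(tt)$statistic   # correct = TRUE is the default

# Expected counts under independence
E <- outer(rowSums(tt), colSums(tt)) / sum(tt)

# Manual versions of both statistics
yates <- min(0.5, abs(tt - E))
manual_uncorrected <- sum((tt - E)^2 / E)
manual_corrected   <- sum((abs(tt - E) - yates)^2 / E)

all.equal(unname(uncorrected), manual_uncorrected)
all.equal(unname(corrected), manual_corrected)
```

Both all.equal() calls should return TRUE, and the corrected statistic is the smaller of the two.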