List all words in a corpus that reject the null hypothesis under a chi-squared test

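I have a script that lists the top n words (the words with the highest chi-squared values). However, instead of extracting a fixed number n of words, I want to extract all words whose p-value is below 0.05, i.e. the words that reject the null hypothesis.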

Here is my code:

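from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label
chi2score = chi2(X_tfidf, y)[0]
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
scores = list(zip(tfidf.get_feature_names_out(), chi2score))
# sorted ascending by chi-squared value; named so it does not shadow the chi2 function
sorted_scores = sorted(scores, key=lambda x: x[1])
allchi2 = list(zip(*sorted_scores))

# list the top 20 words
allchi2 = allchi2[0][-20:]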

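So in this case, instead of listing the top 20 words, I want all the words that reject the null hypothesis, i.e. all the words in the reviews that depend on the sentiment class (positive or negative).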

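chi2 returns two arrays, the chi-squared statistics and the corresponding p-values, so you can filter on the p-values directly:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label
chi2_score, pval_score = chi2(X_tfidf, y)
# keep only (feature, p-value) pairs with p < 0.05
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
feature_pval_items = filter(lambda x: x[1] < 0.05, zip(tfidf.get_feature_names_out(), pval_score))
# sort the surviving features from most to least significant
you_want_feature_pval_items = sorted(feature_pval_items, key=lambda x: x[1])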