Feature selection in a document-feature matrix using the chi-squared test

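I am doing text mining using natural language processing. I use the quanteda package to generate a document-feature matrix (dfm). Now I want to do feature selection using a chi-squared test. I know many people have already asked this question, but I could not find any relevant code. (The answers only give a brief concept, like this one: https://stats.stackexchange.com/questions/93101/how-can-i-perform-a-chi-square-test-to-do-feature-selection-in-r)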

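I learned that I could use chi.squared in the FSelector package, but I do not know how to apply this function to a dfm class object (trainingtfidf below). (The manual shows that it applies to predictor variables.)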

Could anyone give me a hint? I would appreciate it!

Example code:

description <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",  "M6 is 13 days out of the visit window")
code <- c(4,3,6)
example <- data.frame(description, code)

library(quanteda)
trainingcorpus <- corpus(example$description)

trainingdfm <- dfm(trainingcorpus, verbose = TRUE, stem=TRUE, toLower=TRUE, removePunct= TRUE, removeSeparators=TRUE, language="english", ignoredFeatures = stopwords("english"), removeNumbers=TRUE, ngrams = 2)

# tf-idf
trainingtfidf <- tfidf(trainingdfm, normalize=TRUE)

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

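Here is a general method for computing chi-squared values for features. It requires some variable against which to form the associations; here that could be the classification variable you are using to train your classifier.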
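To make the idea concrete, here is a minimal base-R sketch on a toy count matrix. The matrix `counts`, the class vector `label`, and all of the values are hypothetical. Each feature's counts are cross-tabulated against the class variable with `chisq.test()`. With so few documents the chi-squared approximation is unreliable, so the warnings are suppressed and the scores should be read only as an illustration of the ranking.

```r
# Toy document-feature counts (hypothetical data): 6 documents, 3 features
counts <- matrix(
  c(2, 0, 1,
    3, 0, 1,
    2, 1, 0,
    0, 2, 1,
    0, 3, 0,
    0, 2, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("doc", 1:6), c("tax", "war", "people"))
)

# Class of each document (hypothetical labels)
label <- factor(c("A", "A", "A", "B", "B", "B"))

# One chi-squared statistic per column: chisq.test(x, y) converts both
# vectors to factors and tests the resulting contingency table
chi2 <- apply(counts, 2, function(x) {
  suppressWarnings(chisq.test(x, label)$statistic)
})

# Features ordered from most to least associated with the class
sort(chi2, decreasing = TRUE)
```

In this toy data "tax" and "war" separate the two classes, while "people" is spread evenly across them, so it receives the lowest score.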

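Note that I am showing how to do this in the quanteda package, but the approach should be general enough to work for matrix objects from other text packages. Here, I am using data from the auxiliary quanteda.corpora package (formerly quantedaData), which contains all of the State of the Union addresses of US presidents.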

data(data_corpus_sotu, package = "quanteda.corpora")
table(docvars(data_corpus_sotu, "party"))
## Democratic Democratic-Republican            Federalist           Independent 
##         90                    28                     4                     8 
## Republican                  Whig 
##         9                     8 
sotuDemRep <- corpus_subset(data_corpus_sotu, party %in% c("Democratic", "Republican"))

# make the document-feature matrix for just Reps and Dems
sotuDfm <- dfm(sotuDemRep, remove = stopwords("english"))

# compute chi-squared values for each feature
chi2vals <- apply(sotuDfm, 2, function(x) { 
    chisq.test(as.numeric(x), docvars(sotuDemRep, "party"))$statistic
})

head(sort(chi2vals, decreasing = TRUE), 10)
## government       will     united     states       year     public   congress       upon 
##   85.19783   74.55845   68.62642   66.57434   64.30859   63.19322   59.49949   57.83603 
##        war     people 
##   57.43142   57.38697 

These can now be selected using the dfm_select() command. (Note that indexing the columns by name would also work.)

# select just 100 top Chi^2 vals from dfm
dfmTop100cs <- dfm_select(sotuDfm, names(head(sort(chi2vals, decreasing = TRUE), 100)))
## kept 100 features, from 100 supplied (glob) feature types

head(dfmTop100cs)
## Document-feature matrix of: 182 documents, 100 features.
## (showing first 6 documents and first 6 features)
##               features
## docs           citizens government upon duties constitution present
##   Jackson-1830       14         68   67     12           17      23
##   Jackson-1831       21         26   13      7            5      22
##   Jackson-1832       17         36   23     11           11      18
##   Jackson-1829       17         58   37     16            7      17
##   Jackson-1833       14         43   27     18            1      17
##   Jackson-1834       24         74   67     11           11      29
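
As a small illustration of the note about indexing columns by name, here is a sketch on a plain base-R matrix. The matrix `counts` and the `scores` vector are hypothetical stand-ins for a dfm and its chi-squared values; only base R is assumed.

```r
# Toy count matrix and per-feature scores (both hypothetical)
counts <- matrix(
  c(2, 0, 1, 4,
    0, 3, 1, 0,
    1, 2, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("tax", "war", "people", "bank"))
)
scores <- c(tax = 4.2, war = 9.1, people = 0.3, bank = 7.5)

# Names of the 2 highest-scoring features
top2 <- names(head(sort(scores, decreasing = TRUE), 2))

# Select those columns by name; drop = FALSE keeps the matrix shape
selected <- counts[, top2, drop = FALSE]
selected
```

The columns come back in score order rather than in their original order, which is usually harmless for feature selection.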

Added: With quanteda >= v0.9.9, this can be done using the textstat_keyness() function.

# to avoid empty factors
docvars(data_corpus_sotu, "party") <- as.character(docvars(data_corpus_sotu, "party"))

# make the document-feature matrix for just Reps and Dems
sotuDfm <- data_corpus_sotu %>%
    corpus_subset(party %in% c("Democratic", "Republican")) %>%
    dfm(remove = stopwords("english"))

chi2vals <- dfm_group(sotuDfm, "party") %>%
    textstat_keyness(measure = "chi2")
head(chi2vals)
#   feature     chi2 p n_target n_reference
# 1       - 221.6249 0     2418        1645
# 2  mexico 181.0586 0      505         182
# 3    bank 164.9412 0      283          60
# 4       " 148.6333 0     1265         800
# 5 million 132.3267 0      366         131
# 6   texas 101.1991 0      174          37

This information can then be used to select the most discriminating features, after the sign of the chi^2 score is removed.

# remove sign
chi2vals$chi2 <- abs(chi2vals$chi2)
# sort
chi2vals <- chi2vals[order(chi2vals$chi2, decreasing = TRUE), ]
head(chi2vals)
#          feature     chi2 p n_target n_reference
# 1              - 221.6249 0     2418        1645
# 29044 commission 190.3010 0      175         588
# 2         mexico 181.0586 0      505         182
# 3           bank 164.9412 0      283          60
# 4              " 148.6333 0     1265         800
# 29043        law 137.8330 0      607        1178


# keep only the 100 features with the largest absolute chi^2 values
dfmTop100cs <- dfm_select(sotuDfm, head(chi2vals$feature, 100))
## kept 100 features, from 100 supplied (glob) feature types

head(dfmTop100cs, nf = 6)
Document-feature matrix of: 6 documents, 6 features (0% sparse).
6 x 6 sparse Matrix of class "dfm"
              features
docs           fellow citizens senate house representatives :
  Jackson-1829      5       17      2     3               5 1
  Jackson-1830      6       14      4     6               9 3
  Jackson-1831      9       21      3     1               4 1
  Jackson-1832      6       17      4     1               2 1
  Jackson-1833      2       14      7     4               6 1
  Jackson-1834      3       24      5     1               3 5
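
To see why the sign has to be removed before sorting, here is a rough base-R sketch of a signed chi-squared score for a single feature. This is an illustration of the idea only, not quanteda's implementation: the helper `signed_chi2()` and the group totals of 100000 tokens are hypothetical, and no continuity correction is applied. The feature counts are the n_target and n_reference values shown above for "mexico" and "commission".

```r
# Signed chi-squared for one feature from a 2 x 2 table:
# rows = this feature vs. all other tokens, columns = target vs. reference.
# The sign is positive when the feature is over-represented in the target.
signed_chi2 <- function(n_target, n_reference, total_target, total_reference) {
  tab <- matrix(
    c(n_target, n_reference,
      total_target - n_target, total_reference - n_reference),
    nrow = 2, byrow = TRUE
  )
  test <- chisq.test(tab, correct = FALSE)
  direction <- sign(tab[1, 1] - test$expected[1, 1])
  unname(test$statistic) * direction
}

# Hypothetical group sizes (total tokens per group)
over  <- signed_chi2(505, 182, 100000, 100000)  # more frequent in target
under <- signed_chi2(175, 588, 100000, 100000)  # more frequent in reference

c(over = over, under = under)
```

A feature that is strongly typical of the reference group gets a large negative score, so sorting the signed values in decreasing order would push it to the bottom; taking abs() first ranks both directions by strength of association.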