Feature selection when both the independent variables and the target variable are categorical

I am showing a small sample of the dataset I am working with. My original dataset has 400 columns of 'Symptoms' and 1 column of 'Disease'. From here, the expected output is to find the top 'N' (maybe 10 or some other number) 'Symptoms' that are most significant for a particular disease. My sample dataset is as follows:

fever    headache    sore throat          drowsiness               Disease
    0        0         1                   0                      Fungal infection
    0        0         0                   1                      Fungal infection
    0        1         0                   0                      liver infection
    1        0         0                   1                      diarrhoea
    0        0         1                   1                      common cold
    0        1         1                   0                      diarrhoea
    1        0         0                   0                      flu
    

I have tried sklearn's SelectKBest but could not make sense of the results. I would also like to know whether pandas' DataFrame.corr function would work in this case.

One way to approach this problem is with a naive Bayes classifier whose feature probabilities are modeled as Bernoulli distributions. This assumes the independent variables are not categorical, as you describe them in your question, but simply binary. I think that is the more reasonable assumption, and in my view it follows from the construction of your input data, where the inputs appear to be binary.
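For context on what the fitted model's numbers mean: with sklearn's default alpha=1 (Laplace smoothing, an assumption here), BernoulliNB estimates P(symptom = 1 | disease) as (count + 1) / (n + 2), where n is the number of samples of that disease. A minimal hand check:

```python
import numpy as np

# Smoothed Bernoulli estimate, assuming sklearn's default alpha=1
# (Laplace smoothing): P(x = 1 | class) = (count + alpha) / (n + 2 * alpha).
def bernoulli_prob(feature_values, alpha=1.0):
    count = np.sum(feature_values)
    n = len(feature_values)
    return (count + alpha) / (n + 2 * alpha)

# The two 'Fungal infection' rows in the sample data have sorethroat = [1, 0]:
p = bernoulli_prob(np.array([1, 0]))
print(p, np.log(p))  # 0.5 -0.6931471805599453
```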

A first pass at the model could be the following (the important_features function is adapted from this answer):

import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

def important_features(classifier, feature_names, n=20):
    class_labels = classifier.classes_

    # feature_log_prob_ has one row per class, so iterate over the classes.
    for i, class_label in enumerate(class_labels):
        print("Important features in ", class_label)
        topn_class = sorted(zip(classifier.feature_log_prob_[i], feature_names),
                            reverse=True)[:n]

        for coef, feat in topn_class:
            print(coef, feat)
        print('-----------------------')

d = {}
d['fever'] = np.array([0,0,0,1,0,0,1])
d['headache'] = np.array([0,0,1,0,0,1,0])
d['sorethroat'] = np.array([1,0,0,0,1,1,0])
d['drowsiness'] = np.array([0,1,0,1,1,0,0])
d['disease'] = ['Fungal infection','Fungal infection','liver infection',
           'diarrhoea','common cold','diarrhoea','flu']

df = pd.DataFrame(d)

X = df[df.columns[:-1]]
y = df['disease']

clf = BernoulliNB()
clf.fit(X, y)

important_features(clf, df.columns[:-1])

This should give you the following output, which of course is only for demonstration purposes, since I only used the data you provided above:

Important features in  Fungal infection
-0.6931471805599453 sorethroat
-0.6931471805599453 drowsiness
-1.3862943611198906 headache
-1.3862943611198906 fever
-----------------------
Important features in  common cold
-0.4054651081081645 sorethroat
-0.4054651081081645 drowsiness
-1.0986122886681098 headache
-1.0986122886681098 fever
-----------------------
Important features in  diarrhoea
-0.6931471805599453 sorethroat
-0.6931471805599453 headache
-0.6931471805599453 fever
-0.6931471805599453 drowsiness
-----------------------
Important features in  flu
-0.4054651081081645 fever
-1.0986122886681098 sorethroat
-1.0986122886681098 headache
-1.0986122886681098 drowsiness
-----------------------
Important features in  liver infection
-0.4054651081081645 headache
-1.0986122886681098 sorethroat
-1.0986122886681098 fever
-1.0986122886681098 drowsiness
-----------------------
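The coefficients printed above are natural logs of the smoothed P(symptom = 1 | disease), so exponentiating feature_log_prob_ gives a table that may be easier to read. A self-contained sketch on the same toy data:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

d = {'fever':      [0, 0, 0, 1, 0, 0, 1],
     'headache':   [0, 0, 1, 0, 0, 1, 0],
     'sorethroat': [1, 0, 0, 0, 1, 1, 0],
     'drowsiness': [0, 1, 0, 1, 1, 0, 0],
     'disease': ['Fungal infection', 'Fungal infection', 'liver infection',
                 'diarrhoea', 'common cold', 'diarrhoea', 'flu']}
df = pd.DataFrame(d)

clf = BernoulliNB().fit(df[df.columns[:-1]], df['disease'])

# Rows are diseases, columns are symptoms; values are the smoothed
# P(symptom = 1 | disease) recovered by exponentiating the log-probabilities.
probs = pd.DataFrame(np.exp(clf.feature_log_prob_),
                     index=clf.classes_, columns=df.columns[:-1])
print(probs.round(3))
```

For example, exp(-0.693...) is 0.5, the fungal-infection probability of a sore throat shown above.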

Naive Bayes, of course, does not account for correlations between the independent variables, e.g. someone may be more prone to headaches if they have a fever anyway, regardless of the underlying disease. If this limitation is not a problem for you, you can go ahead and run the model on all of your data. Note that training a more general model that estimates all possible correlations in the data can be genuinely hard.

Finally, note that pandas' corr method will give you the correlations between the independent variables, but it has no direct bearing on a model that predicts the disease from those inputs.
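Regarding SelectKBest: it can also rank the symptoms, e.g. with the chi-squared score, but it scores each symptom against the disease labels as a whole rather than per disease, which may be why its results were hard to interpret. A minimal sketch on the same toy data (k=2 is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

d = {'fever':      [0, 0, 0, 1, 0, 0, 1],
     'headache':   [0, 0, 1, 0, 0, 1, 0],
     'sorethroat': [1, 0, 0, 0, 1, 1, 0],
     'drowsiness': [0, 1, 0, 1, 1, 0, 0],
     'disease': ['Fungal infection', 'Fungal infection', 'liver infection',
                 'diarrhoea', 'common cold', 'diarrhoea', 'flu']}
df = pd.DataFrame(d)

X = df[df.columns[:-1]]
y = df['disease']

# chi2 tests each binary symptom against the disease labels; higher
# scores mean a stronger overall (not per-disease) association.
selector = SelectKBest(chi2, k=2).fit(X, y)
for name, score in zip(X.columns, selector.scores_):
    print(name, score)
print('selected:', list(X.columns[selector.get_support()]))
```

Unlike the naive Bayes approach above, this gives a single global ranking, not one ranking per disease.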