How can I drop low-correlated features?

Question

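I am writing preprocessing code for training my LSTM. My CSV contains more than 30 variables. After applying some EDA techniques, I found that half of the features can be dropped because they have no effect on training.

Right now I am dropping these features manually with pandas.

I want to write code that drops these features automatically. I wrote some code to visualize a heatmap and the correlations like this: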

# This method is part of a preprocessing class.
# self.data is a DataFrame that contains all of the CSV data.
# Assumes: import matplotlib.pyplot as plt, import seaborn as sns, from scipy import stats

def calculateCorrelationByPearson(self):
    columns = self.data.columns
    plt.figure(figsize=(12, 8))
    # Heatmap of the pairwise Pearson correlations
    sns.heatmap(data=self.data.corr(method='pearson'), annot=True, fmt='.2f',
                linewidths=0.5, cmap='Blues')
    plt.show()
    # Spearman correlation of each column with the target column 'total'
    for column in columns:
        corr = stats.spearmanr(self.data['total'], self.data[column])
        print(f'{column} - corr coefficient:{corr[0]}, p-value:{corr[1]}')

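This gives me a clear picture of my features and how they relate to each other.

Now I want to drop the unimportant columns, say those with a correlation below 0.4.

How can I apply this logic in my code?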

Answer 1

Here is one way to drop the variables whose correlation coefficient with the target is below a threshold:

import pandas as pd
from scipy.stats import spearmanr

data = pd.DataFrame([{"A":1, "B":2, "C":3},{"A":2, "B":3, "C":1},{"A":3, "B":4, "C":0},{"A":4, "B":4, "C":1},{"A":5, "B":6, "C":2}])
targetVar = "A"
corr_threshold = 0.4

corr = spearmanr(data)  # corr[0] is the full Spearman correlation matrix
target_idx = data.columns.get_loc(targetVar)  # position of the target column in that matrix
corrSeries = pd.Series(corr[0][:, target_idx], index=data.columns)  # column names mapped to their correlation with the target
# Apply the threshold (this compares the signed coefficient, so negative correlations are dropped)
corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries > corr_threshold)]

vars_to_keep = list(corrSeries.index.values)  # list of variables to keep
vars_to_keep.append(targetVar)  # add the target variable back in
data2 = data[vars_to_keep]
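
The snippet above compares the signed coefficient against the threshold, so a strongly negative correlation would also be dropped. If that is not what you want, one possible variation is to threshold the absolute value instead. The sketch below is only an illustration under a few assumptions: the helper name drop_low_correlated and the extra column "D" are made up for this example, only numeric columns are tested, and pandas' own DataFrame.corr is used in place of scipy's spearmanr.

```python
import pandas as pd


def drop_low_correlated(data, target, threshold=0.4, method="spearman"):
    """Hypothetical helper: drop numeric columns whose absolute
    correlation with `target` is below `threshold`."""
    # Correlate every numeric column with the target column.
    numeric = data.select_dtypes(include="number")
    corr_with_target = numeric.corr(method=method)[target].drop(target)

    # Use the absolute value so strong negative correlations are kept too.
    # A NaN coefficient (e.g. from a constant column) fails the >= test
    # and is therefore treated as weak.
    weak = corr_with_target[~(corr_with_target.abs() >= threshold)].index

    # Non-numeric columns are never tested, so they are left in place.
    return data.drop(columns=weak)


# Same sample data as above, plus an assumed column "D" that is
# perfectly negatively correlated with "A".
data = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 3, 4, 4, 6],
    "C": [3, 1, 0, 1, 2],
    "D": [5, 4, 3, 2, 1],
})
reduced = drop_low_correlated(data, target="A", threshold=0.4)
print(list(reduced.columns))
```

With this data, "B" and "D" are kept and "C" is dropped. Inside the class from the question, the same call could presumably be made as drop_low_correlated(self.data, 'total'). Whether 0.4 is a sensible cutoff depends on your data.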

