How to run a multicollinearity test on a pandas dataframe?
I'm fairly new to Python, statistics, and the DS libraries. My requirement is to run a multicollinearity test on a dataset with n columns and make sure that columns/variables with VIF > 5 are dropped entirely.
I found this code:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X, thresh=5.0):
    variables = range(X.shape[1])
    tmp = range(X[variables].shape[1])
    print(tmp)
    dropped = True
    while dropped:
        dropped = False
        vif = [variance_inflation_factor(X[variables].values, ix)
               for ix in range(X[variables].shape[1])]
        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            del variables[maxloc]
            dropped = True
    print('Remaining variables:')
    print(X.columns[variables])
    return X[variables]
However, I don't quite understand it. Should I pass the entire dataset as the X argument? If so, it doesn't work.
Please help!
I ran into a similar problem. I fixed it by changing how variables is defined and finding another way to delete its elements.
The following script should work with Anaconda 5.0.1 and Python 3.6 (the latest versions at the time of writing).
import numpy as np
import pandas as pd
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor
from joblib import Parallel, delayed

# Defining the function that you will run later
def calculate_vif_(X, thresh=5.0):
    variables = [X.columns[i] for i in range(X.shape[1])]
    dropped = True
    while dropped:
        dropped = False
        print(len(variables))
        vif = Parallel(n_jobs=-1, verbose=5)(
            delayed(variance_inflation_factor)(X[variables].values, ix)
            for ix in range(len(variables)))
        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print(time.ctime() + ' dropping \'' + X[variables].columns[maxloc] + '\' at index: ' + str(maxloc))
            variables.pop(maxloc)
            dropped = True
    print('Remaining variables:')
    print([variables])
    return X[[i for i in variables]]

X = df[feature_list]       # Selecting your data
X2 = calculate_vif_(X, 5)  # Actually running the function
If you have many features, this will take a long time to run, so I made another change to let it work in parallel in case you have multiple CPUs available.
Enjoy!
I adjusted the code and managed to achieve the desired result with the following, with some exception handling added -
def multicollinearity_check(X, thresh=5.0):
    int_cols = X.select_dtypes(include=['int', 'int16', 'int32', 'int64',
                                        'float', 'float16', 'float32', 'float64']).shape[1]
    total_cols = X.shape[1]
    try:
        if int_cols != total_cols:
            raise Exception('All the columns should be integer or float, for multicollinearity test.')
        else:
            variables = list(range(X.shape[1]))
            dropped = True
            print('''\n\nThe VIF calculator will now iterate through the features and calculate their respective values.
It shall continue dropping the highest VIF features until all the features have VIF less than the threshold of 5.\n\n''')
            while dropped:
                dropped = False
                vif = [variance_inflation_factor(X.iloc[:, variables].values, ix) for ix in variables]
                print('\n\nvif is: ', vif)
                maxloc = vif.index(max(vif))
                if max(vif) > thresh:
                    print('dropping \'' + X.iloc[:, variables].columns[maxloc] + '\' at index: ' + str(maxloc))
                    X.drop(X.columns[variables[maxloc]], axis=1, inplace=True)
                    variables = list(range(X.shape[1]))
                    dropped = True
            print('\n\nRemaining variables:\n')
            print(X.columns[variables])
            return X
    except Exception as e:
        print('Error caught: ', e)
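The dtype guard at the top of this function can also be written with pandas' generic "number" selector; a minimal sketch of that check, on hypothetical data:

```python
import pandas as pd

# Hypothetical frame with one non-numeric column
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3], "c": ["x", "y", "z"]})

# Count numeric columns and compare against the total column count
numeric_cols = df.select_dtypes(include=["number"]).shape[1]
print(numeric_cols == df.shape[1])  # False: column "c" is object-typed
```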
First of all, thanks to @DanSan for including the idea of parallelization in the multicollinearity computation. I now see at least a 50% improvement in computation time on a multi-dimensional dataset of shape (22500, 71). But I faced one interesting challenge on a dataset I was working on. The dataset contains some categorical columns, which I binary-encoded using Category-Encoders, so some columns ended up with only 1 unique value. For such columns, the VIF value is non-finite or NaN!
The following snapshot showed the VIF values for some of the 71 binary-encoded columns in my dataset:
In these cases, the number of columns that survive after using @Aakash Basu's and @DanSan's code can sometimes depend on the order of the columns in the dataset, in my painful experience, since columns are dropped greedily based on the maximum VIF value. And a column with only one value is rather pointless for any machine learning model, since it forcibly injects bias into the system!
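To see the failure mode concretely, here is a minimal sketch (synthetic data, not the dataset above): a column with a single unique value, such as an all-zero column left over from binary encoding, produces a NaN VIF, because the underlying regression ends up with a 0/0 R-squared.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "all_zero": np.zeros(50),    # single unique value, e.g. after binary encoding
    "f1": rng.normal(size=50),
    "f2": rng.normal(size=50),
})

# VIF of the constant-zero column: the regression has zero residual and zero
# total sum of squares, so the R-squared (and hence the VIF) is NaN
vif_zero = variance_inflation_factor(df.values, 0)
print(vif_zero)  # nan
```

NaN values make the max(vif) > thresh comparison unreliable, which is why the function below removes single-value columns up front instead of hoping the VIF loop catches them.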
To handle this issue, you can use the following updated code:
from joblib import Parallel, delayed
from statsmodels.stats.outliers_influence import variance_inflation_factor

def removeMultiColl(data, vif_threshold=5.0):
    # Drop single-value columns first, since their VIF is undefined
    for i in data.columns:
        if data[i].nunique() == 1:
            print(f"Dropping {i} due to just 1 unique value")
            data.drop(columns=i, inplace=True)
    drop = True
    col_list = list(data.columns)
    while drop:
        drop = False
        vif_list = Parallel(n_jobs=-1, verbose=5)(
            delayed(variance_inflation_factor)(data[col_list].values, i)
            for i in range(data[col_list].shape[1]))
        max_index = vif_list.index(max(vif_list))
        if vif_list[max_index] > vif_threshold:
            print(f"Dropping column : {col_list[max_index]} at index - {max_index}")
            del col_list[max_index]
            drop = True
    print("Remaining columns :\n", list(data[col_list].columns))
    return data[col_list]
Good luck!