IndexError: index 1967 is out of bounds for axis 0 with size 1967

Question

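I am reducing the number of features in a large sparse file by computing p-values, but I get the error below. I have looked at similar posts, and this code works for non-sparse input. Can you help? (I can upload the input file if needed.)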

import statsmodels.formula.api as sm

def backwardElimination(x, Y, sl, columns):
    numVars = len(x[0])
    pvalue_removal_counter = 0

    for i in range(0, numVars):
        print(i, 'of', numVars)
        regressor_OLS = sm.OLS(Y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)

        if maxVar > sl:
            for j in range(0, numVars - i):
                if (regressor_OLS.pvalues[j].astype(float) == maxVar):
                    x = np.delete(x, j, 1)
                    pvalue_removal_counter += 1
                    columns = np.delete(columns, j)

    regressor_OLS.summary()
    return x, columns

Output:

0 of 1970
1 of 1970
2 of 1970
Traceback (most recent call last):
  File "main.py", line 142, in <module>
    selected_columns)
  File "main.py", line 101, in backwardElimination
    if (regressor_OLS.pvalues[j].astype(float) == maxVar):
IndexError: index 1967 is out of bounds for axis 0 with size 1967

Answer 1

Here is a fixed version.

I made a few changes:

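Import OLS through statsmodels.api rather than statsmodels.formula.api, since the formula interface is not used.
Track columns inside the function instead of passing it in.
Use np.argmax to find the position of the largest p-value.
Use boolean indexing to select columns. In pseudocode, x[:, [True, False, True]] keeps columns 0 and 2.
Stop when there is nothing left to drop.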
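The np.argmax and boolean-indexing steps can be sketched on their own, without fitting a model. This is a minimal illustration only: the helper name drop_max_column and the p-values are made up for the example. The point is that np.argmax returns a single index (the first one when values tie), so exactly one column is dropped per call.

```python
import numpy as np

def drop_max_column(x, pvalues):
    """Drop the single column of x whose p-value is largest (hypothetical helper)."""
    retain = np.ones(x.shape[1], dtype=bool)
    # argmax returns the first index when several entries tie for the maximum
    retain[np.argmax(pvalues)] = False
    return x[:, retain], retain

x = np.arange(12).reshape(3, 4)
pvalues = np.array([0.02, 0.90, 0.90, 0.01])  # made-up values; indices 1 and 2 tie

x_reduced, retain = drop_max_column(x, pvalues)
print(retain)           # [ True False  True  True]
print(x_reduced.shape)  # (3, 3)
```

Only column 1 is removed even though column 2 has the same p-value; column 2 would be picked up on a later pass after refitting.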

import numpy as np
# The question imported statsmodels.formula.api; the formula interface
# is not used here, so import statsmodels.api instead
import statsmodels.api as sm

def backwardElimination(x, Y, sl):
    numVars = x.shape[1]  # variables are in columns
    columns = np.arange(numVars)  # original column indices

    for i in range(0, numVars):
        print(i, 'of', numVars)
        regressor_OLS = sm.OLS(Y, x).fit()
        # Largest p-value in the current fit
        maxVar = np.max(regressor_OLS.pvalues)

        if maxVar > sl:
            # Use boolean selection
            retain = np.ones(x.shape[1], bool)
            drop = np.argmax(regressor_OLS.pvalues)
            # Drop the column with the highest p-value
            retain[drop] = False
            # Keep the columns of x we wish to retain
            x = x[:, retain]
            # Also keep their column indices
            columns = columns[retain]
        else:
            # Exit early once no p-value is above sl
            break

    # Show the summary of the last fit
    print(regressor_OLS.summary())
    return x, columns

You can test it with:

x = np.random.standard_normal((1000,100))
y = np.random.standard_normal(1000)
backwardElimination(x,y,0.1)
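As a side note on the original error: in the traceback, pass i = 2 scans range(0, 1968) while pvalues has size 1967, so three columns had been deleted after only two passes. One plausible explanation, which I am assuming rather than having confirmed on the actual data, is that several columns shared the maximum p-value and the equality test deleted all of them in one pass. The sketch below reproduces only that bookkeeping with made-up p-values; tied_removals is a hypothetical helper.

```python
import numpy as np

def tied_removals(pvalues):
    """Number of columns an equality test against the maximum would delete."""
    pvalues = np.asarray(pvalues, dtype=float)
    return int(np.sum(pvalues == pvalues.max()))

# Hypothetical p-values for a 4-column design; columns 0 and 2 tie for the maximum.
pvalues = [0.90, 0.10, 0.90, 0.05]
num_vars = len(pvalues)

removed = tied_removals(pvalues)   # both tied columns are deleted in pass i = 0
remaining = num_vars - removed     # columns actually left for pass i = 1
scanned = num_vars - 1             # what range(0, numVars - i) scans at i = 1

print(removed, remaining, scanned)  # 2 2 3
# scanned > remaining, so pvalues[j] would be indexed past the end on the next pass
```

Dropping exactly one column per pass, as the np.argmax version does, keeps the loop counter and the number of columns in step.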

Tags: python, numpy, statsmodels, p-value, index-error