A strange KeyError is preventing me from testing my logit regression classifier?

I'm trying to run a logistic regression from statsmodels in Python inside a for loop. On each iteration I append one row of the test data to my training DataFrame, then re-run the regression and store the results.

Now, interestingly, the test data is not being appended correctly (which I think is what causes the KeyError: 0, but I'd welcome your opinion on that). I have tried importing two versions of the test data: one with the same labels as the training data, and one without declared labels.

Here is my code:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import datetime 

df_train = pd.read_csv('Adult-Incomes/train-labelled-final-variables-condensed-coded-countries-removed-unlabelled-income-to-the-left-relabelled-copy.csv')
print('Training set')
print(df_train.head(15))

train_cols = df_train.columns[1:]
logit = sm.Logit(df_train['Income'], df_train[train_cols])
result = logit.fit()

print("ODDS RATIO")
print(result.params)
print("RESULTS SUMMARY")
print(result.summary())
print("CONFIDENCE INTERVAL")
print(result.conf_int())

# append test data

print("PREDICTION PROCESS")
print("READING TEST DATA")
df_test = pd.read_csv('Adult-Incomes/test-final-variables-cleaned-coded-copy-relabelled.csv')
print("TEST DATA READ COMPLETE")

iteration_time = []
iteration_result = []
iteration_params = []
iteration_conf_int = []

df_train.to_pickle('train_iteration.pickle')
print(df_test.head())

print("Loop begins")

for row in range(0,len(df_test)):
    start_time = datetime.datetime.now()
    print("Loop iteration ", row, " in ", len(df_test), " rows")

    df_train = pd.read_pickle('train_iteration.pickle')
    print("pickle read")
    df_train.append(df_test[row])
    print("row ", row, " appended")
    train_cols = df_train.columns[1:]
    print("X variables extracted in new DataFrame")
    logit = sm.Logit(df_train['Income'], df_train[train_cols])
    print("Def logit reg eqn")
    result = logit.fit()
    print("fit logit reg eqn")
    iteration_result[row] = result.summary()
    print("logit result summary stored in array")
    iteration_params[row] = result.params
    print("logit params stored in array")
    iteration_conf_int[row] = result.conf_int()
    print("logit conf_int stored in array")

    df_train.to_pickle('train_iteration.pickle')
    print("exported to pickle")

    end_time = datetime.datetime.now()
    time_diff = end_time - start_time
    print("time for this iteration is ", time_diff)
    iteration_time[row] = time_diff
    print("ending iteration, starting next iteration of loop...")

print("Loop ends")

pd.DataFrame(iteration_result)
pd.DataFrame(iteration_time)
print (iteration_result.head())
print (iteration_time.head())

It prints up to this point:

Loop iteration  0  in  15060  rows
pickle read

but then it produces KeyError: 0.

What am I doing wrong?

Version of the test data with labels matching the training data:

   Income  Age  Workclass  Education  Marital_Status  Occupation  \
0       0    1          4          7               4           6   
1       0    1          4          9               2           4   
2       1    1          6         12               2          10   
3       1    1          4         10               2           6   
4       0    1          4          6               4           7   

   Relationship  Race  Sex  Capital_gain  Capital_loss  Hours_per_week  
0             3     2    0             0             0              40  
1             0     4    0             0             0              50  
2             0     4    0             0             0              40  
3             0     2    0          7688             0              40  
4             1     4    0             0             0              30  

Version of the test data without labels:

   0  1  4   7  4.1   6  3  2  0.1   0.2  0.3  40
0  0  1  4   9    2   4  0  4    0     0    0  50
1  1  1  6  12    2  10  0  4    0     0    0  40
2  1  1  4  10    2   6  0  2    0  7688    0  40
3  0  1  4   6    4   7  1  4    0     0    0  30
4  1  2  2  15    2   9  0  4    0  3103    0  32

In both cases, whether I use the labelled or the unlabelled test data, I still get the same error at the same point.

Can anyone advise me on how best to proceed?

Update: here is the full error message (the first three lines are print statements; the error starts at the fourth line):

Loop begins
Loop iteration  0  in  15060  rows
pickle read
Traceback (most recent call last):

  File "<ipython-input-10-1f56d5243e43>", line 1, in <module>
    runfile('/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py', wdir='/media/deepak/Laniakea/Projects/Training/SPYDER/classifier')

  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py", line 64, in <module>
    df_train.append(df_test[row])

  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)

  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)

  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)

  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3541, in get
    loc = self.items.get_loc(item)

  File "/usr/local/lib/python3.5/dist-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))

  File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)

  File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)

  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)

  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)

KeyError: 0

Update: at the end of a print(df_train.std()) statement, after the standard deviations of all the columns, I get dtype: float64. So I guess my training DataFrame is being treated as floats.

I think I've figured it out... instead of the code below:

df_train.append(df_test[row])
print("row ", row, " appended")

rewrite it as:

df_train = df_train.append(df_test.iloc[row])  # append returns a new frame, so assign it
df_train = df_train.reset_index(drop=True)     # drop=True avoids adding an 'index' column
print("row ", row, " appended")

Let me know whether this does the trick... resetting the index on every iteration is essential... One thing, though: if your test set is fairly large, this will be a computational disaster, retraining on every data point seen so far in the test set...
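Putting the fix together, here is a minimal, self-contained sketch of the per-iteration append. The toy data below is a hypothetical stand-in for your CSV files, and I've used pd.concat rather than append (DataFrame.append was later removed in pandas 2.0; concat with ignore_index=True also makes the separate reset_index step unnecessary):

```python
import pandas as pd

# Hypothetical stand-ins for the real CSV files, for illustration only
df_train = pd.DataFrame({'Income': [0, 1], 'Age': [1, 2], 'Hours_per_week': [40, 50]})
df_test = pd.DataFrame({'Income': [1], 'Age': [3], 'Hours_per_week': [32]})

for row in range(len(df_test)):
    # df_test[row] looks up a COLUMN named 0, hence KeyError: 0.
    # .iloc[[row]] selects the row by position and keeps it a DataFrame.
    # ignore_index=True renumbers the rows, so no reset_index is needed.
    df_train = pd.concat([df_train, df_test.iloc[[row]]], ignore_index=True)

print(len(df_train))  # 3
```

One more thing to watch for: iteration_result[row] = result.summary() assigns into an empty Python list, which will raise IndexError once the KeyError is fixed; use iteration_result.append(result.summary()) (and likewise for the other result lists) instead.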

Just one suggestion, somewhat outside the scope of the question: if you really want to train it in near real time, try using batches or chunks of the test set...
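A minimal sketch of that batch idea (the toy frame and the batch size of 4 are my own choices for illustration, not something from the question):

```python
import pandas as pd

df_test = pd.DataFrame({'x': range(10)})  # hypothetical stand-in for the test set
batch_size = 4  # arbitrary choice for illustration

batch_sizes = []
for start in range(0, len(df_test), batch_size):
    batch = df_test.iloc[start:start + batch_size]
    # In the real loop you would concat the whole batch onto df_train
    # and refit once per batch instead of once per row, e.g.:
    #   df_train = pd.concat([df_train, batch], ignore_index=True)
    #   result = sm.Logit(df_train['Income'], df_train[train_cols]).fit()
    batch_sizes.append(len(batch))

print(batch_sizes)  # [4, 4, 2]
```

This cuts the number of refits from len(df_test) down to len(df_test) / batch_size, at the cost of slightly staler models between refits.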