A strange KeyError is preventing me from testing my logit regression classifier?
I am trying to run a logistic regression from statsmodels in Python inside a for loop. On each pass through the loop I append one row from the test data to my training DataFrame, then rerun the regression and store the results.
Now, interestingly, the test data is not being appended properly (I think this is what is causing the KeyError: 0, but I'd like your opinion on that). I have tried importing two versions of the test data - one with the same labels as the training data, and one without any declared labels.
Here is my code:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import datetime
df_train = pd.read_csv('Adult-Incomes/train-labelled-final-variables-condensed-coded-countries-removed-unlabelled-income-to-the-left-relabelled-copy.csv')
print('Training set')
print(df_train.head(15))
train_cols = df_train.columns[1:]
logit = sm.Logit(df_train['Income'], df_train[train_cols])
result = logit.fit()
print("ODDS RATIO")
print(result.params)
print("RESULTS SUMMARY")
print(result.summary())
print("CONFIDENCE INTERVAL")
print(result.conf_int())
# append test data
print("PREDICTION PROCESS")
print("READING TEST DATA")
df_test = pd.read_csv('Adult-Incomes/test-final-variables-cleaned-coded-copy-relabelled.csv')
print("TEST DATA READ COMPLETE")
iteration_time = []
iteration_result = []
iteration_params = []
iteration_conf_int = []
df_train.to_pickle('train_iteration.pickle')
print(df_test.head())
print("Loop begins")
for row in range(0,len(df_test)):
    start_time = datetime.datetime.now()
    print("Loop iteration ", row, " in ", len(df_test), " rows")
    df_train = pd.read_pickle('train_iteration.pickle')
    print("pickle read")
    df_train.append(df_test[row])
    print("row ", row, " appended")
    train_cols = df_train.columns[1:]
    print("X variables extracted in new DataFrame")
    logit = sm.Logit(df_train['Income'], df_train[train_cols])
    print("Def logit reg eqn")
    result = logit.fit()
    print("fit logit reg eqn")
    iteration_result[row] = result.summary()
    print("logit result summary stored in array")
    iteration_params[row] = result.params
    print("logit params stored in array")
    iteration_conf_int[row] = result.conf_int()
    print("logit conf_int stored in array")
    df_train.to_pickle('train_iteration.pickle')
    print("exported to pickle")
    end_time = datetime.datetime.now()
    time_diff = start_time - end_time
    print("time for this iteration is ", time_diff)
    iteration_time[row] = time_diff
    print("ending iteration, starting next iteration of loop...")
print("Loop ends")
pd.DataFrame(iteration_result)
pd.DataFrame(iteration_time)
print (iteration_result.head())
print (iteration_time.head())
It prints up to this point:
Loop iteration 0 in 15060 rows
pickle read
But then it raises KeyError: 0.
What am I doing wrong?
Version of the test data with labels matching the training data:
Income Age Workclass Education Marital_Status Occupation \
0 0 1 4 7 4 6
1 0 1 4 9 2 4
2 1 1 6 12 2 10
3 1 1 4 10 2 6
4 0 1 4 6 4 7
Relationship Race Sex Capital_gain Capital_loss Hours_per_week
0 3 2 0 0 0 40
1 0 4 0 0 0 50
2 0 4 0 0 0 40
3 0 2 0 7688 0 40
4 1 4 0 0 0 30
Version of the test data without labels:
0 1 4 7 4.1 6 3 2 0.1 0.2 0.3 40
0 0 1 4 9 2 4 0 4 0 0 0 50
1 1 1 6 12 2 10 0 4 0 0 0 40
2 1 1 4 10 2 6 0 2 0 7688 0 40
3 0 1 4 6 4 7 1 4 0 0 0 30
4 1 2 2 15 2 9 0 4 0 3103 0 32
In both cases - whether I use the labelled or the unlabelled version of the test data - I still get the same error at the same point.
Can anyone guide me on how best to proceed?
Update: here is the full error message (the first three lines are print output; the error starts from the fourth line):
Loop begins
Loop iteration 0 in 15060 rows
pickle read
Traceback (most recent call last):
File "<ipython-input-10-1f56d5243e43>", line 1, in <module>
runfile('/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py', wdir='/media/deepak/Laniakea/Projects/Training/SPYDER/classifier')
File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 866, in runfile
execfile(filename, namespace)
File "/usr/local/lib/python3.5/dist-packages/spyder/utils/site/sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/media/deepak/Laniakea/Projects/Training/SPYDER/classifier/classifier_test2.py", line 64, in <module>
df_train.append(df_test[row])
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 3541, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python3.5/dist-packages/pandas/indexes/base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/index.pyx", line 139, in pandas.index.IndexEngine.get_loc (pandas/index.c:4443)
File "pandas/index.pyx", line 161, in pandas.index.IndexEngine.get_loc (pandas/index.c:4289)
File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13733)
File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13687)
KeyError: 0
Update:
On the last line of the output from a print(df_train.std()) statement, after the std dev of all the columns, I get this: dtype: float64
So I'm guessing my training DataFrame is being treated as floats.
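A side note on that check: std() always produces floating-point values, so the dtype: float64 at the end of its output does not by itself mean the columns are stored as floats. A quick sketch of how to inspect the actual storage types and, only if the target really had been read as float, cast it back (assuming the target column is named Income, as above):
print(df_train.dtypes)                               # per-column storage types
df_train['Income'] = df_train['Income'].astype(int)  # cast only if needed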
I think I've got it... instead of the code below -
df_train.append(df_test[row])
print("row ", row, " appended")
rewrite it as -
df_train.append(df_test.iloc[row])
df_train = df_train.reset_index()
print("row ", row, " appended")
Let me know if this does the trick... resetting the index each time is essential... One thing though - if your test set is fairly large, this will be a computational disaster, since you are retraining on every data point seen in the test set...
Just a suggestion on the side - if you really do want to train it in near real time, try using batches or chunks of the test set instead... (see the sketch below)
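To tie the fix together, here is a minimal sketch of how the whole loop could look, assuming the test CSV has the same column layout as the training CSV (the file names train.csv and test.csv are placeholders, not your actual paths). It selects the row by position with .iloc (plain df_test[row] looks up a column label, which is exactly what raises KeyError: 0), uses pd.concat instead of DataFrame.append (append returns a new frame rather than modifying df_train in place, and has been removed in recent pandas versions), and grows the result lists with list.append rather than assigning into empty lists, which would raise IndexError:
import datetime

import pandas as pd
import statsmodels.api as sm

df_train = pd.read_csv('train.csv')   # placeholder paths
df_test = pd.read_csv('test.csv')     # assumed to share df_train's columns

iteration_params = []
iteration_conf_int = []
iteration_time = []

for row in range(len(df_test)):
    start_time = datetime.datetime.now()

    # .iloc[[row]] keeps the selection as a one-row DataFrame;
    # ignore_index=True renumbers the combined index so it stays unique.
    df_train = pd.concat([df_train, df_test.iloc[[row]]], ignore_index=True)

    train_cols = df_train.columns[1:]
    result = sm.Logit(df_train['Income'], df_train[train_cols]).fit(disp=0)

    iteration_params.append(result.params)
    iteration_conf_int.append(result.conf_int())
    iteration_time.append(datetime.datetime.now() - start_time)
If retraining on every single row is too slow, the same pattern works one chunk at a time, e.g. looping over np.array_split(df_test, 100) and concatenating a whole chunk per iteration instead of a single row.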