Error on partitioning the data set

Question

I want to split a data set into a training set and a cross-validation set. This is how I am doing it; train is a pandas DataFrame.

import numpy as np

#...
features = ['season', 'holiday', 'workingday', 'weather',
        'temp', 'atemp', 'humidity', 'windspeed', 'year',
        'month', 'weekday', 'hour']

train = pd.read_csv('data/train.csv', parse_dates=[0])

np.random.shuffle(train)
training, crossvalidation = train[:0.8*len(train),features], train[0.8*len(train):,features]

This code gives the following error:

Traceback (most recent call last):
  File "D:/Web/PyCharm/linear_regression.py", line 47, in <module>
    np.random.shuffle(train)
  File "mtrand.pyx", line 4607, in mtrand.RandomState.shuffle (numpy\random\mtrand\mtrand.c:25420)
  File "mtrand.pyx", line 4610, in mtrand.RandomState.shuffle (numpy\random\mtrand\mtrand.c:25361)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1791, in __getitem__
    return self._getitem_column(key)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 1798, in _getitem_column
    return self._get_item_cache(key)
  File "C:\Python27\lib\site-packages\pandas\core\generic.py", line 1084, in _get_item_cache
    values = self._data.get(item)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 2851, in get
    loc = self.items.get_loc(item)
  File "C:\Python27\lib\site-packages\pandas\core\index.py", line 1578, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas\index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas\index.c:3811)
  File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:3691)
  File "pandas\hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12336)
  File "pandas\hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12287)
KeyError: 8953

This is the output of train.head():

             datetime  season  holiday  workingday  weather  temp   atemp  \
0 2011-01-01 00:00:00       1        0           0        1  9.84  14.395   
1 2011-01-01 01:00:00       1        0           0        1  9.02  13.635   
2 2011-01-01 02:00:00       1        0           0        1  9.02  13.635   
3 2011-01-01 03:00:00       1        0           0        1  9.84  14.395   
4 2011-01-01 04:00:00       1        0           0        1  9.84  14.395   

   humidity  windspeed  casual  registered  count  year  month  hour  weekday  \
0        81          0       3          13     16  2011      1     0        5   
1        80          0       8          32     40  2011      1     1        5   
2        80          0       5          27     32  2011      1     2        5   
3        75          0       3          10     13  2011      1     3        5   
4        75          0       0           1      1  2011      1     4        5   

   log-casual  log-registered  log-count  
0    1.386294        2.639057   2.833213  
1    2.197225        3.496508   3.713572  
2    1.791759        3.332205   3.496508  
3    1.386294        2.397895   2.639057  
4    0.000000        0.693147   0.693147

Answer 1

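Your problem comes from np.random.shuffle(train). shuffle swaps elements by integer position, but indexing a DataFrame with [] looks up a column label, not a row, which is why you get KeyError: 8953. You need np.random.shuffle(train.values) instead.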
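To see why the integer lookup fails, here is a minimal sketch on a small made-up frame (the values below are placeholders, not the real train.csv). It only illustrates the difference between label lookup with [] and positional lookup with .iloc.

```python
import pandas as pd

# Hypothetical stand-in for the asker's data, with string column labels.
df = pd.DataFrame({'season': [1, 2, 3, 4],
                   'temp': [9.84, 9.02, 9.02, 9.84]})

# df[3] asks for a COLUMN labelled 3, which does not exist.
try:
    df[3]
    raised = False
except KeyError:
    raised = True

# .iloc[3] asks for the ROW at position 3, which does exist.
row = df.iloc[3]

print(raised, row['season'])
```

This is the same kind of lookup that shuffle performs internally when it is handed a DataFrame, so any row position that is not also a column label ends in a KeyError.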

Also, you cannot slice with floats: 0.8*len(train) is a float, so you need to convert it to an int first.
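Putting both points together, one possible way to do the split is sketched below. It is an alternative to shuffling train.values, not the exact fix described above: when a frame mixes column types (here a datetime column next to numbers), train.values may be a copy, so shuffling it can leave train itself unchanged. The sketch shuffles row positions instead and selects with .iloc, since plain [] does not accept a row slice and a column list together. The sample frame, the shortened features list and the fixed seed are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small stand-in for the asker's data (hypothetical values, not the real train.csv).
train = pd.DataFrame({
    'datetime': pd.date_range('2011-01-01', periods=10),
    'season': [1] * 10,
    'temp': np.linspace(9.0, 10.0, 10),
    'hour': list(range(10)),
})
features = ['season', 'temp', 'hour']

# Shuffle row positions rather than the DataFrame itself.
rng = np.random.RandomState(0)  # fixed seed only to make the sketch repeatable
order = rng.permutation(len(train))

# Slice bounds must be integers, so convert the 80% cut-off explicitly.
split = int(0.8 * len(train))

# .iloc selects rows by position; [features] then keeps the feature columns.
training = train.iloc[order[:split]][features]
crossvalidation = train.iloc[order[split:]][features]

print(len(training), len(crossvalidation))
```

Every row lands in exactly one of the two sets, and train itself keeps its original order, which can make later debugging easier.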

Tags: python, partitioning, numpy, pandas