Why am I getting an error with GroupShuffleSplit (train/test split)?
I have 2 datasets and am applying 5 different ML models to them.
Dataset 1:
def dataset_1():
    ...
    ...
    bike_data_hours = bike_data_hours[:500]
    X = bike_data_hours.iloc[:, :-1].values
    y = bike_data_hours.iloc[:, -1].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, X_test, y_train.reshape(-1, 1), y_test.reshape(-1, 1)
The shapes are (400, 14) (100, 14) (400, 1) (100, 1). dtypes: object (int64, float64).
Dataset 2:
def dataset_2():
    ...
    ...
    final_movie_df = final_movie_df[:500]
    X = final_movie_df.iloc[:, :-1]
    y = final_movie_df.iloc[:, -1]
    gs = GroupShuffleSplit(n_splits=2, test_size=0.2)
    train_ix, test_ix = next(gs.split(X, y, groups=X.UserID))
    X_train = X.iloc[train_ix]
    y_train = y.iloc[train_ix]
    X_test = X.iloc[test_ix]
    y_test = y.iloc[test_ix]
    return X_train.shape, X_test.shape, y_train.values.reshape(-1,1).shape, y_test.values.reshape(-1,1).shape
The shapes are (400, 25) (100, 25) (400, 1) (100, 1). dtypes: object (int64, float64).
I am using different models. The code is:
X_train, X_test, y_train, y_test = dataset
fold_residuals, fold_dfs = [], []
kf = KFold(n_splits=k, shuffle=True)
for train_index, _ in kf.split(X_train):
    if reg_name == "RF" or reg_name == "SVR":
        preds = regressor.fit(X_train[train_index], y_train[train_index].ravel()).predict(X_test)
    elif reg_name == "Knn-5":
        preds = regressor.fit(X_train[train_index], np.ravel(y_train[train_index], order="C")).predict(X_test)
    else:
        preds = regressor.fit(X_train[train_index], y_train[train_index]).predict(X_test)
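However, I am getting a common error that is covered in several other posts. I went through all of those posts but still could not work out what is wrong. I used iloc and values, as the linked solutions suggest.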
preds = regressor.fit(X_train[train_index], y_train[train_index]).predict(X_test)
File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 3030, in __getitem__
indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
File "/home/fgd/.local/lib/python3.8/site-packages/pandas/core/indexing.py", line 1308, in _validate_read_indexer
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Int64Index([ 0, 1, 3, 4, 5, 6, 7, 9, 10, 11,\n ...\n 387, 388, 389, 390, 391, 392, 393, 395, 397, 399],\n dtype='int64', length=320)] are in the [columns]"
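Here, if I use train_test_split instead of GroupShuffleSplit, the code works. However, I want to use GroupShuffleSplit based on UserID so that the same user does not appear in both the training and test sets. Can you tell me how to solve this while still using GroupShuffleSplit?
Can you also tell me why dataset_2 raises the error while dataset_1 works fine, even though the shape and dtypes are the same for both datasets?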
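Your dataset_2 must use values as well. dataset_1 returns NumPy arrays, so X_train[train_index] selects rows by position. dataset_2 returns pandas objects, and indexing a DataFrame with [] and a list of integers looks them up as column labels, which is why the KeyError says none of them are in the [columns]. Make this change: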
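    # Convert each split to a NumPy array so the later X_train[train_index] indexes rows
    X_train = X.iloc[train_ix].values
    y_train = y.iloc[train_ix].values
    X_test = X.iloc[test_ix].values
    y_test = y.iloc[test_ix].values
    # Return the arrays themselves (not their .shape) so the model loop can use them
    return X_train, X_test, y_train.reshape(-1, 1), y_test.reshape(-1, 1)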
Hope it works now.
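To see the difference in isolation, here is a minimal sketch on made-up data. The DataFrame, its column names (feature, rating) and the fixed random_state values are assumptions for illustration only and do not come from the question's final_movie_df. It checks that GroupShuffleSplit keeps each UserID on one side of the split, that indexing the resulting DataFrame with [] and KFold's integer positions raises a KeyError, and that .values or .iloc select the intended rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, KFold

# Hypothetical stand-in for final_movie_df: 20 rows from 5 users.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "UserID": np.repeat(np.arange(5), 4),
    "feature": rng.normal(size=20),
    "rating": rng.integers(1, 6, size=20),
})
X, y = df.iloc[:, :-1], df.iloc[:, -1]

# Group-aware split: all rows of a given user land on the same side.
gs = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_ix, test_ix = next(gs.split(X, y, groups=X["UserID"]))
X_train_df = X.iloc[train_ix]  # still a DataFrame, as in dataset_2
shared_users = set(X["UserID"].iloc[train_ix]) & set(X["UserID"].iloc[test_ix])

# KFold yields integer positions into X_train_df.
kf = KFold(n_splits=2, shuffle=True, random_state=0)
fold_ix, _ = next(kf.split(X_train_df))

# DataFrame[...] with an integer array is a column-label lookup,
# so it fails because the columns are named "UserID" and "feature".
try:
    X_train_df[fold_ix]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Either of these selects rows by position instead.
rows_from_array = X_train_df.values[fold_ix]
rows_from_iloc = X_train_df.iloc[fold_ix]
```

If you would rather keep the DataFrames (for example, to preserve column names), the other option is to leave dataset_2 as it is and use X_train.iloc[train_index] and y_train.iloc[train_index] inside the KFold loop.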