Use KFolds to split dataframe: I want the rows to be split but the columns are getting split instead
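I have a 466 x 1025 dataframe. The columns consist of 1024 variables plus the target. I am running random forest regression on the dataset and trying to use folds to get more consistent predictions. My target is split correctly, but when the split is applied to the data, the columns are split instead of the rows. I get 466 x 372 training data and 466 x 94 test data. I need 372 x 1024 training data and 94 x 1024 test data. How can I fix this? Note: it does work correctly when I use train_test_split().
Code: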
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
#read the data files, verify types
df = pd.read_csv('./allMolecules.csv')
#the data frame is ready, now it's time for the random forest.
#split data into train and test
xTrain, xTest, yTrain, yTest = train_test_split(finalDF.drop(['Target'], axis=1), finalDF['Target'],test_size=0.2)
model = RandomForestRegressor(n_estimators=1000)
output = model.fit(xTrain,yTrain)
score = model.score(xTest,yTest)
print('Model Settings:\n{0}\n'.format(output))
print('R2: {0}'.format(score))
folds = KFold(n_splits=5)
scores = []
data = finalDF.drop(['Target'], axis=1)
for trainIndex, testIndex in folds.split(finalDF.drop(['Target'], axis=1)):
    print(trainIndex, testIndex)
    xTrain = data[trainIndex]
    xTest = (finalDF.drop(['Target'], axis=1))[testIndex]
    yTrain = finalDF['Target'][trainIndex]
    yTest = finalDF['Target'][testIndex]
    print('\n\n{0}\n\n{1}\n\n{2}\n\n{3}'.format(xTrain,xTest,yTrain,yTest))
    output = model.fit(xTrain, yTrain)
    scores.append(model.score(xTest, yTest))
print(scores)
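I think the indexing is going wrong somewhere. KFold only splits along the first axis. Try to keep it simple: separate the data into X and y before indexing with the results of folds.split, and use arrays instead: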
# Convert to NumPy arrays so that integer indexing always selects rows
X = finalDF.drop(['Target'], axis=1).values
y = finalDF['Target'].values
for train_index, test_index in folds.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
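To check the shapes end to end, here is a minimal self-contained sketch of the array-based loop above. The data is synthetic (50 rows, 8 features, hypothetical column names f0..f7), and the small n_estimators value is only there to keep the run fast; neither comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in for the real data: 50 rows, 8 features plus 'Target'.
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))
finalDF = pd.DataFrame(features, columns=[f"f{i}" for i in range(8)])
finalDF["Target"] = features[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

# Separate features and target first, then convert to NumPy arrays.
X = finalDF.drop(["Target"], axis=1).values
y = finalDF["Target"].values

folds = KFold(n_splits=5)
model = RandomForestRegressor(n_estimators=20, random_state=0)
shapes = []
scores = []
for train_index, test_index in folds.split(X):
    # On a NumPy array, an integer index array selects rows.
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    shapes.append((X_train.shape, X_test.shape))
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

print(shapes)
print(scores)
```

Each fold should report 40 training rows and 10 test rows, with all 8 feature columns kept in both parts.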
If you would rather keep working with pandas DataFrames, a solution to your problem looks like this:
import pandas as pd
from sklearn.model_selection import KFold
X = [[ 0.87, -1.34,  0.31, 0],
     [-2.79, -0.02, -0.85, 1],
     [-1.34, -0.48, -2.55, 0],
     [ 1.92,  1.48,  0.65, 1]]
finalDF = pd.DataFrame(X * 20, columns=['col1', 'col2', 'col3', 'Target'])
folds = KFold(n_splits=5)
scores = []
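for trainIndex, testIndex in folds.split(finalDF.drop(['Target'], axis=1)):
    # print(trainIndex, testIndex)
    # .loc selects rows by index label; 'Target' is still included here,
    # which is why the shapes below show 4 columns
    xTrain = finalDF.loc[trainIndex, :]
    xTest = finalDF.loc[testIndex, :]
    print(xTrain.shape, xTest.shape)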
For this example, the printed output is:
(64, 4) (16, 4)
(64, 4) (16, 4)
(64, 4) (16, 4)
(64, 4) (16, 4)
(64, 4) (16, 4)
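The problem is that plain square-bracket indexing on a DataFrame looks the values up as column labels, so data[trainIndex] selects columns rather than rows. When you access a DataFrame, it is better to say explicitly whether you mean rows or columns, and the loc method is a good way to do that. For y you get the right result because finalDF['Target'] is already a pd.Series before you index it, and indexing a Series with integers selects rows.
Hope this helps!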
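The difference between the indexing styles can be seen on a toy frame. This is only an illustrative sketch: the 6 x 10 frame and its integer column labels are made up, chosen so that plain brackets find matching column labels as they apparently did in the question.

```python
import numpy as np
import pandas as pd

# Toy frame: 6 rows, 10 columns labelled 0..9 (hypothetical data).
df = pd.DataFrame(np.arange(60).reshape(6, 10))
idx = np.array([0, 1, 2, 3])

by_brackets = df[idx]    # plain [] treats the values as column labels
by_iloc = df.iloc[idx]   # .iloc selects rows by position
by_loc = df.loc[idx, :]  # .loc selects rows by index label

print(by_brackets.shape)
print(by_iloc.shape)
print(by_loc.shape)
```

KFold.split yields positional indices, so .iloc is the safer choice when the frame's index is not the default 0..n-1 range (for example after filtering or shuffling rows). With a default index, .loc and .iloc select the same rows.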