How can we include a prediction column in the initial dataset/dataframe after performing K-Fold cross validation?
I want to run K-fold cross validation on my data using a classifier. I want to include the prediction (or predicted probability) for each sample as a column directly in the initial dataset/dataframe. Any ideas?
from sklearn.metrics import accuracy_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import KFold

# X, y and model are defined elsewhere
k = 5
kf = KFold(n_splits=k, random_state=None)
acc_score = []
auroc_score = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train, y_train)
    pred_values = model.predict(X_test)
    predict_prob = model.predict_proba(X_test.values)[:, 1]

    auroc = roc_auc_score(y_test, predict_prob)
    acc = accuracy_score(pred_values, y_test)
    auroc_score.append(auroc)
    acc_score.append(acc)

avg_acc_score = sum(acc_score)/k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))
print('AUROC of each fold - {}'.format(auroc_score))
print('Avg AUROC : {}'.format(sum(auroc_score)/k))
Given this code, how would I go about adding a prediction column for every sample in the initial dataset, or better yet, a prediction-probability column?
In 10-fold cross validation, each example (sample) is used once in the test set and nine times in the training set. So, after 10-fold cross validation, the result should be a dataframe in which I have the predicted class for every example in the dataset. Each example keeps its initial features, its labelled class, and the class prediction computed in the cross-validation fold in which that example was used in the test set.
You can use the .loc method to accomplish this. This question has a good answer showing how to use it: df.loc[index_position, "column_name"] = some_value
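For reference, here is a minimal, self-contained sketch of that .loc assignment pattern on a toy dataframe (the column name and index positions are purely illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df['Prediction'] = 0                     # create the column first
df.loc[[0, 2], 'Prediction'] = [10, 30]  # write values at specific index labels
print(df)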
So here is an edited version of the code you posted (I needed data, and I dropped auc_roc since we are not using probabilities, per your edit):
from sklearn.metrics import accuracy_score, roc_auc_score
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = MLPClassifier()

k = 5
kf = KFold(n_splits=k, random_state=None)
acc_score = []
auroc_score = []

# Create the prediction column
X['Prediction'] = 1

# Define which columns to use as model features
model_columns = [x for x in X.columns if x != 'Prediction']

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y[train_index], y[test_index]

    model.fit(X_train[model_columns], y_train)
    pred_values = model.predict(X_test[model_columns])

    acc = accuracy_score(pred_values, y_test)
    acc_score.append(acc)

    # Add this fold's predictions to the dataframe
    X.loc[test_index, 'Prediction'] = pred_values

avg_acc_score = sum(acc_score)/k
print('accuracy of each fold - {}'.format(acc_score))
print('Avg accuracy : {}'.format(avg_acc_score))

# Add the label back, per the question
X['Label'] = y

# Print the first 5 rows to show that it works
print(X.head(n=5))
This yields:
accuracy of each fold - [0.9210526315789473, 0.9122807017543859, 0.9736842105263158, 0.9649122807017544, 0.8672566371681416]
Avg accuracy : 0.927837292345909
mean radius mean texture ... Prediction Label
0 17.99 10.38 ... 0 0
1 20.57 17.77 ... 0 0
2 19.69 21.25 ... 0 0
3 11.42 20.38 ... 1 0
4 20.29 14.34 ... 0 0
[5 rows x 32 columns]
(Obviously the model/values etc. are arbitrary.)
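If you also want the prediction-probability column the question mentions, the same .loc pattern works with predict_proba. A minimal, self-contained sketch (the classifier and max_iter here are arbitrary choices, and [:, 1] assumes a binary problem where column 1 is the positive class):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=5000)
kf = KFold(n_splits=5)

feature_columns = list(X.columns)   # the original feature columns only
X['Pred_Proba'] = 0.0               # float column to receive the probabilities

for train_index, test_index in kf.split(X):
    model.fit(X.iloc[train_index][feature_columns], y[train_index])
    # Probability of the positive class for the held-out fold
    X.loc[test_index, 'Pred_Proba'] = model.predict_proba(
        X.iloc[test_index][feature_columns])[:, 1]

X['Label'] = y
print(X[['Pred_Proba', 'Label']].head())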
You can use cross_val_predict; see the help page. It basically returns the cross-validated estimates for you:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.linear_model import LogisticRegression
import pandas as pd

X, y = make_classification()
df = pd.DataFrame(X, columns=["feature{:02d}".format(i) for i in range(X.shape[1])])
df['label'] = y

# Out-of-fold predictions for every sample, added straight to the dataframe
df['pred'] = cross_val_predict(LogisticRegression(), X, y, cv=KFold(5))
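For the probability column the question also asks about, cross_val_predict takes a method argument; with method="predict_proba" it returns one column per class, so for this binary toy problem column 1 can serve as the positive-class probability. A sketch under those assumptions:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.linear_model import LogisticRegression
import pandas as pd

X, y = make_classification()
df = pd.DataFrame(X, columns=["feature{:02d}".format(i) for i in range(X.shape[1])])
df['label'] = y

# One column per class; keep the probability of the positive class
proba = cross_val_predict(LogisticRegression(), X, y, cv=KFold(5), method="predict_proba")
df['pred_proba'] = proba[:, 1]
print(df[['label', 'pred_proba']].head())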