Logistic regression and cross-validation in Python (with sklearn)

Question

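I am trying to solve a classification problem on a given dataset with logistic regression (that is not the problem). To avoid overfitting, I am trying to do this with cross-validation (that is the problem): I am missing something to complete the program. My goal here is to determine the accuracy.

But let me be specific. This is what I have done:

I split the set into a training set and a test set
I defined the logistic regression prediction model to use
I used the cross_val_predict method (in sklearn.cross_validation) to make predictions
Finally, I measured the accuracy

Here is the code: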

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn import metrics, cross_validation
from sklearn.linear_model import LogisticRegression

# read training data in pandas dataframe
data = pd.read_csv("./dataset.csv", delimiter=';')
# last column is target, store in array t
t = data['TARGET']
# list of features, including target
features = data.columns
# item feature matrix in X
X = data[features[:-1]].as_matrix()
# remove first column because it is not necessary in the analysis
X = np.delete(X,0,axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

# define method
logreg=LogisticRegression()

# cross-validation prediction
predicted = cross_validation.cross_val_predict(logreg, X_train, t_train, cv=10)
print(metrics.accuracy_score(t_train, predicted))

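My question:

As I understand it, the test set should not be considered until the very end, and cross-validation should be performed on the training set. That is why I passed X_train and t_train to the cross_val_predict method. However, I get an error message: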

ValueError: Found input variables with inconsistent numbers of samples: [6016, 4812]

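where 6016 is the number of samples in the whole dataset and 4812 is the number of samples in the training set after the split.
After that, I don't know what to do. I mean: when do X_test and t_test come into play? I don't understand how I should use them after cross-validation, or how to get the final accuracy.

Bonus question: I would also like to perform scaling and dimensionality reduction (via feature selection or PCA) within each step of the cross-validation. How can I do that? I have seen that defining a pipeline can help with scaling, but I don't know how to apply it to the second problem.

Any help is much appreciated :-)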

Answer 1

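Here is working code, tested on an example dataframe. The first problem in your code is that the target array is not an np.array. You also should not include the target data in your features. Below I show how to split the training and test data manually with train_test_split. I also show how to use the wrapper cross_val_score to split, fit, and score automatically.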

import string

import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection

np.random.seed(42)  # seed NumPy's generator, which np.random.randint uses
# Create example df with alphabetic col names.
alphabet_cols = list(string.ascii_uppercase)[:26]
df = pd.DataFrame(np.random.randint(1000, size=(1000, 26)),
                  columns=alphabet_cols)
df['Target'] = df['A']
df.drop(['A'], axis=1, inplace=True)
print(df.head())
y = df.Target.values  # df['Target'] is not an np.array.
feature_cols = [i for i in list(df.columns) if i != 'Target']
X = df[feature_cols].values  # .ix and .as_matrix() were removed in pandas 1.0
# Illustrated here for manual splitting of training and testing data.
X_train, X_test, y_train, y_test = \
    model_selection.train_test_split(X, y, test_size=0.2, random_state=0)

# Initialize model.
# Note: this is a linear regression, so the scores below are R^2 values,
# not accuracies; use linear_model.LogisticRegression() for a class target.
logreg = linear_model.LinearRegression()

# Use cross_val_score to automatically split, fit, and score.
scores = model_selection.cross_val_score(logreg, X, y, cv=10)
print(scores)
print('average score: {}'.format(scores.mean()))

Output

     B    C    D    E    F    G    H    I    J    K   ...    Target
0   20   33  451    0  420  657  954  156  200  935   ...    253
1  427  533  801  183  894  822  303  623  455  668   ...    421
2  148  681  339  450  376  482  834   90   82  684   ...    903
3  289  612  472  105  515  845  752  389  532  306   ...    639
4  556  103  132  823  149  974  161  632  153  782   ...    347

[5 rows x 26 columns]
[-0.0367 -0.0874 -0.0094 -0.0469 -0.0279 -0.0694 -0.1002 -0.0399  0.0328
 -0.0409]
average score: -0.04258093018969249
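
The scores printed above are R^2 values, because that example fits a LinearRegression to a random integer target. With a classifier and a class target, the same cross_val_score call returns one accuracy per fold. A minimal sketch, assuming a synthetic binary dataset from make_classification (a stand-in, not the asker's data) and max_iter=1000 chosen only to avoid convergence warnings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)

# For classifiers the default score is accuracy, and an integer cv
# uses stratified folds.
scores = cross_val_score(clf, X, y, cv=10)
print(scores)
print('average accuracy: {:.3f}'.format(scores.mean()))
```

Each entry of scores is the accuracy on one held-out fold, so every value lies between 0 and 1.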

Answer 2

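Please take a look at the documentation of cross-validation at scikit-learn to understand it better.

Also, you are using cross_val_predict incorrectly. What it does is internally use the cv you supplied (cv=10) to split the supplied data (X_train and t_train in your case) again into train and test folds, fit the estimator on the train folds, and predict the data held out in the test fold.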
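
To make that internal splitting concrete, here is a rough sketch that reproduces the out-of-fold predictions by hand and compares them with cross_val_predict. The synthetic data, the explicit StratifiedKFold splitter, and max_iter=1000 are assumptions for illustration, not part of the original code:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for X_train / t_train (assumption for illustration).
X_train, t_train = make_classification(n_samples=300, n_features=8, random_state=0)

logreg = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=10)

# Manual version: every sample is predicted by a model that never saw it.
manual = np.empty_like(t_train)
for fit_idx, held_out_idx in cv.split(X_train, t_train):
    fold_model = clone(logreg)  # fresh, unfitted copy for each fold
    fold_model.fit(X_train[fit_idx], t_train[fit_idx])
    manual[held_out_idx] = fold_model.predict(X_train[held_out_idx])

# Library version, using the same splitter.
predicted = cross_val_predict(logreg, X_train, t_train, cv=cv)
print(np.array_equal(manual, predicted))
```

The result has exactly one prediction per row of X_train, which is why it is compared against t_train and not against the full target array.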

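Now, to use your X_test and t_test, you should first fit your estimator on the training data (cross_val_predict leaves the estimator you pass in unfitted), then use it to predict on the test data, and then compute the accuracy.

A simple code snippet describing the above (borrowing your code); please read the comments and ask if anything is unclear: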
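
A small check of that point, sketched on synthetic data (make_classification and max_iter=1000 are assumptions for illustration): after cross_val_predict the estimator still has to be fitted before it can predict on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the real data (assumption for illustration).
X, t = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, t_train, t_test = train_test_split(
    X, t, test_size=0.2, random_state=0)

logreg = LogisticRegression(max_iter=1000)
cross_val_predict(logreg, X_train, t_train, cv=10)  # works on clones

# The original estimator is still unfitted at this point.
try:
    logreg.predict(X_test)
    fitted_after_cv = True
except NotFittedError:
    fitted_after_cv = False
print('fitted after cross_val_predict:', fitted_after_cv)

# Fit on the whole training set, then evaluate once on the test set.
logreg.fit(X_train, t_train)
test_accuracy = accuracy_score(t_test, logreg.predict(X_test))
print('test accuracy: {:.3f}'.format(test_accuracy))
```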

import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# item feature matrix in X (data, features and t are defined as in the question)
X = data[features[:-1]].values  # .as_matrix() was removed in pandas 1.0
# remove first column because it is not necessary in the analysis
X = np.delete(X, 0, axis=1)
# divide in training and test set
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

# Up to here everything is good.
# You keep 20% of the data aside for testing (test_size=0.2).
# This test data should stay unseen by all of the methods below.

# define method
logreg = LogisticRegression()

# What you are doing here is correct, unless something went wrong in the dataframe operations (which apparently has been solved)
# cross-validation prediction
# This returns the cross-validated predictions for 't_train'
predicted = cross_val_predict(logreg, X_train, t_train, cv=10)
# internal working of cross_val_predict:
  #1. Get the data and estimator (logreg, X_train, t_train)
  #2. From here on, we will call X_train X_cv and t_train t_cv (because cross_val_predict doesn't know that it is our training data)
  #3. Split X_cv, t_cv into X_cv_train, X_cv_test, t_cv_train, t_cv_test by using its internal cv
  #4. Use X_cv_train, t_cv_train for fitting 'logreg'
  #5. Predict on X_cv_test (No use of t_cv_test)
  #6. Repeat steps 3 to 5 for cv=10 iterations, each time holding out a different fold for testing and training on the remaining folds.

# So here you are correctly comparing 'predicted' and 't_train'
print(metrics.accuracy_score(t_train, predicted))

# The above metric shows how our estimator 'logreg' works on the 'X_train' data. If the accuracies are very high it may be because of overfitting.

# Now what to do about the X_test and t_test above.
# The right data for the final metric is X_test and t_test.
# If you are satisfied with the accuracies on the training data, fit the estimator on the entire training data and then predict on X_test

logreg.fit(X_train, t_train)
t_pred = logreg.predict(X_test)

# Here is the final accuracy
print(metrics.accuracy_score(t_test, t_pred))
# If this accuracy is good, then your model is good.

If you have little data, or don't want to split it into training and test sets, you should use the approach suggested by @fuzzyhedge:

from sklearn import model_selection

# Use cross_val_score on all of your data
scores = model_selection.cross_val_score(logreg, X, y, cv=10)

# 'cross_val_score' works almost the same way for steps 1 to 4
  #5. t_cv_pred = logreg.predict(X_cv_test) and calculate accuracy with t_cv_test.
  #6. Repeat steps 3 to 5 for cv_iterations = 10
  #7. Return array of accuracies calculated in step 5.

# Average the returned accuracies to see the model performance
mean_score = scores.mean()

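Note: cross-validation is best used together with grid search to find the estimator parameters that perform best on the given data. For example, LogisticRegression defines many parameters. But if you use

logreg = LogisticRegression()

the model is initialized with the default parameters only. Perhaps different parameter values, such as

logreg = LogisticRegression(penalty='l1', solver='liblinear')

would work better on your data. Searching for such better parameters is what grid search does.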
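
As an illustration of such a search, here is a minimal sketch with GridSearchCV. The synthetic data, the candidate values for C, and max_iter=1000 are assumptions chosen for the example; other parameters, such as the penalty mentioned above, can be added to the grid in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real data (assumption for illustration).
X, t = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, t_train, t_test = train_test_split(
    X, t, test_size=0.2, random_state=0)

# Candidate regularization strengths (illustrative values).
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=10, scoring='accuracy')
search.fit(X_train, t_train)  # cross-validates every candidate on the training set

print(search.best_params_)
print('best cross-validated accuracy: {:.3f}'.format(search.best_score_))

# By default the best model is refit on the whole training set,
# so it can be scored once on the held-out test set.
test_accuracy = search.score(X_test, t_test)
print('test accuracy: {:.3f}'.format(test_accuracy))
```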

Now for the second part of your question, scaling, dimensionality reduction, etc. using a pipeline: you can refer to the scikit-learn documentation of Pipeline and its examples.

Feel free to contact me if you need any help.
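
One possible way to do this, sketched under assumptions (synthetic data, n_components=5, and the step names 'scale', 'pca' and 'logreg' are all made up for the example): wrap the scaler, PCA and the classifier in a Pipeline and pass the pipeline to cross_val_score, so that scaling and PCA are refit on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption for illustration).
X, t = make_classification(n_samples=400, n_features=20, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),    # fitted on the training folds only
    ('pca', PCA(n_components=5)),   # likewise, no leakage from held-out data
    ('logreg', LogisticRegression(max_iter=1000)),
])

# The whole pipeline is treated as a single estimator by cross_val_score.
scores = cross_val_score(pipe, X, t, cv=10)
print('average accuracy: {:.3f}'.format(scores.mean()))
```

A pipeline can also be passed to GridSearchCV, with parameters addressed by step name, for example 'pca__n_components' or 'logreg__C'.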
