Why is ShuffleSplit more/less random than train_test_split (with random_state=None)?
Consider the following two options:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# sklearn.__version__ 0.17.1
# python --version 3.5.2, Anaconda 4.1.1 (64-bit)
# On 0.17.1 the sklearn.model_selection module does not exist yet, and
# ShuffleSplit(n_splits=...) raises:
#   TypeError: __init__() got an unexpected keyword argument 'n_splits'
# so the older sklearn.cross_validation API is used below instead.
import numpy as np
from sklearn.datasets import load_boston
#from sklearn.model_selection import train_test_split, cross_val_score
#from sklearn.model_selection import ShuffleSplit
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.cross_validation import ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor
# define feature matrix and target variable
boston = load_boston()
X, y = boston.data, boston.target
# Create Algorithm Object (Gradient Boosting)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)
#====================================================
# Option B
#====================================================
# sklearn >= 0.18 equivalent:
# shuffle = ShuffleSplit(n_splits=10, train_size=0.75, random_state=0)
shuffle = ShuffleSplit(n=X.shape[0], n_iter=10, train_size=0.75, random_state=0)
cross_val = cross_val_score(gbr, X, y, cv=shuffle)
print('------------------------------------------')
print('Individual performance: ', cross_val)
print('===============================================')
print('Option B: Average performance: ', cross_val.mean())
print('===============================================')
# --> different performance in every iteration because of different training
# and test sets.
#====================================================
# Option C
#====================================================
individual_results = []
iterations = np.arange(1, 11)
for i in iterations:
    # randomly split the data into train and test sets
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25,
                                                    random_state=None)
    # train gbr on a fresh random split in each of the 10 iterations
    gbr.fit(Xtrain, ytrain)
    score = gbr.score(Xtrain, ytrain)
    individual_results.append(score)
avg_score = sum(individual_results) / len(iterations)
print('------------------------------------------')
print(individual_results)
print('===============================================')
print('Option C: Average Performance: ', avg_score)
print('===============================================')
Here is a copy of the output:
Individual performance: [ 0.77535372 0.81760604 0.87146377 0.94041114 0.92648961 0.87761488
0.82843891 0.81833855 0.90167889 0.90014986]
===============================================
Option B: Average performance: 0.865754537049
===============================================
------------------------------------------
[0.98094508160609573, 0.97773541952198795, 0.98076500920740906, 0.98313150025465956, 0.98097867267357952, 0.97918425360465322, 0.97923641784508919, 0.9785058355467865, 0.98173521302711486, 0.97866493105257402]
===============================================
Option C: Average Performance: 0.980088233434
===============================================
Can anyone help explain why the ShuffleSplit function in Option B shows more random results than the train_test_split function in Option C (with random_state=None)?
The scores in Option C were computed on Xtrain rather than Xtest. With

score = gbr.score(Xtest, ytest)

the scores are now:

[0.806, 0.906, 0.903, 0.836, 0.871, 0.920, 0.902, 0.901, 0.914, 0.916]
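
For reference, here is a minimal sketch of the corrected Option C loop that scores on the held-out test set; it assumes the same sklearn 0.17.x environment as the question (on 0.18+ the imports would come from sklearn.model_selection instead):

import numpy as np
from sklearn.datasets import load_boston
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

boston = load_boston()
X, y = boston.data, boston.target
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)

test_scores = []
for i in range(10):
    # random_state=None draws a fresh random split on every iteration
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25,
                                                    random_state=None)
    gbr.fit(Xtrain, ytrain)
    # evaluate on the held-out test set, not the training data
    test_scores.append(gbr.score(Xtest, ytest))

print('Individual test scores:', test_scores)
print('Average test score:', np.mean(test_scores))

Scored this way, both options show a comparable spread: ShuffleSplit with random_state=0 is not "more random" than train_test_split with random_state=None; it simply evaluates on ten different test sets, whereas the original Option C evaluated the model on the very data it was trained on, so every score landed near 0.98.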