Why is the score of the classifier recommended by TPOT lower than LinearSVC?
So I found that LinearSVC is among TPOT's classifiers, and I have been using it for my model and getting a fairly good score (0.95 with sklearn's score). Here is my code:
import numpy as np
from sklearn import model_selection, preprocessing
from tpot import TPOTClassifier

def process(stock):
    # format_data and create_labels are my own helper functions
    df = format_data(stock)
    df[['HSI Volume', 'HSI', stock]] = df[['HSI Volume', 'HSI', stock]].pct_change()
    # shift future value to current date
    df[stock+'_future'] = df[stock].shift(-1)
    df.replace([-np.inf, np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)
    df['class'] = list(map(create_labels, df[stock], df[stock+'_future']))
    X = np.array(df.drop(['class', stock+'_future'], axis=1))  # drop label columns
    # X = preprocessing.scale(X)
    y = np.array(df['class'])
    X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
    tpot = TPOTClassifier(generations=10, verbosity=2)
    fitting = tpot.fit(X_train, y_train)      # fit() returns None here, hence the (None, ...) tuple below
    prediction = tpot.score(X_test, y_test)   # accuracy on the held-out 20%
    tpot.export('pipeline.py')
    return fitting, prediction
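For comparison, here is a minimal sketch of how the LinearSVC baseline mentioned above might be scored, assuming the same X and y prepared in process(); the helper name score_linear_svc is hypothetical, and it mirrors the 80/20 split and default accuracy scoring used by tpot.score():

import numpy as np
from sklearn import model_selection
from sklearn.svm import LinearSVC

def score_linear_svc(X, y):
    # Hypothetical helper: same 80/20 split as process() above,
    # scored with LinearSVC's default accuracy metric.
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.2)
    clf = LinearSVC()
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)   # mean accuracy on the held-out 20%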
After ten generations, TPOT recommended GaussianNB, which scores only around 0.77 with sklearn's score.
Generation 1 - Current best internal CV score: 0.5322255571
Generation 2 - Current best internal CV score: 0.55453535828
Generation 3 - Current best internal CV score: 0.55453535828
Generation 4 - Current best internal CV score: 0.55453535828
Generation 5 - Current best internal CV score: 0.587469903893
Generation 6 - Current best internal CV score: 0.587469903893
Generation 7 - Current best internal CV score: 0.597194474469
Generation 8 - Current best internal CV score: 0.597194474469
Generation 9 - Current best internal CV score: 0.597194474469
Generation 10 - Current best internal CV score: 0.597194474469
Best pipeline: GaussianNB(RBFSampler(input_matrix, 0.22))
(None, 0.54637855142056824)
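The exported pipeline.py presumably corresponds to that best pipeline; a rough sketch of the equivalent scikit-learn code, assuming TPOT's usual make_pipeline translation (only the gamma value 0.22 comes from the output above):

from sklearn.kernel_approximation import RBFSampler
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# GaussianNB(RBFSampler(input_matrix, 0.22)) expressed as a sklearn pipeline:
# map the features through an approximate RBF kernel, then fit GaussianNB.
exported_pipeline = make_pipeline(
    RBFSampler(gamma=0.22),
    GaussianNB(),
)
# exported_pipeline.fit(X_train, y_train)
# exported_pipeline.score(X_test, y_test)   # ~0.55 on the held-out set per the output above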
I am curious why TPOT does not recommend LinearSVC even though it scores higher. Is it because the different scoring mechanisms lead to different optimal classifiers?
Thanks a lot!
My personal guess is that TPOT is stuck on a local maximum. Maybe changing the test size, running more generations, or scaling the data would help. Also, could you rerun TPOT and see whether you get the same result? (My guess is no, since genetic optimization is non-deterministic due to mutation.)
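A minimal sketch of those suggestions, assuming the same X and y prepared in process() above; the helper name rerun_tpot and the values generations=50, population_size=50, random_state=42 are illustrative, not tuned:

from sklearn import model_selection, preprocessing
from tpot import TPOTClassifier

def rerun_tpot(X, y):
    # Hypothetical helper re-using the data prepared in process() above.
    X = preprocessing.scale(X)                     # scale features (was commented out)
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        X, y, test_size=0.2)
    # More generations, and a fixed random_state so a rerun is comparable;
    # without a seed, genetic optimization gives different pipelines each run.
    tpot = TPOTClassifier(generations=50, population_size=50,
                          random_state=42, verbosity=2)
    tpot.fit(X_train, y_train)
    return tpot.score(X_test, y_test)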