Analysing SVM regression results
I used SVM regression to predict rainfall. The monthly rainfall from JAN to DEC is x, and the annual rainfall is y. An 80:20 split was used to separate the training and test data.
import math
import time
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

t0 = time.time()
clf = SVR(gamma='auto', C=0.1, epsilon=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
testScore = math.sqrt(mean_squared_error(y_test, y_pred))
print('Test Score: %.2f RMSE' % testScore)
time_taken = time.time() - t0
print('Time taken', time_taken)
df_SVR = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df_SVR)
When I run the code, I get a score of 412.72 RMSE, and the predicted value is the same in every case.
Why is my RMSE so large, and why are all the predicted values identical?
The input samples need feature scaling (e.g. a min-max scaler or standard scaler) before being fed to the training algorithm.
The magnitudes of your different features probably vary widely across your inputs.
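As a minimal sketch of this idea (with synthetic data standing in for your rainfall values, so the numbers and shapes here are assumptions): scaling both the inputs and the target puts them on the unit scale that SVR's `C` and `epsilon` implicitly assume, which is what keeps the predictions from collapsing to a single value.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.integers(0, 500, size=(40, 12)).astype(float)  # hypothetical monthly rainfall
y = X.sum(axis=1)                                      # annual total, on a ~3000 scale

# Scale X inside the pipeline and y via the target transformer, so that
# C and epsilon act on unit-variance values rather than raw millimetres.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(C=10, epsilon=0.1)),
    transformer=StandardScaler(),
)
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```

Without the two scalers, a target in the thousands dwarfs a small `C`, and the fitted model degenerates to a near-constant prediction.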
First, I don't really see why you chose gamma='auto' as one of your hyperparameters; if you drop it and let the model use its default gamma, it may perform better.
Also, a small C combined with a small epsilon can work against each other, so I think it is a good idea to balance these two hyperparameters.
Here I made some random data to figure out how to handle this; I hope it helps you with your problem.
Code:
import numpy as np
from sklearn.svm import SVR
# make data
month_rain = np.random.randint(1000, 5000, size=(10,12))
X = month_rain
y = np.random.randint(3000, 4000, size=(10,1))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
svm_reg = SVR(gamma='auto', C=0.1, epsilon=0.2) # original model
svm_reg2 = SVR(C=0.1, epsilon=0.2) # get rid of gamma
svm_reg3 = SVR(C=100, epsilon=0.2) # get rid of gamma and use larger C
svm_reg4 = SVR(gamma='auto', C=100, epsilon=0.2) # use larger C
svm_reg.fit(X_train, y_train)
svm_reg2.fit(X_train, y_train)
svm_reg3.fit(X_train, y_train)
svm_reg4.fit(X_train, y_train)
# check out the model score in the training dataset.
print(svm_reg.score(X_train, y_train))
print(svm_reg2.score(X_train, y_train))
print(svm_reg3.score(X_train, y_train))
print(svm_reg4.score(X_train, y_train))
# check out the result.
y_pred = svm_reg.predict(X_test)
y_pred2 = svm_reg2.predict(X_test)
y_pred3 = svm_reg3.predict(X_test)
y_pred4 = svm_reg4.predict(X_test)
print(y_test)
print(y_pred.reshape(-1,1))
print(y_pred2.reshape(-1,1))
print(y_pred3.reshape(-1,1))
print(y_pred4.reshape(-1,1))
Output:
score:
-0.05514476528005918
-0.055253731765687375
0.40714376538337693
0.47055666976833854
result:
origin:
[[3690]
[3355]
[3916]]
model 1:
[[3346.]
[3346.]
[3346.]]
model 2:
[[3345.95909456]
[3345.99648151]
[3345.933001 ]]
model 3:
[[3305.09456122]
[3342.48150808]
[3279.00100083]]
model 4:
[[3346.]
[3346.]
[3346.]]
So I suggest you use a larger C, which loosens the constraint on your model, and it will perform better.
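Rather than trying C values one by one as above, the same comparison can be automated. This is a sketch with the same kind of synthetic data (the grid values and fold count are my own choices, not anything from your setup), using `GridSearchCV` to search C, epsilon, and gamma together with cross-validated RMSE as the criterion:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.integers(1000, 5000, size=(60, 12)).astype(float)
y = X.sum(axis=1) / 1000.0  # keep the target on a modest scale

pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {
    "svr__C": [0.1, 1, 10, 100],
    "svr__epsilon": [0.01, 0.1, 0.5],
    "svr__gamma": ["scale", "auto"],
}
# 3-fold CV over every combination, scored by (negated) RMSE
search = GridSearchCV(pipe, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

On your real rainfall data this picks the C/epsilon/gamma balance for you instead of guessing, and the cross-validated RMSE is a more honest estimate than a single train/test split.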