Analysing SVM regression results

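I used SVM regression to predict rainfall. The monthly rainfall for JAN through DEC is X, and the annual rainfall is y. An 80:20 split was used to divide the data into training and test sets.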

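    import math
    import time

    import pandas as pd
    from sklearn.metrics import mean_squared_error
    from sklearn.svm import SVR

    # X_train, X_test, y_train, y_test come from the 80:20 split described above
    t0 = time.time()  # start the timer
    clf = SVR(gamma='auto', C=0.1, epsilon=0.2)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    testScore = math.sqrt(mean_squared_error(y_test, y_pred))
    print('Test Score: %.2f RMSE' % (testScore))
    time_taken = time.time() - t0
    print('Time taken', time_taken)
    df_SVR = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
    print(df_SVR)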

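When I run the code, I get a score of 412.72 RMSE, and the predicted value is the same in every case.

Why is my RMSE so large, and why are all the predicted values the same?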

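Input samples need feature scaling before they are fed to the training algorithm, for example with a min-max scaler or a standard scaler. The magnitudes of the different features in your input may vary a lot.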
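As a minimal sketch of what that could look like: the real rainfall data is not shown, so the data below is made up (annual rainfall taken as the sum of twelve random monthly values), and the values of C and epsilon are only illustrative. Putting the scaler in a pipeline means it is fitted on the training data only. Standardising the target with TransformedTargetRegressor is an extra step I am assuming is helpful here, because it puts C and epsilon on the scale of standard deviations rather than millimetres.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical stand-in data: 100 years x 12 monthly totals, annual = sum of months
rng = np.random.default_rng(0)
X = rng.uniform(0, 400, size=(100, 12))
y = X.sum(axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale the features inside the pipeline (fitted on the training data only)
scaled_svr = make_pipeline(StandardScaler(), SVR(C=10.0, epsilon=0.1))
# Standardise the target as well, and map predictions back to the original units
model = TransformedTargetRegressor(regressor=scaled_svr,
                                   transformer=StandardScaler())
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print('Test Score: %.2f RMSE' % rmse)
print('Distinct predictions:', np.unique(np.round(y_pred, 2)).size)
```

If the predictions are still all identical after scaling, the hyperparameters are the next thing to look at.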

First, I don't really understand why you chose gamma='auto' as one of your hyperparameters. If you drop it and let the model use its default gamma, it may perform better.
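A rough way to see why gamma matters so much with unscaled inputs, assuming the default RBF kernel and features in the thousands (the data here is random and purely illustrative): gamma='auto' uses 1 / n_features, while the default gamma='scale' uses 1 / (n_features * X.var()). The sketch computes both by hand and compares the kernel similarity between different samples.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical unscaled data: 10 samples, 12 features in the thousands
rng = np.random.default_rng(0)
X = rng.integers(1000, 5000, size=(10, 12)).astype(float)

n_features = X.shape[1]
gamma_auto = 1.0 / n_features                # value used by gamma='auto'
gamma_scale = 1.0 / (n_features * X.var())   # value used by gamma='scale'

# Boolean mask that selects pairs of different samples (off-diagonal entries)
off_diag = ~np.eye(len(X), dtype=bool)
K_auto = rbf_kernel(X, gamma=gamma_auto)
K_scale = rbf_kernel(X, gamma=gamma_scale)

print('Largest similarity between two different samples')
print('  gamma=auto :', K_auto[off_diag].max())
print('  gamma=scale:', K_scale[off_diag].max())
```

With gamma='auto' the similarity between any two different samples underflows to zero, so a new sample looks unrelated to every training sample and the model can only return its intercept. That would explain a prediction that is the same in every case.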

Also, a small C and a small epsilon can work against each other, so I think it is a good idea to balance these two hyperparameters.
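One way to balance the two is a small cross-validated grid search. This is only a sketch: the data is made up (annual rainfall as the sum of twelve random monthly values), and the grid values are placeholders that would need adjusting to the real data. Note that epsilon is in the units of the target here, because the target is not scaled.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical stand-in data: 100 years x 12 monthly totals, annual = sum of months
rng = np.random.default_rng(0)
X = rng.uniform(0, 400, size=(100, 12))
y = X.sum(axis=1)

# make_pipeline names each step after its lowercased class name, hence 'svr__'
pipe = make_pipeline(StandardScaler(), SVR())
param_grid = {
    'svr__C': [0.1, 1, 10, 100, 1000],
    'svr__epsilon': [0.1, 1, 10, 50],
}

# 5-fold cross-validation, scored by (negated) RMSE so that higher is better
search = GridSearchCV(pipe, param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X, y)

print('Best parameters:', search.best_params_)
print('Cross-validated RMSE: %.2f' % -search.best_score_)
```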

Here I made some random data to try to work out how to handle it. I hope it helps you solve your problem.

Code:

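import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# make random data: 10 samples with 12 monthly values each
month_rain = np.random.randint(1000, 5000, size=(10,12))
X = month_rain
# random targets (not derived from X), stored as a column vector
y = np.random.randint(3000, 4000, size=(10,1))

# default split: 75% training, 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y)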

svm_reg = SVR(gamma='auto', C=0.1, epsilon=0.2) # original model
svm_reg2 = SVR(C=0.1, epsilon=0.2) # get rid of gamma
svm_reg3 = SVR(C=100, epsilon=0.2) # get rid of gamma and use larger C
svm_reg4 = SVR(gamma='auto', C=100, epsilon=0.2) # use larger C

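# SVR expects a 1-D target, so flatten the column vector before fitting
svm_reg.fit(X_train, y_train.ravel())
svm_reg2.fit(X_train, y_train.ravel())
svm_reg3.fit(X_train, y_train.ravel())
svm_reg4.fit(X_train, y_train.ravel())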

# R^2 score of each model on the training dataset
print(svm_reg.score(X_train, y_train))
print(svm_reg2.score(X_train, y_train))
print(svm_reg3.score(X_train, y_train))
print(svm_reg4.score(X_train, y_train))

# check out the result.
y_pred = svm_reg.predict(X_test)
y_pred2 = svm_reg2.predict(X_test)
y_pred3 = svm_reg3.predict(X_test)
y_pred4 = svm_reg4.predict(X_test)
print(y_test)
print(y_pred.reshape(-1,1))
print(y_pred2.reshape(-1,1))
print(y_pred3.reshape(-1,1))
print(y_pred4.reshape(-1,1))

Output:

score:
-0.05514476528005918
-0.055253731765687375
0.40714376538337693
0.47055666976833854

result:
origin:
[[3690]
 [3355]
 [3916]]

model 1:
[[3346.]
 [3346.]
 [3346.]]

model 2:
[[3345.95909456]
 [3345.99648151]
 [3345.933001  ]]

model 3:
[[3305.09456122]
 [3342.48150808]
 [3279.00100083]]

model 4:
[[3346.]
 [3346.]
 [3346.]]

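So I suggest using a larger C, which penalizes training errors more heavily; the model should then perform better. Note that model 4 (larger C with gamma='auto') still predicts 3346 for every test sample despite having the highest training score, so the larger C only helps together with the default gamma, as in model 3.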