Finding the mean squared error for a linear regression in python (with scikit learn)

Question

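I am trying to do a simple linear regression in python, with the x-variable being the word count of a project description and the y-value being the funding speed in days.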

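I am a bit confused because the root mean square error (RMSE) is 13.77 for the test data and 13.88 for the training data. First, shouldn't the RMSE be between 0 and 1? And second, shouldn't the RMSE for the test data be higher than for the training data? So I guess I did something wrong, but I am not sure where the mistake is.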

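Also, I need to know the weight coefficient of the regression, but unfortunately I don't know how to print it, as it is kind of hidden in the sklearn method. Can anyone help here?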

This is what I have so far:

import numpy as np
import matplotlib.pyplot as plt
import sqlite3
from sklearn.model_selection import train_test_split
from sklearn import linear_model

con = sqlite3.connect('database.db')
cur = con.cursor()

# y-variable in regression is funding speed ("DAYS_NEEDED")    
cur.execute("SELECT DAYS_NEEDED FROM success")
y = cur.fetchall()                  # list of tuples
y = np.array([i[0] for i in y])     # list of int   # y.shape = (1324476,)

# x-variable in regression is the project description length ("WORD_COUNT")
cur.execute("SELECT WORD_COUNT FROM success")
x = cur.fetchall()
x = np.array([i[0] for i in x])     # list of int   # x.shape = (1324476,)

# Get the train and test data split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit a model
lm = linear_model.LinearRegression()
x_train = x_train.reshape(-1, 1)    # new shape: (1059580, 1)
y_train = y_train.reshape(-1, 1)    # new shape: (1059580, 1)
model = lm.fit(x_train, y_train)
x_test = x_test.reshape(-1, 1)      # new shape: (264896, 1)
predictions_test = lm.predict(x_test)
predictions_train = lm.predict(x_train)

print("y_test[5]: ", y_test[5])     # 14
print("predictions[5]: ", predictions_test[5]) # [ 12.6254537]

# Calculate the root mean square error (RMSE) for test and training data
N = len(y_test)
rmse_test = np.sqrt(np.sum((np.array(y_test).flatten() - np.array(predictions_test).flatten())**2)/N)
print("RMSE TEST: ", rmse_test)     # 13.770731326

N = len(y_train)
rmse_train = np.sqrt(np.sum((np.array(y_train).flatten() - np.array(predictions_train).flatten())**2)/N)
print("RMSE train: ", rmse_train)   # 13.8817814595

Any help is much appreciated! Thanks!

Answer 1

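RMSE has the same unit as the dependent variable, so it is not confined to the range 0 to 1 and there is no absolute threshold for a good or bad value. It has to be judged against the spread of the variable you are trying to predict. If that variable varies from 0 to 100, an RMSE of 99 is terrible, while an RMSE of 5 is spectacular. But if the RMSE is 5 for data that ranges from 1 to 10, then you have a problem! I hope this drives the point home.
Since the RMSE of your train and test data are similar, pat yourself on the back! You've actually done a good job! If the test RMSE is noticeably higher than the train RMSE, you have probably overfit somewhat. A test RMSE that is slightly lower than the train RMSE, as in your case, can happen simply because of how the random split fell, and is not a sign of a mistake.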
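To make the "judge it against the range" point concrete, here is a minimal sketch using made-up numbers (not the asker's data). The helper names `rmse` and `range_normalized_rmse` are hypothetical, introduced only for illustration, and dividing by the range is just one common way to put the RMSE on a unitless scale.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error, in the same unit as y_true."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def range_normalized_rmse(y_true, y_pred):
    """RMSE divided by the observed range of y_true (unitless)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    spread = y_true.max() - y_true.min()
    return rmse(y_true, y_pred) / spread


# Made-up targets: one spanning 0..100, one spanning 1..10.
y_wide = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
y_narrow = np.array([1.0, 3.0, 5.0, 8.0, 10.0])

# Every prediction is off by exactly 5 units in both cases.
errors = np.array([5.0, -5.0, 5.0, -5.0, 5.0])
pred_wide = y_wide + errors
pred_narrow = y_narrow + errors

print("RMSE wide:  ", rmse(y_wide, pred_wide))
print("RMSE narrow:", rmse(y_narrow, pred_narrow))
print("Relative to range, wide:  ", range_normalized_rmse(y_wide, pred_wide))
print("Relative to range, narrow:", range_normalized_rmse(y_narrow, pred_narrow))
```

Both cases have the same RMSE of 5, but relative to the range of the target it is small in the first case and more than half the range in the second. For the question's data, the same comparison would be made against the spread of DAYS_NEEDED.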

Going by what Umang said in the comments, you can use model.coef_ and model.intercept_ to print the weights that your model has computed as optimal.
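A minimal sketch of reading the fitted weights, assuming synthetic data generated from y = 2x + 3 rather than the database from the question. The variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data: y = 2 * x + 3 (an assumption for illustration).
x = np.arange(10, dtype=float).reshape(-1, 1)   # shape (10, 1)
y = 2.0 * x.ravel() + 3.0                       # shape (10,)

model = LinearRegression().fit(x, y)

# With a 1-D target, coef_ has shape (n_features,) and intercept_ is a scalar.
slope = float(model.coef_[0])
intercept = float(model.intercept_)
print("slope:", slope)
print("intercept:", intercept)

# With a 2-D target, as in the question's y_train.reshape(-1, 1),
# coef_ has shape (n_targets, n_features) and intercept_ has shape (n_targets,).
model_2d = LinearRegression().fit(x, y.reshape(-1, 1))
print("coef_ shape:", model_2d.coef_.shape)
print("intercept_ shape:", model_2d.intercept_.shape)
```

In the question's code the target was reshaped to a column, so the single slope would be read as model.coef_[0][0] and the intercept as model.intercept_[0].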

python

linear-regression

mse

scikit-learn