Fit and predict a linear regression for each row of a database in Python

Question

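Good evening everyone. I'm new to Python and I'm trying to learn by replicating a model I have in Excel.

I need to replicate the "TREND" function to fit a small linear model between two extreme points, say

A = (1, 0.15), B = (5, 0.2)

and predict at a given value (say 4.2).

For the purposes of this code, I need to fit one model for each row of the database. The x values are always x_1 = 1 and x_2 = 5, while the y values differ from row to row.

I tried using LinearRegression() from the sklearn.linear_model package together with model.predict, like this: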

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = {'New_x':[5, 2.1, 4.5, 3.0],
        'X1':[1, 1, 1, 1],
        'X2':[5, 5, 5, 5],
        'Y1':[0.15, 0.7, 1.35, 0.2],
        'Y2':[0.2, 0.85, 1.55, 0.4]}  

df=pd.DataFrame(data,index=["1","2","3","4"])

model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
prediction=model.predict(df["New_x"].values.reshape(-1,1))

But I get this error:

    ValueError                                Traceback (most recent call last)
<ipython-input-88-da83cb57bf4a> in <module>()
     18 
     19 model=LinearRegression().fit(df[["X1","X2"]],df[["Y1","Y2"]])
---> 20 prediction=model.predict(df["New_x"].values.reshape(-1,1))
     21 
     22 #model = LinearRegression().fit(SEC_ERBA_sample[["Vertex1","Vertex2"]], SEC_ERBA_sample[["SENIOR_1Y","SENIOR_5Y"]])

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
    254             Returns predicted values.
    255         """
--> 256         return self._decision_function(X)
    257 
    258     _preprocess_data = staticmethod(_preprocess_data)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
    239         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    240         return safe_sparse_dot(X, self.coef_.T,
--> 241                                dense_output=True) + self.intercept_
    242 
    243     def predict(self, X):

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    138         return ret
    139     else:
--> 140         return np.dot(a, b)
    141 
    142 

ValueError: shapes (4,1) and (2,2) not aligned: 1 (dim 1) != 2 (dim 0)

So I assume LinearRegression().fit is fitting a single model based on the column values. Is there a way to fit and predict a linear regression for each row?

Answer 1

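I think this is a simple typo in the code, but it may stem from a deeper conceptual problem, so I'll try to give a broader answer. An estimator's fit method (here LinearRegression.fit) trains a model by associating a set of features X with a set of ground-truth values y. In your example you are training a regression with two targets (effectively two multivariate models), estimating Y1 and Y2 from X1 and X2: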

model = LinearRegression().fit(df[["X1","X2"]], df[["Y1","Y2"]])

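So the model learns to estimate those two variables from two other variables. At prediction time it needs exactly the same variables (X1 and X2) to predict the values of interest. Here New_x1 and New_x2 are placeholders for new values of those two features: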

predictions = model.predict(df[["New_x1", "New_x2"]])

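If the New_x2 information is not available at test (prediction) time, you either have to estimate it as well or drop it from training entirely.

A simple abstract example: if a model is trained to estimate your preferred T-shirt size from your height and weight, you need to know both height and weight at test (prediction) time to get a correct size estimate.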
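As a minimal sketch of the point above, the snippet below refits the model from the question and inspects its coefficient matrix. scikit-learn treats each row as a sample and each column as a feature, so the fitted model expects two features per prediction, which is why a single New_x column is rejected. The variable names coef_shape and mismatch_raised are illustrative only, and the exact wording of the error depends on the scikit-learn version.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "New_x": [5, 2.1, 4.5, 3.0],
    "X1": [1, 1, 1, 1],
    "X2": [5, 5, 5, 5],
    "Y1": [0.15, 0.7, 1.35, 0.2],
    "Y2": [0.2, 0.85, 1.55, 0.4],
})

# Rows are samples, columns are features: 4 samples, 2 features, 2 targets.
X = df[["X1", "X2"]].to_numpy()
y = df[["Y1", "Y2"]].to_numpy()
model = LinearRegression().fit(X, y)

# coef_ has shape (n_targets, n_features), the (2, 2) seen in the traceback.
coef_shape = model.coef_.shape
print(coef_shape)

# Passing one column supplies 1 feature where 2 are expected.
try:
    model.predict(df["New_x"].to_numpy().reshape(-1, 1))
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```

In other words, this call fits one model across all rows rather than one model per row, so a per-row fit needs a different layout of the data.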

Answer 2

I found a solution using iterrows(). It is still incomplete because I can't save the output, but I think I'll open a separate, more focused question for that.

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

data = {'New_x':[5, 2.1, 4.5, 3.0],
        'X1':[1., 1, 1, 1],
        'X2':[5., 5, 5, 5],
        'Y1':[0.15, 0.7, 1.35, 0.2],
        'Y2':[0.2, 0.85, 1.55, 0.4]}  

df=pd.DataFrame(data,index=["1","2","3","4"])

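This last block runs the linear regression row by row. Using iterrows() is generally discouraged, because many operations can be done in other ways (including vectorization), but in this case I did not find an alternative solution.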

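for index, row in df.iterrows():
    # One feature (x) with two samples, so X must have shape (2, 1)
    X = np.array([row["X1"], row["X2"]]).reshape(-1, 1)
    y = np.array([row["Y1"], row["Y2"]])
    model = LinearRegression().fit(X, y)
    # predict() also expects a 2D array: one sample, one feature
    print(model.predict(np.array([[row["New_x"]]])))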
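Since each row's model is fitted on exactly two points, the regression line is simply the straight line through (X1, Y1) and (X2, Y2), so one possible vectorized alternative is to compute that line directly with pandas column arithmetic. The sketch below is not the original author's method. It assumes X1 and X2 differ in every row, and the column name "prediction" is made up for illustration. It also stores the result in the DataFrame, which covers the "can't save the output" part.

```python
import pandas as pd

df = pd.DataFrame({
    "New_x": [5, 2.1, 4.5, 3.0],
    "X1": [1., 1, 1, 1],
    "X2": [5., 5, 5, 5],
    "Y1": [0.15, 0.7, 1.35, 0.2],
    "Y2": [0.2, 0.85, 1.55, 0.4],
}, index=["1", "2", "3", "4"])

# Slope of the line through (X1, Y1) and (X2, Y2), computed for all rows at once.
# Assumes X1 != X2 in every row; otherwise this divides by zero.
slope = (df["Y2"] - df["Y1"]) / (df["X2"] - df["X1"])

# Evaluate each row's line at New_x and keep the result as a new column.
df["prediction"] = df["Y1"] + slope * (df["New_x"] - df["X1"])
print(df)
```

With only two points per row, an ordinary least-squares fit passes through both of them, so these values should agree with the per-row LinearRegression predictions up to floating-point error.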
