Line fitting using LinearRegression in PySpark gives wildly different coefficients
I have a dataframe like this:
+---------+------------------+
|rownumber| Moving_Ratio|
+---------+------------------+
| 1000|105.67198820168865|
| 1001|105.65729748456914|
| 1002| 105.6426671752822|
| 1003|105.62808965618223|
| 1004|105.59623035662119|
| 1005|105.52385366516299|
| 1006|105.44762361744378|
| 1007|105.35977134665733|
| 1008|105.25685407339793|
| 1009|105.16307473993363|
| 1010|105.06600545864703|
| 1011|104.96056753478364|
| 1012|104.84525664217107|
| 1013| 104.7401615868953|
| 1014| 104.6283459710509|
| 1015|104.53484736833259|
| 1017|104.43492576734955|
| 1019|104.33599903547659|
| 1020|104.24640223269283|
| 1021|104.15275303890549|
+---------+------------------+
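There are 10k rows; I have truncated it here for the sample view.
The data is by no means linear; it looks like this:
However, I'm not concerned with a perfect fit to every data point. I'm basically looking for a line that captures the direction of the curve, so I can get its slope, like the green line in the image generated by statistical software.
The feature column I'm trying to fit a line to is Moving_Ratio.
The minimum and maximum values of Moving_Ratio are: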
+-----------------+------------------+
|min(Moving_Ratio)| max(Moving_Ratio)|
+-----------------+------------------+
|26.73629202745194|121.84100616620908|
+-----------------+------------------+
I tried to create a simple linear model with the following code:
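from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Moving_Ratio is assembled into the feature vector; rownumber is used as the label
vect_assm = VectorAssembler(inputCols=['Moving_Ratio'], outputCol='features')
df_vect = vect_assm.transform(df)
lir = LinearRegression(featuresCol='features', labelCol='rownumber', maxIter=50,
                       regParam=0.3, elasticNetParam=0.8)
model = lir.fit(df_vect)
Predictions = model.transform(df_vect)
coeff = model.coefficients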
When I look at the predictions, I get values that seem far from the original data for those row numbers.
Predictions.show()
+---------+------------------+--------------------+-----------------+
|rownumber| Moving_Ratio| features| prediction|
+---------+------------------+--------------------+-----------------+
| 1000|105.67198820168865|[105.67198820168865]|8935.419272488462|
| 1001|105.65729748456914|[105.65729748456914]| 8934.20373303444|
| 1002| 105.6426671752822| [105.6426671752822]|8932.993191845864|
| 1003|105.62808965618223|[105.62808965618223]|8931.787018623438|
| 1004|105.59623035662119|[105.59623035662119]|8929.150916159619|
| 1005|105.52385366516299|[105.52385366516299]| 8923.1623232745|
| 1006|105.44762361744378|[105.44762361744378]|8916.854895949407|
| 1007|105.35977134665733|[105.35977134665733]| 8909.58582253401|
| 1008|105.25685407339793|[105.25685407339793]|8901.070240542358|
| 1009|105.16307473993363|[105.16307473993363]|8893.310750051145|
| 1010|105.06600545864703|[105.06600545864703]|8885.279042666287|
| 1011|104.96056753478364|[104.96056753478364]| 8876.55489697866|
| 1012|104.84525664217107|[104.84525664217107]|8867.013842017961|
| 1013| 104.7401615868953| [104.7401615868953]|8858.318065966234|
| 1014| 104.6283459710509| [104.6283459710509]|8849.066217228752|
| 1015|104.53484736833259|[104.53484736833259]|8841.329954963563|
| 1017|104.43492576734955|[104.43492576734955]|8833.062240915566|
| 1019|104.33599903547659|[104.33599903547659]|8824.876844336828|
| 1020|104.24640223269283|[104.24640223269283]|8817.463424838508|
| 1021|104.15275303890549|[104.15275303890549]| 8809.71470236567|
+---------+------------------+--------------------+-----------------+
from pyspark.sql.functions import min, max
Predictions.select(min('prediction'), max('prediction')).show()
+-----------------+------------------+
| min(prediction)| max(prediction)|
+-----------------+------------------+
|2404.121157489531|10273.276308929268|
+-----------------+------------------+
coeff[0]
82.74200940195973
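The minimum and maximum of the predictions are completely outside the range of the input data.
What am I doing wrong?
Any help would be appreciated.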
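When you initialize the LinearRegression object, featuresCol should point to the features (independent variables) and labelCol to the label (dependent variable). Your code has them the other way round: it fits rownumber as a function of Moving_Ratio, so the predictions are in row-number units (roughly 2404 to 10273) and the coefficient 82.74 is a change in row number per unit of Moving_Ratio. Since you want to predict Moving_Ratio, build the feature vector from rownumber with VectorAssembler(inputCols=['rownumber'], outputCol='features'), keep featuresCol='features', and set labelCol='Moving_Ratio'. Note that featuresCol must be a vector column, so the raw rownumber column cannot be passed to it directly.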
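To make the effect of swapping feature and label concrete, here is a minimal plain-Python sketch, not Spark code. It runs an ordinary least-squares fit in both directions on the 20 sample rows shown in the question. The helper name ols_fit and the variable names are my own, chosen for illustration, and the fit is unregularized, so the numbers will not match what Spark returns with regParam=0.3 on the full 10k rows.

```python
# Illustrative only: ordinary least squares on the 20 sample rows from the
# question. ols_fit is a hypothetical helper, not part of any Spark API.

rownumber = [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009,
             1010, 1011, 1012, 1013, 1014, 1015, 1017, 1019, 1020, 1021]
moving_ratio = [
    105.67198820168865, 105.65729748456914, 105.6426671752822,
    105.62808965618223, 105.59623035662119, 105.52385366516299,
    105.44762361744378, 105.35977134665733, 105.25685407339793,
    105.16307473993363, 105.06600545864703, 104.96056753478364,
    104.84525664217107, 104.7401615868953, 104.6283459710509,
    104.53484736833259, 104.43492576734955, 104.33599903547659,
    104.24640223269283, 104.15275303890549,
]


def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Intended direction: rownumber is the feature, Moving_Ratio is the label.
slope_ratio, intercept_ratio = ols_fit(rownumber, moving_ratio)

# Direction used in the question: Moving_Ratio is the feature, rownumber is the label.
slope_row, intercept_row = ols_fit(moving_ratio, rownumber)

print("Moving_Ratio per row:", slope_ratio)
print("rows per unit of Moving_Ratio:", slope_row)
print("predicted Moving_Ratio at row 1010:", slope_ratio * 1010 + intercept_ratio)
```

The first slope is in Moving_Ratio units per row, which is the trend-line slope the question asks for, and its predictions land in the Moving_Ratio range. The second slope is in rows per unit of Moving_Ratio, so both the coefficient and the predictions come out on the row-number scale.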