Pandas: Compute a new column based on linear regression of previous rows
My dataframe looks like this:
date Temperature consumption
0 2020-12-01 8.0125 109.046450
1 2020-12-02 6.1500 104.494946
2 2020-12-03 5.9375 117.011582
3 2020-12-04 5.4750 109.615388
4 2020-12-05 3.8500 142.803438
5 2020-12-06 2.0500 158.638879
6 2020-12-07 0.1250 86.194107
7 2020-12-08 1.4750 121.847555
8 2020-12-09 2.4250 99.658973
10 2020-12-11 3.4250 76.806630
11 2020-12-12 7.5375 83.064948
12 2020-12-13 5.6750 82.401187
13 2020-12-14 9.9250 58.695437
14 2020-12-15 9.2875 64.574463
15 2020-12-16 7.0250 68.367383
16 2020-12-17 8.9125 84.487293
17 2020-12-18 8.6875 69.031144
18 2020-12-19 8.9500 65.048578
19 2020-12-20 8.6000 91.911185
20 2020-12-21 8.7625 60.022959
21 2020-12-22 12.7375 40.489421
22 2020-12-23 11.9875 43.049642
23 2020-12-24 6.1625 108.761981
24 2020-12-25 3.6875 105.727645
25 2020-12-26 3.8625 108.003397
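I want to create a new column named 'slope15' whose value is the slope of the linear regression 'consumption~Temperature' over the previous 15 rows. How can I do this? I tried using .shift(15) and stats.linregress(), but it did not work as expected.
Thanks very much.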
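I don't like iterating, but I couldn't think of a more elegant way. I was able to get this to work:
import numpy as np
from scipy import stats

df['slope15'] = np.nan
# Slice by position: the index has a gap (no label 9), so label-based
# .loc slices would not always cover exactly 15 rows.
for pos in range(14, len(df)):
    # 15 rows ending at (and including) the current row
    window = df.iloc[pos - 14:pos + 1]
    slope, intercept, r, p, se = stats.linregress(
        window['Temperature'],
        window['consumption']
    )
    df.loc[df.index[pos], 'slope15'] = slope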
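A side note on window sizes in a loop like this: label-based slicing with .loc includes both endpoints and silently skips labels that are absent (the frame in the question has no label 9), so a slice such as df.loc[i-15:i] can cover up to 16 rows rather than exactly 15. Here is a minimal sketch of the difference on a made-up toy frame (the name toy and its values are only for illustration):

```python
import pandas as pd

# Toy frame whose index has a gap: label 3 is missing
toy = pd.DataFrame({"v": range(6)}, index=[0, 1, 2, 4, 5, 6])

# .loc slices by label and includes BOTH endpoints
n_no_gap = len(toy.loc[0:2])   # labels 0, 1, 2
n_gap = len(toy.loc[2:4])      # labels 2, 4 (label 3 does not exist)

# .iloc slices by position and excludes the end, so the size is fixed
n_pos = len(toy.iloc[0:3])     # positions 0, 1, 2

print(n_no_gap, n_gap, n_pos)
```

Slicing by position with .iloc keeps every window at the same length regardless of how the index is labelled.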
You can use a rolling window and apply linregress:
from scipy.stats import linregress

# the function to apply: s is one 15-row window of Temperature
def find_slope(s):
    return linregress(x=df.loc[s.index, "Temperature"],
                      y=df.loc[s.index, "consumption"]).slope

# roll
df["slope15"] = df.Temperature.rolling(15).apply(find_slope)
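We roll over the Temperature column only to obtain the rolling index: we do not use the s passed to find_slope directly, but use its index to pull the required values from the original dataframe df (this relies on apply passing each window as a Series, which is the default raw=False behaviour); linregress then finds the slope,
to get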
>>> df
date Temperature consumption slope15
0 2020-12-01 8.0125 109.046450 NaN
1 2020-12-02 6.1500 104.494946 NaN
2 2020-12-03 5.9375 117.011582 NaN
3 2020-12-04 5.4750 109.615388 NaN
4 2020-12-05 3.8500 142.803438 NaN
5 2020-12-06 2.0500 158.638879 NaN
6 2020-12-07 0.1250 86.194107 NaN
7 2020-12-08 1.4750 121.847555 NaN
8 2020-12-09 2.4250 99.658973 NaN
10 2020-12-11 3.4250 76.806630 NaN
11 2020-12-12 7.5375 83.064948 NaN
12 2020-12-13 5.6750 82.401187 NaN
13 2020-12-14 9.9250 58.695437 NaN
14 2020-12-15 9.2875 64.574463 NaN
15 2020-12-16 7.0250 68.367383 -5.112766
16 2020-12-17 8.9125 84.487293 -5.514602
17 2020-12-18 8.6875 69.031144 -5.801025
18 2020-12-19 8.9500 65.048578 -6.062590
19 2020-12-20 8.6000 91.911185 -5.696331
20 2020-12-21 8.7625 60.022959 -5.310357
21 2020-12-22 12.7375 40.489421 -4.192492
22 2020-12-23 11.9875 43.049642 -5.542047
23 2020-12-24 6.1625 108.761981 -5.182578
24 2020-12-25 3.6875 105.727645 -5.790770
25 2020-12-26 3.8625 108.003397 -7.458946
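If the Python-level function call per window becomes slow on a large frame, one alternative is to use the identity slope = cov(x, y) / var(x) with the built-in rolling covariance and variance. The following is only a sketch under that assumption, not the method from the answers above: rolling_slope is a hypothetical helper name, and the demo frame contains random made-up numbers rather than the data from the question.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress


def rolling_slope(frame, x_col, y_col, window):
    """Rolling least-squares slope of y_col on x_col, window ending at each row."""
    cov_xy = frame[x_col].rolling(window).cov(frame[y_col])
    var_x = frame[x_col].rolling(window).var()
    # Both use ddof=1 by default, so the normalisation cancels in the ratio
    return cov_xy / var_x


# Synthetic data for illustration only
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Temperature": rng.normal(6.0, 3.0, size=30),
    "consumption": rng.normal(90.0, 25.0, size=30),
})
demo["slope15"] = rolling_slope(demo, "Temperature", "consumption", 15)

# Cross-check the last window against linregress
check = linregress(demo["Temperature"].iloc[-15:],
                   demo["consumption"].iloc[-15:]).slope
print(np.isclose(demo["slope15"].iloc[-1], check))
```

As with rolling(15).apply, each window here includes the current row; if the slope should be based strictly on the 15 rows before the current one, shifting the result down by one row with .shift(1) would be one way to do it.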