Python Pandas:使用两个不同大小的数据帧进行计算
Python Pandas: Calculations with Two Different Size Dataframes
我有两个不同大小的数据框,我想用它们进行计算。第一个数据集是时间序列。第二个数据集是长期的月平均值。
第一个:
year month snow_depth
0 1979 1 18.322581
1 1979 2 11.535714
2 1979 3 5.322581
3 1979 4 0.300000
4 1979 5 0.000000
5 1979 6 0.000000
6 1979 7 0.000000
7 1979 8 0.000000
8 1979 9 0.000000
9 1979 10 0.322581
10 1979 11 2.366667
11 1979 12 1.290323
12 1980 1 7.838710
13 1980 2 9.758621
14 1980 3 3.967742
15 1980 4 0.000000
16 1980 5 0.000000
17 1980 6 0.000000
18 1980 7 0.000000
19 1980 8 0.000000
20 1980 9 0.000000
21 1980 10 0.000000
22 1980 11 0.100000
23 1980 12 0.000000
24 1981 1 4.290323
25 1981 2 13.678571
26 1981 3 2.967742
27 1981 4 0.066667
28 1981 5 0.000000
29 1981 6 0.000000
30 1981 7 0.000000
31 1981 8 0.000000
32 1981 9 0.000000
33 1981 10 0.387097
34 1981 11 5.000000
35 1981 12 15.741935
第二个:
month
1 14.385199
2 15.312301
3 6.510436
4 0.268908
5 0.005819
6 0.000000
7 0.000000
8 0.000000
9 0.000000
10 0.363676
11 2.350980
12 6.730550
我用过的代码如下:
oname = 'Z:/Dan/SnowStatData/Monthly/Observations/'+str(stations[c])+'Observations.txt'
df = pd.read_table(oname,sep='\t',usecols=(0,1,2),names=['year','month','snow_depth'],na_values=-999)
df = df.replace(np.nan,0)
g = df.groupby(['month'])
oavg = g.aggregate({'snow_depth':np.average})
我试过简单地从原始数据帧中减去 oavg,但第一个数据值最终变成了 NaN(因为它似乎是按 df 的索引而不是月份)然后在索引 #12 之后,我得到始终为 NaN。有没有一种简单的方法可以通过每年并减去 oavg 以便我有一个偏离平均值的完整时间序列?
编辑:这就是我最终想要的:
year month snow_depth
0 1979 1 18.322581 - oavg[month=1]
1 1979 2 11.535714 - oavg[month=2]
2 1979 3 5.322581 - oavg[month=3]
3 1979 4 0.300000 - oavg[month=4]
4 1979 5 0.000000 - oavg[month=5]
5 1979 6 0.000000 - oavg[month=6]
6 1979 7 0.000000 - oavg[month=7]
7 1979 8 0.000000 - oavg[month=8]
8 1979 9 0.000000 - oavg[month=9]
9 1979 10 0.322581 - oavg[month=10]
10 1979 11 2.366667 - oavg[month=11]
11 1979 12 1.290323 - oavg[month=12]
然后每年重复:
12 1980 1 7.838710 - oavg[month=1]
13 1980 2 9.758621 - oavg[month=2]
14 1980 3 3.967742 - oavg[month=3]
等等等
我先把df2的列重命名为'avg':
df2.columns = ['avg']
然后,将平均降雪量合并到df1,然后计算差值:
df3 = df1.merge(df2, how='left', left_on='month', right_index=True, suffixes=('', '_rhs'))
df3['difference'] = df3['snow_depth'] - df3['avg']
>>> df3
year month snow_depth avg difference
0 1979 1 18.322581 14.385199 3.937382
1 1979 2 11.535714 15.312301 -3.776587
2 1979 3 5.322581 6.510436 -1.187855
3 1979 4 0.300000 0.268908 0.031092
4 1979 5 0.000000 0.005819 -0.005819
5 1979 6 0.000000 0.000000 0.000000
6 1979 7 0.000000 0.000000 0.000000
7 1979 8 0.000000 0.000000 0.000000
8 1979 9 0.000000 0.000000 0.000000
9 1979 10 0.322581 0.363676 -0.041095
10 1979 11 2.366667 2.350980 0.015687
11 1979 12 1.290323 6.730550 -5.440227
12 1980 1 7.838710 14.385199 -6.546489
13 1980 2 9.758621 15.312301 -5.553680
14 1980 3 3.967742 6.510436 -2.542694
15 1980 4 0.000000 0.268908 -0.268908
16 1980 5 0.000000 0.005819 -0.005819
17 1980 6 0.000000 0.000000 0.000000
18 1980 7 0.000000 0.000000 0.000000
19 1980 8 0.000000 0.000000 0.000000
20 1980 9 0.000000 0.000000 0.000000
21 1980 10 0.000000 0.363676 -0.363676
22 1980 11 0.100000 2.350980 -2.250980
23 1980 12 0.000000 6.730550 -6.730550
24 1981 1 4.290323 14.385199 -10.094876
25 1981 2 13.678571 15.312301 -1.633730
26 1981 3 2.967742 6.510436 -3.542694
27 1981 4 0.066667 0.268908 -0.202241
28 1981 5 0.000000 0.005819 -0.005819
29 1981 6 0.000000 0.000000 0.000000
30 1981 7 0.000000 0.000000 0.000000
31 1981 8 0.000000 0.000000 0.000000
32 1981 9 0.000000 0.000000 0.000000
33 1981 10 0.387097 0.363676 0.023421
34 1981 11 5.000000 2.350980 2.649020
35 1981 12 15.741935 6.730550 9.011385
试一试。使用地图直接从您的一系列平均值中提取
df["diff"] = df["snow_depth"] - df["month"].map(nameofyourseries)
year month snow_depth diff
0 1979 1 18.322581 3.937382
1 1979 2 11.535714 -3.776587
2 1979 3 5.322581 -1.187855
3 1979 4 0.300000 0.031092
4 1979 5 0.000000 -0.005819
5 1979 6 0.000000 0.000000
6 1979 7 0.000000 0.000000
7 1979 8 0.000000 0.000000
8 1979 9 0.000000 0.000000
9 1979 10 0.322581 -0.041095
10 1979 11 2.366667 0.015687
11 1979 12 1.290323 -5.440227
12 1980 1 7.838710 -6.546489
13 1980 2 9.758621 -5.553680
14 1980 3 3.967742 -2.542694
15 1980 4 0.000000 -0.268908
16 1980 5 0.000000 -0.005819
17 1980 6 0.000000 0.000000
18 1980 7 0.000000 0.000000
19 1980 8 0.000000 0.000000
20 1980 9 0.000000 0.000000
21 1980 10 0.000000 -0.363676
22 1980 11 0.100000 -2.250980
23 1980 12 0.000000 -6.730550
24 1981 1 4.290323 -10.094876
25 1981 2 13.678571 -1.633730
26 1981 3 2.967742 -3.542694
27 1981 4 0.066667 -0.202241
28 1981 5 0.000000 -0.005819
29 1981 6 0.000000 0.000000
30 1981 7 0.000000 0.000000
31 1981 8 0.000000 0.000000
32 1981 9 0.000000 0.000000
33 1981 10 0.387097 0.023421
34 1981 11 5.000000 2.649020
35 1981 12 15.741935 9.011385
我有两个不同大小的数据框,我想用它们进行计算。第一个数据集是时间序列。第二个数据集是长期的月平均值。
第一个:
year month snow_depth
0 1979 1 18.322581
1 1979 2 11.535714
2 1979 3 5.322581
3 1979 4 0.300000
4 1979 5 0.000000
5 1979 6 0.000000
6 1979 7 0.000000
7 1979 8 0.000000
8 1979 9 0.000000
9 1979 10 0.322581
10 1979 11 2.366667
11 1979 12 1.290323
12 1980 1 7.838710
13 1980 2 9.758621
14 1980 3 3.967742
15 1980 4 0.000000
16 1980 5 0.000000
17 1980 6 0.000000
18 1980 7 0.000000
19 1980 8 0.000000
20 1980 9 0.000000
21 1980 10 0.000000
22 1980 11 0.100000
23 1980 12 0.000000
24 1981 1 4.290323
25 1981 2 13.678571
26 1981 3 2.967742
27 1981 4 0.066667
28 1981 5 0.000000
29 1981 6 0.000000
30 1981 7 0.000000
31 1981 8 0.000000
32 1981 9 0.000000
33 1981 10 0.387097
34 1981 11 5.000000
35 1981 12 15.741935
第二个:
month
1 14.385199
2 15.312301
3 6.510436
4 0.268908
5 0.005819
6 0.000000
7 0.000000
8 0.000000
9 0.000000
10 0.363676
11 2.350980
12 6.730550
我用过的代码如下:
oname = 'Z:/Dan/SnowStatData/Monthly/Observations/'+str(stations[c])+'Observations.txt'
df = pd.read_table(oname,sep='\t',usecols=(0,1,2),names=['year','month','snow_depth'],na_values=-999)
df = df.replace(np.nan,0)
g = df.groupby(['month'])
oavg = g.aggregate({'snow_depth':np.average})
我试过简单地从原始数据帧中减去 oavg,但第一个数据值最终变成了 NaN(因为它似乎是按 df 的索引而不是月份)然后在索引 #12 之后,我得到始终为 NaN。有没有一种简单的方法可以通过每年并减去 oavg 以便我有一个偏离平均值的完整时间序列?
编辑:这就是我最终想要的:
year month snow_depth
0 1979 1 18.322581 - oavg[month=1]
1 1979 2 11.535714 - oavg[month=2]
2 1979 3 5.322581 - oavg[month=3]
3 1979 4 0.300000 - oavg[month=4]
4 1979 5 0.000000 - oavg[month=5]
5 1979 6 0.000000 - oavg[month=6]
6 1979 7 0.000000 - oavg[month=7]
7 1979 8 0.000000 - oavg[month=8]
8 1979 9 0.000000 - oavg[month=9]
9 1979 10 0.322581 - oavg[month=10]
10 1979 11 2.366667 - oavg[month=11]
11 1979 12 1.290323 - oavg[month=12]
然后每年重复:
12 1980 1 7.838710 - oavg[month=1]
13 1980 2 9.758621 - oavg[month=2]
14 1980 3 3.967742 - oavg[month=3]
等等等
我先把df2的列重命名为'avg':
df2.columns = ['avg']
然后,将平均降雪量合并到df1,然后计算差值:
df3 = df1.merge(df2, how='left', left_on='month', right_index=True, suffixes=('', '_rhs'))
df3['difference'] = df3['snow_depth'] - df3['avg']
>>> df3
year month snow_depth avg difference
0 1979 1 18.322581 14.385199 3.937382
1 1979 2 11.535714 15.312301 -3.776587
2 1979 3 5.322581 6.510436 -1.187855
3 1979 4 0.300000 0.268908 0.031092
4 1979 5 0.000000 0.005819 -0.005819
5 1979 6 0.000000 0.000000 0.000000
6 1979 7 0.000000 0.000000 0.000000
7 1979 8 0.000000 0.000000 0.000000
8 1979 9 0.000000 0.000000 0.000000
9 1979 10 0.322581 0.363676 -0.041095
10 1979 11 2.366667 2.350980 0.015687
11 1979 12 1.290323 6.730550 -5.440227
12 1980 1 7.838710 14.385199 -6.546489
13 1980 2 9.758621 15.312301 -5.553680
14 1980 3 3.967742 6.510436 -2.542694
15 1980 4 0.000000 0.268908 -0.268908
16 1980 5 0.000000 0.005819 -0.005819
17 1980 6 0.000000 0.000000 0.000000
18 1980 7 0.000000 0.000000 0.000000
19 1980 8 0.000000 0.000000 0.000000
20 1980 9 0.000000 0.000000 0.000000
21 1980 10 0.000000 0.363676 -0.363676
22 1980 11 0.100000 2.350980 -2.250980
23 1980 12 0.000000 6.730550 -6.730550
24 1981 1 4.290323 14.385199 -10.094876
25 1981 2 13.678571 15.312301 -1.633730
26 1981 3 2.967742 6.510436 -3.542694
27 1981 4 0.066667 0.268908 -0.202241
28 1981 5 0.000000 0.005819 -0.005819
29 1981 6 0.000000 0.000000 0.000000
30 1981 7 0.000000 0.000000 0.000000
31 1981 8 0.000000 0.000000 0.000000
32 1981 9 0.000000 0.000000 0.000000
33 1981 10 0.387097 0.363676 0.023421
34 1981 11 5.000000 2.350980 2.649020
35 1981 12 15.741935 6.730550 9.011385
试一试。使用地图直接从您的一系列平均值中提取
df["diff"] = df["snow_depth"] - df["month"].map(nameofyourseries)
year month snow_depth diff
0 1979 1 18.322581 3.937382
1 1979 2 11.535714 -3.776587
2 1979 3 5.322581 -1.187855
3 1979 4 0.300000 0.031092
4 1979 5 0.000000 -0.005819
5 1979 6 0.000000 0.000000
6 1979 7 0.000000 0.000000
7 1979 8 0.000000 0.000000
8 1979 9 0.000000 0.000000
9 1979 10 0.322581 -0.041095
10 1979 11 2.366667 0.015687
11 1979 12 1.290323 -5.440227
12 1980 1 7.838710 -6.546489
13 1980 2 9.758621 -5.553680
14 1980 3 3.967742 -2.542694
15 1980 4 0.000000 -0.268908
16 1980 5 0.000000 -0.005819
17 1980 6 0.000000 0.000000
18 1980 7 0.000000 0.000000
19 1980 8 0.000000 0.000000
20 1980 9 0.000000 0.000000
21 1980 10 0.000000 -0.363676
22 1980 11 0.100000 -2.250980
23 1980 12 0.000000 -6.730550
24 1981 1 4.290323 -10.094876
25 1981 2 13.678571 -1.633730
26 1981 3 2.967742 -3.542694
27 1981 4 0.066667 -0.202241
28 1981 5 0.000000 -0.005819
29 1981 6 0.000000 0.000000
30 1981 7 0.000000 0.000000
31 1981 8 0.000000 0.000000
32 1981 9 0.000000 0.000000
33 1981 10 0.387097 0.023421
34 1981 11 5.000000 2.649020
35 1981 12 15.741935 9.011385