Standard deviation with groupby (multiple columns) in pandas

I am working with data from the California Air Resources Board.

site,monitor,date,start_hour,value,variable,units,quality,prelim,name 
5407,t,2014-01-01,0,3.00,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,1,1.54,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,2,3.76,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,3,5.98,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,4,8.09,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,5,12.05,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
5407,t,2014-01-01,6,12.55,PM25HR,Micrograms/Cubic Meter ( ug/m<sup>3</sup> ),0,y,Bombay Beach 
...

df = pd.concat([pd.read_csv(file, header=0) for file in f])  # merges all files into one dataframe
df.dropna(axis=0, how="all", subset=['start_hour', 'variable'],
          inplace=True)  # drops trailing rows where both start_hour and variable are NaN

df.start_hour = pd.to_timedelta(df['start_hour'], unit = 'h')
df.date = pd.to_datetime(df.date)
df['datetime'] = df.date + df.start_hour
df.drop(columns=['date', 'start_hour'], inplace=True)
df['month'] = df.datetime.dt.month
df['day'] = df.datetime.dt.day
df['year'] = df.datetime.dt.year
df.set_index('datetime', inplace = True)
df =  df.rename(columns={'value':'conc'})

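I have many years of hourly PM2.5 concentration data, and I am trying to prepare plots that show the average monthly concentration over the years (a separate plot for each month). Here is an image of the plot I have created so far. [![Bombay Beach][1]][1] However, I want to add error bars to the mean concentration line, and I am running into problems when trying to compute the standard deviation. I created a new dataframe, d_avg, that holds the year, month, day, and the average PM2.5 concentration; here is some of the data.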

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
   year  month  day      conc
0  2014      1    1  9.644583
1  2014      1    2  4.945652
2  2014      1    3  4.345238
3  2014      1    4  5.047917
4  2014      1    5  5.212857
5  2014      1    6  2.095714

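After this, I computed the monthly average, m_avg, and created a datetime column so that I could plot datetime against the monthly average conc (see above, black line).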

m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
m_avg['datetime'] = pd.to_datetime(m_avg.year.astype(str) + m_avg.month.astype(str), format='%Y%m') + MonthEnd(1)
[In]: m_avg.head(6)
[Out]:
   year  month      conc   datetime
0  2014      1  4.330985 2014-01-31
1  2014      2  2.280096 2014-02-28
2  2014      3  4.464622 2014-03-31
3  2014      4  6.583759 2014-04-30
4  2014      5  9.069353 2014-05-31
5  2014      6  9.982330 2014-06-30
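The string concatenation above relies on the parser accepting a one-digit month (for example "20141"). As a hedged alternative, here is a minimal sketch that builds the month-end timestamp from the integer columns instead. The small m_avg frame is a made-up stand-in, not the real data, and using MonthEnd(0) to roll forward is a choice made for this sketch.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Hypothetical stand-in for the m_avg frame (values are made up).
m_avg = pd.DataFrame({"year": [2014, 2014, 2015],
                      "month": [1, 11, 2],
                      "conc": [4.3, 2.2, 6.5]})

# Build a first-of-month timestamp from the integer columns; pd.to_datetime
# accepts a frame with year, month and day columns.
first_of_month = pd.to_datetime(m_avg[["year", "month"]].assign(day=1))

# MonthEnd(0) rolls a date forward to the last day of its own month.
m_avg["datetime"] = first_of_month + MonthEnd(0)
print(m_avg)
```

This avoids any dependence on how the month digits are padded in the intermediate string.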

Now I want to calculate the standard deviation of the d_avg concentrations, and I have tried several approaches:

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].agg(np.std)

sd = d_avg['conc'].apply(lambda x: x.std())

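However, every attempt leaves me with the same error in the dataframe. I cannot plot the standard deviation because I believe it is also taking the standard deviation of the year and month columns, which are the columns I am trying to group the data by. Here is what my resulting dataframe sd looks like: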

        year     month        sd
0  44.877611  1.000000  1.795868
1  44.877611  1.414214  2.355055
2  44.877611  1.732051  2.597531
3  44.877611  2.000000  2.538749
4  44.877611  2.236068  5.456785
5  44.877611  2.449490  3.315546

Please help!

  [1]: https://i.stack.imgur.com/ueVrG.png

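I tried to reproduce your error, and it works fine for me. Here is my complete code example, which is almost identical to yours apart from how the original dataframe is generated. So I suspect that part of the code. Can you provide the code that creates the dataframe?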

import pandas as pd

columns = ['year', 'month', 'day', 'conc']
data = [[2014, 1, 1, 2.0],
        [2014, 1, 1, 4.0],
        [2014, 1, 2, 6.0],
        [2014, 1, 2, 8.0],
        [2014, 2, 1, 2.0],
        [2014, 2, 1, 6.0],
        [2014, 2, 2, 10.0],
        [2014, 2, 2, 14.0]]

df = pd.DataFrame(data, columns=columns)
d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year', 'month'], as_index=False)['conc'].mean()
m_std = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std()

print(f'Concentrations:\n{df}\n')
print(f'Daily Average:\n{d_avg}\n')
print(f'Monthly Average:\n{m_avg}\n')
print(f'Monthly Standard Deviation:\n{m_std}\n')

Output:

Concentrations:
   year  month  day  conc
0  2014      1    1   2.0
1  2014      1    1   4.0
2  2014      1    2   6.0
3  2014      1    2   8.0
4  2014      2    1   2.0
5  2014      2    1   6.0
6  2014      2    2  10.0
7  2014      2    2  14.0

Daily Average:
   year  month  day  conc
0  2014      1    1   3.0
1  2014      1    2   7.0
2  2014      2    1   4.0
3  2014      2    2  12.0

Monthly Average:
   year  month  conc
0  2014      1   5.0
1  2014      2   8.0

Monthly Standard Deviation:
   year  month      conc
0  2014      1  2.828427
1  2014      2  5.656854
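One observation about the sd frame shown in the question: 44.877611 is the square root of 2014, and the month column holds the square roots of 1 through 6. That pattern would be consistent with the square-root step of the standard deviation also being applied to the group keys when as_index=False is used. I have not confirmed which pandas versions behave this way, so treat that explanation as an assumption. A minimal sketch that sidesteps the issue keeps the keys in the index during std() and restores them afterwards with reset_index(), using the toy daily averages from above:

```python
import math
import pandas as pd

# Daily averages from the toy example above.
d_avg = pd.DataFrame({"year": [2014, 2014, 2014, 2014],
                      "month": [1, 1, 2, 2],
                      "day": [1, 2, 1, 2],
                      "conc": [3.0, 7.0, 4.0, 12.0]})

# Group with the default as_index=True so the keys live in the index while
# std() runs, then turn them back into ordinary columns.
m_std = (d_avg.groupby(["year", "month"])["conc"]
              .std()
              .rename("sd")
              .reset_index())
print(m_std)

# Sanity check on the odd year value in the question's output.
print(math.sqrt(2014))
```

If the year and month columns come back unchanged with this form, the problem is in how the keys are handled rather than in the concentration data.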

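I decided to work around my problem, since I could not figure out what was causing it. I merged the m_avg and sd dataframes and dropped the year and month columns that were causing me problems. See the code below, which involves a lot of renaming.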

d_avg = df.groupby(['year', 'month', 'day'], as_index=False)['conc'].mean()
m_avg = d_avg.groupby(['year','month'], as_index=False)['conc'].mean()
sd = d_avg.groupby(['year', 'month'], as_index=False)['conc'].std(ddof=0) 
sd = sd.rename(columns={"conc":"sd", "year":"wrongyr", "month":"wrongmth"})
m_avg_sd = pd.concat([m_avg, sd], axis = 1)
m_avg_sd.drop(columns=['wrongyr', 'wrongmth'], inplace = True)
m_avg_sd['datetime'] = pd.to_datetime(m_avg_sd.year.astype(str) + m_avg_sd.month.astype(str), format='%Y%m') + MonthEnd(1)

Here is the new dataframe:

m_avg_sd.head(5)
Out[2]: 
   year  month       conc         sd   datetime
0  2009      1  48.350105  18.394192 2009-01-31
1  2009      2  21.929383  16.293645 2009-02-28
2  2009      3  15.094729   6.821124 2009-03-31
3  2009      4  12.021009   4.391219 2009-04-30
4  2009      5  13.449100   4.081734 2009-05-31
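The rename, concat, and drop steps above could probably be collapsed into a single aggregation. The following is a sketch only, assuming a pandas version that supports named aggregation (0.25 or later) and reusing the toy daily averages from the earlier answer rather than the real data. It also shows both ddof settings, since the workaround uses ddof=0 (population standard deviation) while std() defaults to ddof=1 (sample standard deviation). The column name sd_pop is made up for this illustration.

```python
import pandas as pd

# Toy daily averages (same values as the earlier answer, not the real data).
d_avg = pd.DataFrame({"year": [2014, 2014, 2014, 2014],
                      "month": [1, 1, 2, 2],
                      "day": [1, 2, 1, 2],
                      "conc": [3.0, 7.0, 4.0, 12.0]})

# One pass over the groups: monthly mean, sample std (ddof=1, the default)
# and population std (ddof=0, as in the workaround above).
m_avg_sd = (
    d_avg.groupby(["year", "month"])
         .agg(conc=("conc", "mean"),
              sd=("conc", "std"),
              sd_pop=("conc", lambda s: s.std(ddof=0)))
         .reset_index()
)
print(m_avg_sd)
```

Either standard deviation column can then be supplied as the yerr argument of matplotlib's errorbar when drawing the monthly mean line.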