Pandas df.resample(): Specify NaN threshold for calculation of mean
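I would like to resample a pandas DataFrame from hourly to annual/daily frequency using the how=mean method. Of course, some of the hourly data is missing over the course of the year.
How can I set a threshold for the allowed ratio of NaNs, above which the mean is itself set to NaN? I could not find anything about this in the documentation...
Thanks in advance!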
Here is a simple solution using groupby.
import numpy as np
import pandas as pd

# Test data: one year of hourly values
# (freq='H' is spelled 'h' in recent pandas versions)
start_date = pd.to_datetime('2015-01-01')
number = 365*24
df = pd.DataFrame(np.random.randint(1,10, number),index=pd.date_range(start=start_date, periods=number, freq='H'), columns=['values'])
# Generating some NaN to simulate fewer values on the first day
na_range = pd.date_range(start=start_date, end=start_date + pd.Timedelta(hours=3), freq='H')
df.loc[na_range,'values'] = np.nan
# Grouping by day, computing the mean and the count of non-NaN values
df = df.groupby(df.index.date).agg(['mean', 'count'])
df.columns = df.columns.droplevel()
# Populating the mean only if the number of values (count) is greater than the threshold
df['values'] = np.nan
df.loc[df['count'] > 20, 'values'] = df['mean']
print(df.head())
# Result
                mean  count  values
2015-01-01  4.947368     20     NaN
2015-01-02  5.125000     24   5.125
2015-01-03  4.875000     24   4.875
2015-01-04  5.750000     24   5.750
2015-01-05  4.875000     24   4.875
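The same count-based rule can also be written without the intermediate column, by masking the mean with `where`. This is only a minimal sketch: the test data, the name `min_count` and the threshold of 20 are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical test data: two days of hourly values, first 4 hours missing.
index = pd.date_range("2015-01-01", periods=48, freq=pd.Timedelta(hours=1))
values = pd.Series(np.arange(48, dtype=float), index=index, name="values")
values.iloc[:4] = np.nan

# Illustrative threshold: keep a daily mean only if more than 20 hours are valid.
min_count = 20

# "count" ignores NaN, so it is the number of valid samples per day.
stats = values.resample("D").agg(["mean", "count"])

# where() keeps the mean where the condition holds and puts NaN elsewhere.
daily_mean = stats["mean"].where(stats["count"] > min_count)
print(daily_mean)
```

The first day has exactly 20 valid hours, so with a strict comparison its mean is masked, while the complete second day keeps its mean.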
Here is an alternative solution based on resampling.
import numpy as np
import pandas as pd

# Test data (taken from Romain)
# (freq='H' is spelled 'h' in recent pandas versions)
start_date = pd.to_datetime('2015-01-01')
number = 365*24
df = pd.DataFrame(np.random.randint(1,10, number),index=pd.date_range(start=start_date, periods=number, freq='H'), columns=['values'])
# Generating some NaN to simulate fewer values on the first day
na_range = pd.date_range(start=start_date, end='2015-01-01 12:00', freq='H')
df.loc[na_range,'values'] = np.nan
# Add a column with 1 if data is not NaN, 0 if data is NaN
df['data coverage'] = (~np.isnan(df['values'])).astype(int)
# The daily mean of the 0/1 column is the fraction of valid hours in that day
df = df.resample('D').mean()
# Specify a threshold on data coverage of 80%
threshold = 0.8
df.loc[df['data coverage'] < threshold, 'values'] = np.nan
print(df.head(7))
# Result
              values  data coverage
2015-01-01       NaN       0.458333
2015-01-02  5.708333       1.000000
2015-01-03  5.083333       1.000000
2015-01-04  4.958333       1.000000
2015-01-05  5.125000       1.000000
2015-01-06  4.791667       1.000000
2015-01-07  5.625000       1.000000
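The coverage idea could be wrapped in a small reusable function, sketched below. The function name `resample_mean_min_coverage`, its parameters and the test data are hypothetical and only illustrate the approach; they are not a pandas API.

```python
import numpy as np
import pandas as pd


def resample_mean_min_coverage(series, rule, min_coverage):
    """Hypothetical helper: resampled mean, set to NaN where the fraction
    of non-NaN samples in a bin is below min_coverage."""
    mean = series.resample(rule).mean()
    # Mean of a 1.0/0.0 validity flag = fraction of valid samples per bin.
    coverage = series.notna().astype(float).resample(rule).mean()
    return mean.where(coverage >= min_coverage), coverage


# Hypothetical test data: three days of hourly values.
index = pd.date_range("2015-01-01", periods=72, freq=pd.Timedelta(hours=1))
values = pd.Series(np.arange(72, dtype=float), index=index)
values.iloc[:13] = np.nan    # day 1: only 11 of 24 hours valid
values.iloc[48:50] = np.nan  # day 3: 22 of 24 hours valid

result, coverage = resample_mean_min_coverage(values, "D", min_coverage=0.8)
print(pd.DataFrame({"values": result, "data coverage": coverage}))
```

Passing a yearly rule instead of "D" should cover the annual case from the question. Note that coverage here is measured against the rows that are present: timestamps missing from the index entirely are not counted as gaps unless the series is first reindexed to a complete hourly range.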