Python:重采样并前向填充到最近一个月
Python: Resampling and forward-filling to most recent month
我有一个大数据集,其中包含每个 property_id 每个月发生费用的费用,根据以下数据集。
property_id period amount
1 2016-07-01 105908.20
1 2016-08-01 0.00
2 2016-08-01 114759.40
3 2014-05-01 -934.00
3 2014-06-01 -845.95
3 2017-12-01 92175.77
4 2015-09-01 -1859.75
4 2015-12-01 1859.75
4 2017-12-01 130105.00
5 2014-07-01 -6929.58
我想创建一个累计总和,按 property_id 分组,并每个月结转,从 property_id 的第一个月到最近的整月。
我试过下面的方法,其中我使用 property_id 重采样并尝试前向填充,但它给出了一个错误
cost = cost.groupby['property_id'].apply(lambda x: x.set_index('period').resample('M').fillna(method='pad'))
TypeError: 'method' object is not subscriptable
示例输出如下:
> property_id period amount
> 1 2016-07-01 105908.20
> 1 2016-08-01 105908.20
> 1 2016-09-01 105908.20
> 1 2016-10-01 105908.20
> ...
> 1 2019-07-01 105908.20
> 2 2016-08-01 114759.40
> 2 2016-09-01 114759.40
> 2 2016-10-01 114759.40
> ...
> 2 2019-07-01 114759.40
> 3 2014-05-01 -934.00
> 3 2014-06-01 -1779.95
> 3 2014-07-01 -1779.95
> 3 2014-08-01 -1779.95
> ...
> 3 2017-12-01 90395.82
> 3 2018-01-01 90395.82
> 3 2018-02-01 90395.82
> 3 2018-03-01 90395.82
> ...
> 3 2019-07-01 90395.82
> 4 2015-09-01 -1859.75
> 4 2015-10-01 -1859.75
> 4 2015-11-01 -1859.75
> 4 2015-12-01 0
> 4 2016-01-01 0
> ...
> 4 2017-11-01 0
> 4 2017-12-01 130105.00
> 4 2018-01-01 130105.00
> ...
> 4 2019-07-01 130105.00
> 5 2014-07-01 -6929.58
> 5 2014-08-01 -6929.58
> 5 2014-09-01 -6929.58
> ...
> 5 2019-07-01 -6929.58
任何帮助都会很棒。
谢谢!
先创建DatetimeIndex
,然后将groupby
与resample
一起使用:
df['period'] = pd.to_datetime(df['period'])
df1 = df.set_index('period').groupby('property_id').resample('M').pad()
#alternative
#df1 = df.set_index('period').groupby('property_id').resample('M').ffill()
print (df1)
property_id amount
property_id period
1 2016-07-31 1 105908.20
2016-08-31 1 0.00
2 2016-08-31 2 114759.40
3 2014-05-31 3 -934.00
2014-06-30 3 -845.95
... ...
4 2017-09-30 4 1859.75
2017-10-31 4 1859.75
2017-11-30 4 1859.75
2017-12-31 4 130105.00
5 2014-07-31 5 -6929.58
[76 rows x 2 columns]
编辑:想法是通过过滤 property_id
的最后一个值并按条件分配月份来创建新的 DataFrame,然后附加到原始数据并使用上面的解决方案:
df['period'] = pd.to_datetime(df['period'])
df = df.sort_values(['property_id','period'])
last = pd.to_datetime('now').floor('d')
nextday = (last + pd.Timedelta(1, 'd')).day
orig_month = last.to_period('m').to_timestamp()
before_month = (last.to_period('m') - 1).to_timestamp()
last = orig_month if nextday == 1 else before_month
print (last)
2019-07-01 00:00:00
df1 = df.drop_duplicates('property_id', keep='last').assign(period=last)
print (df1)
property_id period amount
1 1 2019-07-01 0.00
2 2 2019-07-01 114759.40
5 3 2019-07-01 92175.77
8 4 2019-07-01 130105.00
9 5 2019-07-01 -6929.58
df = pd.concat([df, df1])
df1 = (df.set_index('period')
.groupby('property_id')['amount']
.resample('MS')
.asfreq(fill_value=0)
.groupby(level=0)
.cumsum())
print (df1)
property_id period
1 2016-07-01 105908.20
2016-08-01 105908.20
2016-09-01 105908.20
2016-10-01 105908.20
2016-11-01 105908.20
5 2019-03-01 -394986.06
2019-04-01 -401915.64
2019-05-01 -408845.22
2019-06-01 -415774.80
2019-07-01 -422704.38
Name: amount, Length: 244, dtype: float64
我有一个大数据集,其中包含每个 property_id 每个月发生费用的费用,根据以下数据集。
property_id period amount
1 2016-07-01 105908.20
1 2016-08-01 0.00
2 2016-08-01 114759.40
3 2014-05-01 -934.00
3 2014-06-01 -845.95
3 2017-12-01 92175.77
4 2015-09-01 -1859.75
4 2015-12-01 1859.75
4 2017-12-01 130105.00
5 2014-07-01 -6929.58
我想创建一个累计总和,按 property_id 分组,并每个月结转,从 property_id 的第一个月到最近的整月。
我试过下面的方法,其中我使用 property_id 重采样并尝试前向填充,但它给出了一个错误
cost = cost.groupby['property_id'].apply(lambda x: x.set_index('period').resample('M').fillna(method='pad'))
TypeError: 'method' object is not subscriptable
示例输出如下:
> property_id period amount
> 1 2016-07-01 105908.20
> 1 2016-08-01 105908.20
> 1 2016-09-01 105908.20
> 1 2016-10-01 105908.20
> ...
> 1 2019-07-01 105908.20
> 2 2016-08-01 114759.40
> 2 2016-09-01 114759.40
> 2 2016-10-01 114759.40
> ...
> 2 2019-07-01 114759.40
> 3 2014-05-01 -934.00
> 3 2014-06-01 -1779.95
> 3 2014-07-01 -1779.95
> 3 2014-08-01 -1779.95
> ...
> 3 2017-12-01 90395.82
> 3 2018-01-01 90395.82
> 3 2018-02-01 90395.82
> 3 2018-03-01 90395.82
> ...
> 3 2019-07-01 90395.82
> 4 2015-09-01 -1859.75
> 4 2015-10-01 -1859.75
> 4 2015-11-01 -1859.75
> 4 2015-12-01 0
> 4 2016-01-01 0
> ...
> 4 2017-11-01 0
> 4 2017-12-01 130105.00
> 4 2018-01-01 130105.00
> ...
> 4 2019-07-01 130105.00
> 5 2014-07-01 -6929.58
> 5 2014-08-01 -6929.58
> 5 2014-09-01 -6929.58
> ...
> 5 2019-07-01 -6929.58
任何帮助都会很棒。
谢谢!
先创建DatetimeIndex
,然后将groupby
与resample
一起使用:
df['period'] = pd.to_datetime(df['period'])
df1 = df.set_index('period').groupby('property_id').resample('M').pad()
#alternative
#df1 = df.set_index('period').groupby('property_id').resample('M').ffill()
print (df1)
property_id amount
property_id period
1 2016-07-31 1 105908.20
2016-08-31 1 0.00
2 2016-08-31 2 114759.40
3 2014-05-31 3 -934.00
2014-06-30 3 -845.95
... ...
4 2017-09-30 4 1859.75
2017-10-31 4 1859.75
2017-11-30 4 1859.75
2017-12-31 4 130105.00
5 2014-07-31 5 -6929.58
[76 rows x 2 columns]
编辑:想法是通过过滤 property_id
的最后一个值并按条件分配月份来创建新的 DataFrame,然后附加到原始数据并使用上面的解决方案:
df['period'] = pd.to_datetime(df['period'])
df = df.sort_values(['property_id','period'])
last = pd.to_datetime('now').floor('d')
nextday = (last + pd.Timedelta(1, 'd')).day
orig_month = last.to_period('m').to_timestamp()
before_month = (last.to_period('m') - 1).to_timestamp()
last = orig_month if nextday == 1 else before_month
print (last)
2019-07-01 00:00:00
df1 = df.drop_duplicates('property_id', keep='last').assign(period=last)
print (df1)
property_id period amount
1 1 2019-07-01 0.00
2 2 2019-07-01 114759.40
5 3 2019-07-01 92175.77
8 4 2019-07-01 130105.00
9 5 2019-07-01 -6929.58
df = pd.concat([df, df1])
df1 = (df.set_index('period')
.groupby('property_id')['amount']
.resample('MS')
.asfreq(fill_value=0)
.groupby(level=0)
.cumsum())
print (df1)
property_id period
1 2016-07-01 105908.20
2016-08-01 105908.20
2016-09-01 105908.20
2016-10-01 105908.20
2016-11-01 105908.20
5 2019-03-01 -394986.06
2019-04-01 -401915.64
2019-05-01 -408845.22
2019-06-01 -415774.80
2019-07-01 -422704.38
Name: amount, Length: 244, dtype: float64