Sum of true values over past n dates in pandas
I have a data frame with a few thousand rows containing a geography column, response_dates, and True/False values for in_compliance.
import pandas as pd

df = pd.DataFrame({
    "geography": ["Baltimore", "Frederick", "Annapolis", "Hagerstown", "Rockville", "Salisbury", "Towson", "Bowie"],
    "response_date": ["2018-03-31", "2018-03-30", "2018-03-28", "2018-03-28", "2018-04-02", "2018-03-30", "2018-04-07", "2018-04-02"],
    "in_compliance": [True, True, False, True, False, True, False, True]})
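I want to add a column that gives the number of True values over the most recent four dates in the response_date column, including that row's own response_date. Example of the desired output: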
geography response_date in_compliance Past_4_dates_sum_of_true
Baltimore 2018-03-24 True 1
Baltimore 2018-03-25 False 1
Baltimore 2018-03-26 False 1
Baltimore 2018-03-27 False 1
Baltimore 2018-03-30 False 0
Baltimore 2018-03-31 True 1
Baltimore 2018-04-01 True 2
Baltimore 2018-04-02 True 3
Baltimore 2018-04-03 False 3
Baltimore 2018-04-06 True 3
Baltimore 2018-04-07 True 3
Baltimore 2018-04-08 False 2
I have tried different groupby and rolling approaches, but the results are not what I expect and need.
df.groupby('city').resample('d').sum().fillna(0).groupby('city').rolling(4,min_periods=1).sum()
Here is another approach I tried:
df1 = df.groupby(['city']).apply(lambda x: x.set_index('response_date').resample('1D').first())
df2 = df1.groupby(level=0)['in_compliance']\
.apply(lambda x: x.shift().rolling(min_periods=1,window=4).count())\
.reset_index(name='Past_4_dates_sum_of_true')
Simpler:
df['Past_4_dates_sum_of_true'] = df.rolling(4, min_periods=1)['in_compliance'].sum().astype(int)
Output:
geography response_date in_compliance Past_4_dates_sum_of_true
0 Baltimore 2018-03-24 True 1
1 Baltimore 2018-03-25 False 1
2 Baltimore 2018-03-26 False 1
3 Baltimore 2018-03-27 False 1
4 Baltimore 2018-03-30 False 0
5 Baltimore 2018-03-31 True 1
6 Baltimore 2018-04-01 True 2
7 Baltimore 2018-04-02 True 3
8 Baltimore 2018-04-03 False 3
9 Baltimore 2018-04-06 True 3
10 Baltimore 2018-04-07 True 3
11 Baltimore 2018-04-08 False 2
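The one-liner above rolls over the whole frame, so with more than one geography the window would run across group boundaries. A minimal sketch of a per-group version follows, assuming one row per date within each geography and that "past four dates" means the last four rows present for that geography. The two Frederick rows and the helper column name `flag` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "geography": ["Baltimore"] * 12 + ["Frederick"] * 2,
    "response_date": pd.to_datetime([
        "2018-03-24", "2018-03-25", "2018-03-26", "2018-03-27",
        "2018-03-30", "2018-03-31", "2018-04-01", "2018-04-02",
        "2018-04-03", "2018-04-06", "2018-04-07", "2018-04-08",
        "2018-03-30", "2018-04-02",  # hypothetical second geography
    ]),
    "in_compliance": [True, False, False, False, False, True,
                      True, True, False, True, True, False,
                      True, True],
})

# Sort so that each geography's rows are in date order
df = df.sort_values(["geography", "response_date"]).reset_index(drop=True)

# Cast booleans to integers so the rolling sum operates on numeric data
df["flag"] = df["in_compliance"].astype(int)

# Row-based window of 4 within each geography; min_periods=1 keeps the
# first rows of a group instead of returning NaN
rolled = (
    df.groupby("geography")["flag"]
      .rolling(4, min_periods=1)
      .sum()
      .reset_index(level=0, drop=True)  # drop the group level, keep row index
)
df["Past_4_dates_sum_of_true"] = rolled.astype(int)
df = df.drop(columns="flag")
print(df)
```

Because the window counts rows rather than calendar days, gaps in the dates do not shrink it.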
I think you can use rolling with 4d:
df = df.sort_values(['city','response_date'])
df = df.set_index('response_date')
df['new'] = (df.groupby('city')['in_compliance']
.rolling('4d',min_periods=1)
.sum()
.astype(int)
.reset_index(level=0, drop=True))
df = df.reset_index()
print (df)
response_date city in_compliance Past_4_dates_sum_of_true new
0 2018-03-24 Baltimore True 1 1
1 2018-03-25 Baltimore False 1 1
2 2018-03-26 Baltimore False 1 1
3 2018-03-27 Baltimore False 1 1
4 2018-03-30 Baltimore False 0 0
5 2018-03-31 Baltimore True 1 1
6 2018-04-01 Baltimore True 2 2
7 2018-04-02 Baltimore True 3 3
8 2018-04-03 Baltimore False 3 3
9 2018-04-06 Baltimore True 3 1 <-difference because 2018-04-05 missing
10 2018-04-07 Baltimore True 3 2
11 2018-04-08 Baltimore False 2 2
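An offset window such as 4 days needs a datetime index, while the sample frame in the question stores response_date as strings. Here is a self-contained sketch of the calendar-day variant with that conversion added, assuming a single city so the date index is unique. The helper column `flag` is introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Baltimore"] * 12,
    "response_date": [
        "2018-03-24", "2018-03-25", "2018-03-26", "2018-03-27",
        "2018-03-30", "2018-03-31", "2018-04-01", "2018-04-02",
        "2018-04-03", "2018-04-06", "2018-04-07", "2018-04-08",
    ],
    "in_compliance": [True, False, False, False, False, True,
                      True, True, False, True, True, False],
})

# Offset-based windows require real datetimes, not strings
df["response_date"] = pd.to_datetime(df["response_date"])
df["flag"] = df["in_compliance"].astype(int)

# Sort by city and date, then move the dates into the index
df = df.sort_values(["city", "response_date"]).set_index("response_date")

# "4D" covers the 4 calendar days ending at (and including) each row's date
rolled = (
    df.groupby("city")["flag"]
      .rolling("4D", min_periods=1)
      .sum()
)

# Rows are already in (city, date) order, so positional assignment lines up
df["new"] = rolled.astype(int).to_numpy()
df = df.drop(columns="flag").reset_index()
print(df)
```

For 2018-04-06 the window covers 2018-04-03 through 2018-04-06, which holds only two rows, so this counts fewer values than a four-row window does whenever dates are missing.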