Python Pandas: Sum Values in Columns If Date Is Between 2 Dates
I have a dataframe df which can be created with this:
import datetime
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 2, 2],
        'date1': [datetime.date(2016, 1, 1), datetime.date(2016, 1, 2), datetime.date(2016, 1, 3), datetime.date(2016, 1, 4),
                  datetime.date(2016, 1, 2), datetime.date(2016, 1, 4), datetime.date(2016, 1, 3), datetime.date(2016, 1, 1)],
        'date2': [datetime.date(2016, 1, 5), datetime.date(2016, 1, 3), datetime.date(2016, 1, 5), datetime.date(2016, 1, 5),
                  datetime.date(2016, 1, 4), datetime.date(2016, 1, 5), datetime.date(2016, 1, 4), datetime.date(2016, 1, 1)],
        'score1': [5, 7, 3, 2, 9, 3, 8, 3],
        'score2': [1, 3, 0, 5, 2, 20, 7, 7]}
df = pd.DataFrame.from_dict(data)
It looks like this:
id date1 date2 score1 score2
0 1 2016-01-01 2016-01-05 5 1
1 1 2016-01-02 2016-01-03 7 3
2 1 2016-01-03 2016-01-05 3 0
3 1 2016-01-04 2016-01-05 2 5
4 2 2016-01-02 2016-01-04 9 2
5 2 2016-01-04 2016-01-05 3 20
6 2 2016-01-03 2016-01-04 8 7
7 2 2016-01-01 2016-01-01 3 7
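What I need to do is create one column for each of score1 and score2. Each new column sums the values of score1 or score2 over the rows where usedate falls between date1 and date2. usedate is created by taking all dates between, and including, the minimum of date1 and the maximum of date2. I used this to create the date range: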
drange=pd.date_range(df.date1.min(),df.date2.max())
The resulting dataframe newdf should look like this:
usedate score1sum score2sum
0 2016-01-01 8 8
1 2016-01-02 21 6
2 2016-01-03 32 13
3 2016-01-04 30 35
4 2016-01-05 13 26
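To clarify, on usedate 2016-01-01, score1sum is 8. This is calculated by looking at the rows in df where 2016-01-01 is between, and including, date1 and date2, which sums row0(5) and row7(3). On usedate 2016-01-04, score2sum is 35. This is calculated by looking at the rows in df where 2016-01-04 is between, and including, date1 and date2, which sums row0(1), row2(0), row3(5), row4(2), row5(20), row6(7).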
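The rule described above can be checked for a single usedate with a short sketch. It rebuilds the same data from date strings (an assumption made only for brevity; the original uses datetime.date objects) and keeps only the columns needed for the check:

```python
import pandas as pd

# Same rows as df in the question, built from strings for brevity.
df = pd.DataFrame({
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})

usedate = pd.Timestamp('2016-01-04')

# Rows whose [date1, date2] interval contains usedate (both ends inclusive).
covering = df[(df['date1'] <= usedate) & (df['date2'] >= usedate)]

print(covering.index.tolist())   # [0, 2, 3, 4, 5, 6]
print(covering['score2'].sum())  # 35
```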
Maybe some kind of groupby, or melt and then groupby?
Approach 1: list comprehension
This is inelegant, but hey, it works! (Edit: added a second approach below.)
# Convert datetime.date to pandas timestamps for easier comparisons
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
# solution
newdf = pd.DataFrame(data=drange, columns=['usedate'])
# for each usedate ud, get all df rows whose dates contain ud,
# then sum the scores of these rows
newdf['score1sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score1'].sum() for ud in drange]
newdf['score2sum'] = [df[(df['date1'] <= ud) & (df['date2'] >= ud)]['score2'].sum() for ud in drange]
# output
newdf
usedate score1sum score2sum
2016-01-01 8 8
2016-01-02 21 6
2016-01-03 32 13
2016-01-04 30 35
2016-01-05 13 26
Approach 2: helper function with transform (or apply)
newdf = pd.DataFrame(data=drange, columns=['usedate'])
def sum_scores(d):
    # sum both score columns over the rows whose date interval contains d
    return df[(df['date1'] <= d) & (df['date2'] >= d)][['score1', 'score2']].sum()
# apply works here too, and is about equally fast in my testing
newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores)
# newdf is the same as above
Timing comparison
# Jupyter timeit cell magic
%%timeit
newdf['score1sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score1'].sum() for d in drange]
newdf['score2sum'] = [df[(df['date1'] <= d) & (df['date2'] >= d)]['score2'].sum() for d in drange]
100 loops, best of 3: 10.4 ms per loop
# Jupyter timeit line magic
%timeit newdf[['score1sum', 'score2sum']] = newdf['usedate'].transform(sum_scores)
100 loops, best of 3: 8.51 ms per loop
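If the per-date Python loop becomes a bottleneck, one possible alternative is to build the whole "date inside interval" mask at once with NumPy broadcasting. This is only a sketch, not part of the answer above and not timed here. It assumes the date columns are already datetime64, rebuilds the data from strings for brevity, and uses memory proportional to (number of dates) x (number of rows):

```python
import numpy as np
import pandas as pd

# Same rows as df in the question, built from strings for brevity.
df = pd.DataFrame({
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})
drange = pd.date_range(df['date1'].min(), df['date2'].max())

dates = drange.to_numpy()[:, None]        # shape (n_dates, 1)
start = df['date1'].to_numpy()[None, :]   # shape (1, n_rows)
end = df['date2'].to_numpy()[None, :]     # shape (1, n_rows)

# mask[i, j] is True when drange[i] lies inside row j's interval (inclusive).
mask = (start <= dates) & (dates <= end)

# A matrix product sums the scores of the covering rows for every date.
sums = mask.astype(int) @ df[['score1', 'score2']].to_numpy()

newdf = pd.DataFrame(sums, columns=['score1sum', 'score2sum'])
newdf.insert(0, 'usedate', drange)
print(newdf)
```

For large inputs the dense mask may not fit in memory, in which case the loop-based approaches or a join-based approach may be the safer choice.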
You can use apply with a lambda function:
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
df1 = pd.DataFrame(index=pd.date_range(df.date1.min(), df.date2.max()), columns = ['score1sum', 'score2sum'])
df1[['score1sum','score2sum']] = df1.apply(lambda x: df.loc[(df.date1 <= x.name) &
                                                            (x.name <= df.date2),
                                                            ['score1','score2']].sum(), axis=1)
df1.rename_axis('usedate').reset_index()
Output:
usedate score1sum score2sum
0 2016-01-01 8 8
1 2016-01-02 21 6
2 2016-01-03 32 13
3 2016-01-04 30 35
4 2016-01-05 13 26
conditional_join from pyjanitor may be helpful for abstraction/convenience:
# pip install pyjanitor
import pandas as pd
import janitor as jn
drange = pd.DataFrame(drange, columns=['dates'])
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
(drange.conditional_join(df,
('dates', 'date1', '>='),
('dates', 'date2', '<='))
.droplevel(0, 1)
.select_columns('dates', 'score*')
.groupby('dates')
.sum()
.add_suffix('num')
)
score1num score2num
dates
2016-01-01 8 8
2016-01-02 21 6
2016-01-03 32 13
2016-01-04 30 35
2016-01-05 13 26
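The question also asks whether some kind of groupby could work. One way to sketch that idea is to treat each row as two events: its scores are added on date1 and subtracted on the day after date2, and a cumulative sum over the calendar then gives the running total. This is an illustrative sketch rather than any of the answers above. It assumes daily granularity and datetime64 columns, and it rebuilds the data from strings for brevity:

```python
import pandas as pd

# Same rows as df in the question, built from strings for brevity.
df = pd.DataFrame({
    'date1': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
                             '2016-01-02', '2016-01-04', '2016-01-03', '2016-01-01']),
    'date2': pd.to_datetime(['2016-01-05', '2016-01-03', '2016-01-05', '2016-01-05',
                             '2016-01-04', '2016-01-05', '2016-01-04', '2016-01-01']),
    'score1': [5, 7, 3, 2, 9, 3, 8, 3],
    'score2': [1, 3, 0, 5, 2, 20, 7, 7],
})
cols = ['score1', 'score2']
drange = pd.date_range(df['date1'].min(), df['date2'].max())

# Scores that start counting on each date1.
opens = df.groupby('date1')[cols].sum()
# Scores that stop counting on the day after each date2 (date2 itself is inclusive).
closes = df.groupby(df['date2'] + pd.Timedelta(days=1))[cols].sum()

# Net change per calendar day; days with no event contribute 0.
delta = opens.reindex(drange, fill_value=0) - closes.reindex(drange, fill_value=0)

newdf = (delta.cumsum()
              .rename(columns=lambda c: c + 'sum')
              .rename_axis('usedate')
              .reset_index())
print(newdf)
```

This avoids comparing every date against every row, at the cost of being less obvious to read than the filtering approaches.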