Python Pandas Dataframe - Groupby and Average based on Condition

I have a dataframe like the following:

id  start       end         diff mindiff
1   2015-01-02  2015-07-01  180 57
2   2015-02-03  2015-05-12  98  56
3   2015-01-15  2015-01-20  5   5
4   2015-02-04  2015-04-15  70  55
5   2015-03-15  2015-05-01  47  46
6   2015-02-22  2015-03-01  7   7
7   2015-03-21  2015-04-12  22  22
8   2015-04-11  2015-06-15  65  50
9   2015-04-11  2015-05-01  20  20
10  2015-03-30  2015-04-01  2   2
11  2015-04-28  2015-06-15  48  33
12  2015-05-01  2015-06-01  31  31
13  2015-05-10  2015-06-09  30  30
14  2015-05-19  2015-07-01  43  42
15  2015-06-01  2015-06-06  5   5
16  2015-06-02  2015-06-29  27  27
17  2015-04-29  2015-05-21  22  22
18  2015-05-25  2015-07-01  37  36
19  2015-06-04  2015-06-26  22  22
20  2015-06-21  2015-07-01  10  10
21  2015-05-30  2015-06-06  7   7
22  2015-06-30  2015-07-01  1   1

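The fields are id, start (date), end (date), diff (the number of days between start and end), and mindiff (the minimum of diff and the number of days from start to the last day of the month x months after start).

x in this case is 1 (so the cutoff is the last day of the month one month "later than" the start date). For example, id 1 starts on 2015-01-02, the cutoff is 2015-02-28, and mindiff = min(180, 57) = 57.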
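As a rough sketch of how diff and mindiff could be derived from start and end, the snippet below uses monthly periods to find the cutoff date. The helper name add_diffs is hypothetical, the three sample rows are copied from the table above, and the reading of the cutoff as "last calendar day of the month x months after the start month" is an assumption based on the definition given here.

```python
import numpy as np
import pandas as pd

def add_diffs(df, x=1):
    """Hypothetical helper: add 'diff' and 'mindiff' (both in days) to a copy of df."""
    out = df.copy()
    start = pd.to_datetime(out["start"])
    end = pd.to_datetime(out["end"])
    # Last calendar day of the month that lies x months after the start month.
    cutoff = (start.dt.to_period("M") + x).dt.end_time.dt.normalize()
    out["diff"] = (end - start).dt.days
    # mindiff is the smaller of diff and the days from start to the cutoff.
    out["mindiff"] = np.minimum(out["diff"], (cutoff - start).dt.days)
    return out

# Rows 1-3 of the table above.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "start": ["2015-01-02", "2015-02-03", "2015-01-15"],
    "end": ["2015-07-01", "2015-05-12", "2015-01-20"],
})
result = add_diffs(sample, x=1)
print(result[["id", "diff", "mindiff"]])
```

For these three rows the computed values agree with the diff and mindiff columns of the table (180/57, 98/56, 5/5).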

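What I want to accomplish is to find the average (mean) of mindiff, grouped by the year/month of 'end', but averaging in each group only those records whose start year/month is within x (defined above) months of the grouped-by month. For example, in the dataset above, id 1 would only be averaged in year/month 2015/1 and 2015/1+x (2015/2).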

Here is a table marking each record and the months in which I want it to be averaged:

    Months                      
id  1   2   3   4   5   6   7
1   1   1                   
2       1   1               
3   1                       
4       1   1               
5           1   1           
6       1   1               
7           1   1           
8               1   1       
9               1   1       
10          1   1           
11              1   1       
12                  1   1   
13                  1   1   
14                  1   1   
15                      1   
16                      1   
17              1   1       
18                  1   1   
19                      1   
20                      1   1
21                  1   1   
22                      1   1
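A flag table like the one above could be produced along these lines. This is only a sketch for x = 1 with four rows taken from the data. The rule it encodes (a record counts in its start month, and also in the following month when its end falls in a later month) is inferred from the table, and the names long and flags are made up for the illustration.

```python
import pandas as pd

# Four rows from the data above (ids 1, 2, 3 and 20).
df = pd.DataFrame({
    "id": [1, 2, 3, 20],
    "start": pd.to_datetime(["2015-01-02", "2015-02-03", "2015-01-15", "2015-06-21"]),
    "end": pd.to_datetime(["2015-07-01", "2015-05-12", "2015-01-20", "2015-07-01"]),
})

period_start = df["start"].dt.to_period("M")
period_end = df["end"].dt.to_period("M")

# Every record counts in its start month ...
in_start_month = pd.DataFrame({"id": df["id"], "month": period_start})
# ... and in the following month if it ends in a later month.
in_next_month = pd.DataFrame({"id": df["id"], "month": period_start + 1})[period_end > period_start]

long = pd.concat([in_start_month, in_next_month], ignore_index=True)
# One row per id, one column per month, 1 where the record is counted.
flags = pd.crosstab(long["id"], long["month"].astype(str))
print(flags)
```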

Here is the mind map and the resulting AVG/month I am looking for:

    Months                      
id  1   2   3   4   5   6   7
1   57  57                  
2       56  56              
3   5                       
4       55  55              
5           46  46          
6       7   7               
7           22  22          
8               50  50      
9               20  20      
10          2   2           
11              33  33      
12                  31  31  
13                  30  30  
14                  42  42  
15                      5   
16                      27  
17              22  22      
18                  36  36  
19                      22  
20                      10  10
21                  7   7   
22                      1   1
AVG 31  43.8    31.3    27.9    30.1    21.1    5.5

Finally, this is the dataframe I am looking for:

Month   Avg Diff Trailing x months
2015-01 31
2015-02 43.75
2015-03 31.33333333
2015-04 27.85714286
2015-05 30.11111111
2015-06 21.1
2015-07 5.5
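To check that the target numbers are reachable without a row loop, here is a sketch that runs on the 22 rows above and takes mindiff as given. It assumes x = 1 and the same inferred rule as the marker table (start month, plus the next month when the record ends in a later month). The variable names are illustrative only.

```python
import io
import pandas as pd

data = """id start end diff mindiff
1 2015-01-02 2015-07-01 180 57
2 2015-02-03 2015-05-12 98 56
3 2015-01-15 2015-01-20 5 5
4 2015-02-04 2015-04-15 70 55
5 2015-03-15 2015-05-01 47 46
6 2015-02-22 2015-03-01 7 7
7 2015-03-21 2015-04-12 22 22
8 2015-04-11 2015-06-15 65 50
9 2015-04-11 2015-05-01 20 20
10 2015-03-30 2015-04-01 2 2
11 2015-04-28 2015-06-15 48 33
12 2015-05-01 2015-06-01 31 31
13 2015-05-10 2015-06-09 30 30
14 2015-05-19 2015-07-01 43 42
15 2015-06-01 2015-06-06 5 5
16 2015-06-02 2015-06-29 27 27
17 2015-04-29 2015-05-21 22 22
18 2015-05-25 2015-07-01 37 36
19 2015-06-04 2015-06-26 22 22
20 2015-06-21 2015-07-01 10 10
21 2015-05-30 2015-06-06 7 7
22 2015-06-30 2015-07-01 1 1
"""
df = pd.read_csv(io.StringIO(data), sep=r"\s+", parse_dates=["start", "end"])

ps = df["start"].dt.to_period("M")
pe = df["end"].dt.to_period("M")

# Each record contributes its mindiff to the start month ...
first = pd.DataFrame({"Month": ps, "mindiff": df["mindiff"]})
# ... and to the following month when it ends in a later month.
second = pd.DataFrame({"Month": ps + 1, "mindiff": df["mindiff"]})[pe > ps]

result = (
    pd.concat([first, second], ignore_index=True)
    .groupby("Month")["mindiff"]
    .mean()
    .rename("Avg Diff Trailing x months")
    .reset_index()
)
print(result)
```

On this data the seven monthly means come out as 31, 43.75, 31.33, 27.86, 30.11, 21.1 and 5.5 for 2015-01 through 2015-07, matching the AVG row of the table above.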

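I know this could be done with a loop, but my gut tells me that GROUPBY would be more pythonic and probably more efficient. However, how can I average the specific rolling mindiff values by 'start' month within a groupby on the 'end' year/month? Thanks for your help.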

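First I create test data with different years and set the start of the last row to December. Then I convert the start and end columns to monthly periods, stored in the columns periodS and periodE. If periodE is later than periodS, a new column period holds periodS plus one month; otherwise it holds a missing value.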
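The December row matters because adding one month has to roll over into the next year. A small sketch of that period arithmetic follows. The two rows are taken from the test data, and the names following and ends_later are only for illustration.

```python
import pandas as pd

start = pd.to_datetime(pd.Series(["2017-12-30", "2017-06-04"]))
end = pd.to_datetime(pd.Series(["2018-02-01", "2017-06-26"]))

# Monthly periods drop the day and keep only year and month.
periodS = start.dt.to_period("M")
periodE = end.dt.to_period("M")

# Adding an integer to a monthly period moves it by that many months.
following = periodS + 1
# Periods compare chronologically, so this flags records that end in a later month.
ends_later = periodE > periodS

print(following.astype(str).tolist())  # ['2018-01', '2017-07']
print(ends_later.tolist())             # [True, False]
```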

Then I use groupby on the months column and compute the mean of the Avg column:

g = df1.groupby('months')['Avg'].mean().reset_index()

The full code:

import pandas as pd
import numpy as np
import io

temp=u"""id;start;end
1;2014-01-02;2014-07-01
2;2014-02-03;2014-05-12
3;2014-01-15;2014-01-20
4;2014-02-04;2014-04-15
5;2014-03-15;2014-05-01
6;2014-02-22;2014-03-01
7;2015-03-21;2015-04-12
8;2015-04-11;2015-06-15
9;2015-04-11;2015-05-01
10;2015-03-30;2015-04-01
11;2015-04-28;2015-06-15
12;2015-05-01;2015-06-01
13;2015-05-10;2015-06-09
14;2016-05-19;2016-07-01
15;2016-06-01;2016-06-06
16;2016-06-02;2016-06-29
17;2016-04-29;2016-05-21
18;2016-05-25;2016-07-01
19;2017-06-04;2017-06-26
20;2017-06-21;2017-07-01
21;2017-05-30;2017-06-06
22;2017-12-30;2018-02-01"""

df = pd.read_csv(io.StringIO(temp), sep=";", index_col=[0])
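print(df)

def last_day_of_next_month(any_day):
    # day 28 exists in every month; 36 days later is always day 2-5 of the month after next
    next_month = any_day.replace(day=28) + pd.Timedelta(days=36)
    # stepping back by that day-of-month lands on the last day of the month after any_day's month
    return next_month - pd.Timedelta(days=next_month.day)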

#days from start to the last day of the following month
df['mindiff'] = (pd.to_datetime(df['start']).apply(last_day_of_next_month) - pd.to_datetime(df['start'])).dt.days
#days from start to end
df['diff'] = (pd.to_datetime(df['end']) - pd.to_datetime(df['start'])).dt.days
#keep the smaller of the two values
df['mindiff'] = df[['mindiff', 'diff']].min(axis=1)
#print df

#convert start and end to monthly periods
df['periodS'] =  pd.to_datetime(df['start']).dt.to_period('M')
df['periodE'] =  pd.to_datetime(df['end']).dt.to_period('M')

#if the end period is later than the start period, use the start period plus one month, else NaT
df['period'] = (df['periodS'] + 1).where(df['periodE'] > df['periodS'])
#print df
#keep only the columns needed for the average
df1 = df[['mindiff', 'periodS', 'period']]
#stack the two period columns into rows (one row per record and month)
df1 = df1.set_index('mindiff').stack().reset_index()
#rename the columns
df1.columns = ['Avg', 'tmp', 'months']
#group by the months column and compute the mean of the Avg column
g = df1.groupby('months')['Avg'].mean().reset_index()
print(g)
#     months        Avg
#0   2014-01  31.000000
#1   2014-02  43.750000
#2   2014-03  41.000000
#3   2014-04  46.000000
#4   2015-03  12.000000
#5   2015-04  25.400000
#6   2015-05  32.800000
#7   2015-06  30.500000
#8   2016-04  22.000000
#9   2016-05  33.333333
#10  2016-06  27.500000
#11  2017-05   7.000000
#12  2017-06  13.000000
#13  2017-07  10.000000
#14  2017-12  32.000000
#15  2018-01  32.000000
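The code above hard-codes x = 1. If a wider window is needed, one possible generalization is sketched below. It is not part of the original solution and rests on an assumption: that a record should count in its start month and in up to x following months, but never in a month after its end month. The function name monthly_average is hypothetical, and mindiff is expected to have been computed with the same x.

```python
import pandas as pd

def monthly_average(df, x=1):
    """Hypothetical helper: mean of mindiff per month for a window of x months.

    Expects datetime columns 'start' and 'end' and a 'mindiff' column
    that was computed with the same x.
    """
    periodS = df["start"].dt.to_period("M")
    periodE = df["end"].dt.to_period("M")
    parts = []
    # One pass per month offset (0..x), not per row.
    for k in range(x + 1):
        month = periodS + k
        # Assumption: a record is never counted in a month after its end month.
        keep = month <= periodE
        parts.append(pd.DataFrame({"months": month[keep], "Avg": df.loc[keep, "mindiff"]}))
    stacked = pd.concat(parts, ignore_index=True)
    return stacked.groupby("months")["Avg"].mean().reset_index()

# Rows 1, 3 and 22 of the test data, with mindiff as computed above.
df = pd.DataFrame({
    "start": pd.to_datetime(["2014-01-02", "2014-01-15", "2017-12-30"]),
    "end": pd.to_datetime(["2014-07-01", "2014-01-20", "2018-02-01"]),
    "mindiff": [57, 5, 32],
})
g = monthly_average(df, x=1)
print(g)
```

With x = 1 the condition month <= periodE reduces to the periodE > periodS test used above, so both versions should agree on the same input.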