Get the average for a specific range of dates in pandas

I need to group my data by website and get the average number of views within specific date ranges. My data looks like this:

date        website         amount_views
1/1/2021        a               23
1/2/2021        a               17
1/3/2021        a               10
1/4/2021        a               25
1/5/2021        a               2
1/1/2021        b               12
1/2/2021        b               7
1/3/2021        b               5
1/4/2021        b               17
1/5/2021        b               2

So I need the average for websites a and b over two date ranges: 1/1/2021 - 1/3/2021 (pre) and 1/3/2021 - 1/5/2021 (post). The expected output is:

date        website         avg_amount_views
pre            a                 31.5
post           a                 35.6
pre            b                 15.5
post           b                 22.6

import pandas as pd

# test dataframe
data = {'date': ['1/1/2021', '1/2/2021', '1/3/2021', '1/4/2021', '1/5/2021', '1/1/2021', '1/2/2021', '1/3/2021', '1/4/2021', '1/5/2021'], 'website': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'], 'amount_views': [23, 17, 10, 25, 2, 12, 7, 5, 17, 2]}

df = pd.DataFrame(data)

# set the date column to a datetime format - required
df.date = pd.to_datetime(df.date)

# groupby with pd.Grouper
mean_visits = df.groupby([pd.Grouper(key='date', freq='W'), 'website'])['amount_views'].mean().reset_index(name='mean_visits')

# display(mean_visits)
        date website  mean_visits
0 2021-01-03       a    16.666667
1 2021-01-03       b     8.000000
2 2021-01-10       a    13.500000
3 2021-01-10       b     9.500000
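
A note on why the weekly grouping above lines up with the requested ranges: `freq='W'` produces weekly bins that end on Sunday, and 1/3/2021 happens to be a Sunday. The sketch below is only an illustration of that coincidence (the `bins` frame and `week_end` column are names made up here); for ranges that do not align with calendar weeks, one of the labeling approaches further down is probably a safer fit.

```python
import pandas as pd

# The five distinct dates from the question
dates = pd.Series(pd.to_datetime(
    ['1/1/2021', '1/2/2021', '1/3/2021', '1/4/2021', '1/5/2021']))

# Week-ending-Sunday label for each date, normalized to midnight
week_end = dates.dt.to_period('W-SUN').dt.end_time.dt.normalize()

# Hypothetical helper frame showing which weekly bin each date falls into
bins = pd.DataFrame({
    'date': dates,
    'weekday': dates.dt.day_name(),
    'week_end': week_end,
})
print(bins)
```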

You can use np.where and Series.between to assign a pre or post status, then group by that status and website and take the mean.

As a one-liner (though not very readable):

  import numpy as np

  df['date'] = pd.to_datetime(df['date'])
  df.groupby([np.where(df['date'].between('1/1/2021', '1/3/2021'), 'pre', 'post'),
              'website'])['amount_views'].mean().to_frame('mean')

Step by step (more readable):

df['date']=pd.to_datetime(df['date'])
df['status']=np.where(df['date'].between('1/1/2021','1/3/2021'),'pre','post')
df.groupby(['status','website'])['amount_views'].mean().to_frame('mean')

                     mean
status website           
post   a        13.500000
       b         9.500000
pre    a        16.666667
       b         8.000000

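Use np.select:

import numpy as np

dates = pd.to_datetime(df['date'])
# np.select takes the first matching condition, so 1/3/2021 is labeled 'pre'.
# The explicit string default is an addition: it avoids a dtype mismatch
# between the string labels and the integer default on recent NumPy versions.
status = np.select([dates.between('1/1/2021', '1/3/2021'),
                    dates.between('1/3/2021', '1/5/2021')],
                   ['pre', 'pos'], default='other')
new_df = (df.groupby(['website', status])
            .amount_views
            .mean()
            .rename_axis(('website', 'date'))
            .reset_index(name='avg_amount_views'))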
print(new_df)

  website date  avg_amount_views
0       a  pos         13.500000
1       a  pre         16.666667
2       b  pos          9.500000
3       b  pre          8.000000
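
The results above list the post rows before the pre rows because plain string labels sort alphabetically. To get the row order shown in the question (pre before post within each website), one possibility is to make the label an ordered categorical. A rough sketch, assuming the same sample data; the `ordered` name is only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['1/1/2021', '1/2/2021', '1/3/2021', '1/4/2021', '1/5/2021'] * 2,
    'website': ['a'] * 5 + ['b'] * 5,
    'amount_views': [23, 17, 10, 25, 2, 12, 7, 5, 17, 2],
})
df['date'] = pd.to_datetime(df['date'])

labels = np.where(df['date'].between('1/1/2021', '1/3/2021'), 'pre', 'post')
# Ordered categorical so that 'pre' sorts before 'post'
df['status'] = pd.Categorical(labels, categories=['pre', 'post'], ordered=True)

ordered = (df.groupby(['website', 'status'], observed=True)['amount_views']
             .mean()
             .reset_index(name='avg_amount_views')
             [['status', 'website', 'avg_amount_views']])
print(ordered)
```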

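You can use pd.cut to define 'pre' and 'post'. With right=False each bin includes its left edge and excludes its right edge, so 1/3/2021 falls into 'post' here: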

grp = pd.cut(df['date'], bins=[pd.Timestamp(2021, 1, 1), 
                               pd.Timestamp(2021, 1, 3), 
                               pd.Timestamp(2021, 1, 6)], labels=['pre', 'post'],
      right=False)

df.groupby([grp, 'website'])['amount_views'].agg(['mean','count']).reset_index()

Output:

   date website       mean  count
0   pre       a  20.000000      2
1   pre       b   9.500000      2
2  post       a  12.333333      3
3  post       b   8.000000      3