Pandas: How to use .agg()

I have a dataframe that contains customers' ratings of the restaurants they have visited, along with a few other attributes.


import numpy as np
import pandas as pd

data = {'rating_id': ['1', '2','3','4','5','6','7'],
        'user_id': ['56', '13','56','99','99','13','12'],
        'restaurant_id':  ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0'],
        'rating_year': ['2012','2012','2020','2001','2020','2015','2000'],
        'first_year': ['2012', '2012','2001','2001','2012','2000','2000'],
        'last_year': ['2020', '2020','2020','2020','2020','2015','2015'],
        }

df = pd.DataFrame(data, columns=['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])

df.head()

df['star_rating'] = df['star_rating'].astype(float)

# calculate the average of the stars of the first year 

ratings_mean_firstYear= df.groupby(['restaurant_id','first_year']).agg({'star_rating':[np.mean]})
ratings_mean_firstYear.columns = ['avg_firstYear']
ratings_mean_firstYear.reset_index()

# calculate the average of the stars of the last year 

ratings_mean_lastYear= df.groupby(['restaurant_id','last_year']).agg({'star_rating':[np.mean]})
ratings_mean_lastYear.columns = ['avg_lastYear']
ratings_mean_lastYear.reset_index()

# merge the means into a single table

ratings_average = ratings_mean_firstYear.merge(
    ratings_mean_lastYear.groupby('restaurant_id')['avg_lastYear'].max()
    , on='restaurant_id'
)

ratings_average.head(20)

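My problem is that the averages for the first and last years come out exactly the same, which makes no sense, and I can't see where my reasoning goes wrong. I suspect it has something to do with how I am using .agg, since this is my first time using the pandas library.

Any suggestions?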

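Your data is laid out so that each user/restaurant pair has a single rating, and you use that same rating in both the first-year and the last-year aggregation, so the two averages are naturally equal. Put differently, first_year and last_year are constant within each restaurant, so grouping by either of them produces the same groups as grouping by restaurant_id alone. I would first filter the data with the condition rating_year == first_year and then apply groupby and agg, repeat the same for the last year, and finally merge the two results. Your example has only one or two reviews per restaurant that fall in its first or last year, so a proper demonstration needs more data. I assume you have that in your larger dataframe.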
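To see why the two averages coincide on the question's data, here is a small check. It is only a sketch and not part of the original answer; the variable names (`distinct_years`, `by_first`, `by_last`, `by_restaurant`) are made up for illustration, and columns that play no role in the calculation are left out.

```python
import numpy as np
import pandas as pd

# Relevant columns copied from the question's data.
df = pd.DataFrame({
    'restaurant_id': ['xxx', 'xxx', 'yyy', 'yyy', 'xxx', 'zzz', 'zzz'],
    'star_rating': [2.3, 3.7, 1.2, 5.0, 1.0, 3.2, 1.0],
    'first_year': ['2012', '2012', '2001', '2001', '2012', '2000', '2000'],
    'last_year': ['2020', '2020', '2020', '2020', '2020', '2015', '2015'],
})

# Each restaurant has exactly one first_year and one last_year value,
# so adding either column to the groupby keys does not split the groups.
distinct_years = df.groupby('restaurant_id')[['first_year', 'last_year']].nunique()
print(distinct_years)

by_first = df.groupby(['restaurant_id', 'first_year'])['star_rating'].mean()
by_last = df.groupby(['restaurant_id', 'last_year'])['star_rating'].mean()
by_restaurant = df.groupby('restaurant_id')['star_rating'].mean()

# Both groupings reduce to the plain per-restaurant mean over all ratings.
same_as_first = np.allclose(by_first.to_numpy(), by_restaurant.to_numpy())
same_as_last = np.allclose(by_last.to_numpy(), by_restaurant.to_numpy())
print(same_as_first, same_as_last)
```

Neither groupby looks at rating_year at all, which is why a filter on rating_year is needed before aggregating.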

Here is an example. I added more rows and changed the years to get more matches:

import pandas as pd

data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
        'user_id': ['56', '56','56','56', '99','99','99','99','99'],
        'restaurant_id':  ['xxx', 'xxx','yyy','yyy','xxx', 'xxx','yyy','yyy','xxx'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','4.0','2.5','3.0'],
        'rating_year': ['2012', '2020','2001','2020', '2012', '2020','2001','2020','2019'],
        'first_year': ['2012', '2012','2001','2001','2012', '2012','2001','2001','2012'],
        'last_year': ['2020', '2020','2020','2020','2020','2020','2020','2020','2020'],
        }

df = pd.DataFrame(data, columns=['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
df['star_rating'] = df['star_rating'].astype(float)

# average only the ratings given in each restaurant's first year
ratings_mean_firstYear = df[df.rating_year == df.first_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_firstYear.columns = ['avg_firstYear']

# average only the ratings given in each restaurant's last year
ratings_mean_lastYear = df[df.rating_year == df.last_year].groupby('restaurant_id').agg({'star_rating':'mean'})
ratings_mean_lastYear.columns = ['avg_lastYear']

Result:

ratings_mean_firstYear.merge(ratings_mean_lastYear, left_index=True, right_index=True)

               avg_firstYear  avg_lastYear
restaurant_id                             
xxx                     1.65          3.45
yyy                     2.60          3.75
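One caveat about the merge: it is an inner join by default, so a restaurant with no ratings in its first year or in its last year would drop out of the result. As an alternative sketch (not from the original answer), the two filtered averages can be computed in a single groupby by masking the ratings with `Series.where` and using named aggregation. The helper columns `first_rating` and `last_rating` and the name `summary` are made up for illustration, and only the columns needed for the calculation are included.

```python
import pandas as pd

# Relevant columns copied from the answer's extended data.
df = pd.DataFrame({
    'restaurant_id': ['xxx', 'xxx', 'yyy', 'yyy', 'xxx', 'xxx', 'yyy', 'yyy', 'xxx'],
    'star_rating': [2.3, 3.7, 1.2, 5.0, 1.0, 3.2, 4.0, 2.5, 3.0],
    'rating_year': ['2012', '2020', '2001', '2020', '2012', '2020', '2001', '2020', '2019'],
    'first_year': ['2012', '2012', '2001', '2001', '2012', '2012', '2001', '2001', '2012'],
    'last_year': ['2020', '2020', '2020', '2020', '2020', '2020', '2020', '2020', '2020'],
})

summary = (
    df.assign(
        # Keep the rating only where the review falls in the first/last year;
        # every other row becomes NaN, which mean() skips.
        first_rating=df['star_rating'].where(df['rating_year'] == df['first_year']),
        last_rating=df['star_rating'].where(df['rating_year'] == df['last_year']),
    )
    .groupby('restaurant_id')
    .agg(avg_firstYear=('first_rating', 'mean'),
         avg_lastYear=('last_rating', 'mean'))
)
print(summary)
```

With this layout, a restaurant that has no matching reviews for one of the two years keeps its row and shows NaN in that column instead of disappearing.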