How do I filter data in one series using another series as the filter?

Question

I have a dataframe containing sales data. It looks like this:

import pandas as pd
df = pd.DataFrame({'order_id': ['A1', 'A2', 'A3', 'A4', 'A5'], 
                   'customer_id': ['C1', 'C2', 'C3', 'C4', 'C5'], 
                   'store': ['Hardware1', 'Grocery3', 'Beauty5', 'Pet2', 'Electronics4'],
                   'price': [20.59, 38.97, 56.84, 89.88, 156.64],
                   'rating': [5, 4, 3,'NA',4]})

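I want to give a promotional offer to stores that meet the following conditions:

The store must have more than 30 ratings in the dataframe
The store's average rating must be greater than 4

I want to return the stores that meet both conditions, so that I know which stores can receive the promotional offer.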

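I'm confused about the best way to break the data down to achieve this. I was thinking of starting by creating a subset of the dataframe with only the data I need, which would look like: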

promo = df[['store', 'rating']]

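After that, I'm not sure what the best next step is. I don't know whether I should write a function that computes the average rating and use it with the .apply() method on 'store'. I'm also not sure a function makes sense, because I don't know how to take the store into account when computing the average rating. I was thinking of: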

promo.groupby('store')['rating']

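However, I don't know whether this makes sense until I have cleaned 'rating' to handle or ignore the NA values. I also considered using .where(), but I don't know what I would define as the filter to apply to the 'store' series.

Any ideas would be appreciated.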

Answer 1

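You can use GroupBy.count to count the ratings in each group and GroupBy.mean to compute each group's average rating. pandas also provides GroupBy.agg, which applies several aggregations in one call; here it is used with count and mean to summarize each store.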

import numpy as np

# Convert the 'NA' strings to NaN; the explicit cast makes the dtype float.
df['rating'] = df['rating'].replace('NA', np.nan).astype(float)
# To keep integers, cast to the nullable integer dtype 'Int64' (capital I) instead:
# df['rating'] = df['rating'].replace('NA', np.nan).astype(float).astype('Int64')

# `count` doesn't include NA values.
stores = df.groupby('store')['rating'].agg(['count', 'mean']).reset_index()
# Stores with more than 30 ratings and an average rating > 4
m = stores['count'].gt(30) & stores['mean'].gt(4)
out = stores.loc[m, "store"]
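
With the five-row sample from the question, every store has a single rating, so the mask above selects nothing. The sketch below uses made-up data (the store names are reused from the question, but the ratings are invented for illustration) to show the same approach end to end, and then uses the resulting store names as the filter on the original 'store' series via Series.isin. It swaps in pd.to_numeric with errors='coerce' for the 'NA' cleanup, which is an alternative to the replace call above, not a requirement.

```python
import pandas as pd

# Hypothetical data: one row per order, several orders per store.
df = pd.DataFrame({
    'store': ['Grocery3'] * 35 + ['Pet2'] * 40 + ['Beauty5'] * 5,
    'rating': [5] * 30 + [4] * 4 + ['NA']   # Grocery3: 34 valid ratings
            + [4] * 20 + [3] * 20            # Pet2: 40 ratings, mean 3.5
            + [5] * 5,                       # Beauty5: only 5 ratings
})

# Turn 'NA' (and any other non-numeric entry) into NaN.
df['rating'] = pd.to_numeric(df['rating'], errors='coerce')

# Per-store rating count (NaN excluded) and mean rating, indexed by store.
stats = df.groupby('store')['rating'].agg(['count', 'mean'])

# Store names that satisfy both conditions.
eligible = stats.index[(stats['count'] > 30) & (stats['mean'] > 4)]
eligible_stores = list(eligible)

# Use those names as the filter on the original 'store' series.
promo_rows = df[df['store'].isin(eligible)]
```

Here only 'Grocery3' passes both checks: 'Pet2' has enough ratings but too low an average, and 'Beauty5' has a high average but too few ratings. promo_rows keeps every original row for the eligible stores, including the row whose rating was 'NA'.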


Tags: python, data-manipulation, series, dataframe, pandas