Pandas: How to filter a column by greater than considering an index
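I have a dataframe representing restaurant customer ratings. star_rating is the customer's rating in this dataframe.
- What I want to do is add a column nb_fave_rating to the same dataframe, representing the total number of favorable ratings for the restaurant. I consider a rating "favorable" if its number of stars is >= 3.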
import pandas as pd

data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
        'user_id': ['56', '13','56','99','99','13','12','88','45'],
        'restaurant_id': ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz','eee','eee'],
        'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0','2.2','0.2'],
        'rating_year': ['2012','2012','2020','2001','2020','2015','2000','2003','2004'],
        'first_year': ['2012', '2012','2001','2001','2012','2000','2000','2001','2001'],
        'last_year': ['2020', '2020','2020','2020','2020','2015','2015','2020','2020'],
        }
df = pd.DataFrame(data, columns=['rating_id','user_id','restaurant_id','star_rating','rating_year','first_year','last_year'])
# star_rating is stored as strings above, so convert it before comparing
df['star_rating'] = df['star_rating'].astype(float)
positive_reviews = df[df.star_rating >= 3.0].groupby('restaurant_id')
positive_reviews.head()
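From here, I don't know how to count the number of positive reviews for each restaurant and add it as a new column to my initial dataframe df.
The expected output is like this.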
data = {'rating_id': ['1', '2','3','4','5','6','7','8','9'],
'user_id': ['56', '13','56','99','99','13','12','88','45'],
'restaurant_id': ['xxx', 'xxx','yyy','yyy','xxx','zzz','zzz','eee','eee'],
'star_rating': ['2.3', '3.7','1.2','5.0','1.0','3.2','1.0','2.2','0.2'],
'rating_year': ['2012','2012','2020','2001','2020','2015','2000','2003','2004'],
'first_year': ['2012', '2012','2001','2001','2012','2000','2000','2001','2001'],
'last_year': ['2020', '2020','2020','2020','2020','2015','2015','2020','2020'],
'nb_fave_rating': ['1', '1','1','1','1','1','1','0','0'],
}
So I tried this and got a bunch of NaN
df['nb_fave_rating']=df[df.star_rating >= 3.0 ].groupby('restaurant_id').agg({'star_rating': 'count'})
df.head()
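The NaN values in the attempt above most likely come from index alignment: the aggregated result is indexed by restaurant_id, while df uses the default integer index, so no labels match when the column is assigned. A minimal sketch of that mismatch, using a small hypothetical frame rather than the full data; the reindex call is only a stand-in for the alignment that column assignment performs:

```python
import pandas as pd

# Small hypothetical frame with the same shape of problem as the question.
df = pd.DataFrame({
    'restaurant_id': ['xxx', 'xxx', 'yyy', 'eee'],
    'star_rating': [2.3, 3.7, 5.0, 0.2],
})

# Count of ratings >= 3 per restaurant; the result is indexed by restaurant_id.
counts = df[df.star_rating >= 3.0].groupby('restaurant_id')['star_rating'].count()
print(counts.index.tolist())  # restaurant labels
print(df.index.tolist())      # integer row labels

# Aligning the counts to df's row labels finds no matching label, hence all NaN.
misaligned = counts.reindex(df.index)
print(misaligned)
```

The answers below avoid this by looking the counts up through restaurant_id instead of through the row index.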
It can be done in one line: groupby(), then transform with a boolean selection, and convert the result to an integer.
df['nb_fave_rating']=df.groupby('restaurant_id')['star_rating'].transform(lambda x: int((x>=3).sum()))
rating_id user_id restaurant_id star_rating rating_year first_year \
0 1 56 xxx 2.3 2012 2012
1 2 13 xxx 3.7 2012 2012
2 3 56 yyy 1.2 2020 2001
3 4 99 yyy 5.0 2001 2001
4 5 99 xxx 1.0 2020 2012
5 6 13 zzz 3.2 2015 2000
6 7 12 zzz 1.0 2000 2000
7 8 88 eee 2.2 2003 2001
8 9 45 eee 0.2 2004 2001
last_year nb_fave_rating
0 2020 1.0
1 2020 1.0
2 2020 1.0
3 2020 1.0
4 2020 1.0
5 2015 1.0
6 2015 1.0
7 2020 0.0
8 2020 0.0
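As a possible variation on the one-liner above, the boolean comparison can be done once on the whole column and then summed per group with the built-in 'sum' aggregation, which avoids calling a Python lambda for each group. This is a sketch under the assumption that only restaurant_id and star_rating matter; the name is_fave is introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'restaurant_id': ['xxx', 'xxx', 'yyy', 'yyy', 'xxx', 'zzz', 'zzz', 'eee', 'eee'],
    'star_rating': [2.3, 3.7, 1.2, 5.0, 1.0, 3.2, 1.0, 2.2, 0.2],
})

# 1 where the rating is favorable (>= 3), 0 otherwise.
is_fave = df['star_rating'].ge(3).astype(int)

# Sum the flags per restaurant and broadcast the total back to every row.
df['nb_fave_rating'] = is_fave.groupby(df['restaurant_id']).transform('sum')
print(df)
```

Because the flags are already integers, the new column comes out as an integer column without a separate conversion step.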
- The solution from Grayrigel using map is the fastest solution.
- Use .groupby to count the ratings >= 3 for each restaurant_id, then .merge positive_reviews back to df.
positive_reviews = df[df.star_rating >= 3.0 ].groupby('restaurant_id', as_index=False).agg({'star_rating': 'count'}).rename(columns={'star_rating': 'nb_fave_rating'})
# join back to df
df = df.merge(positive_reviews, how='left', on='restaurant_id').fillna(0)
# display(df)
rating_id user_id restaurant_id star_rating rating_year first_year last_year nb_fave_rating
0 1 56 xxx 2.3 2012 2012 2020 1.0
1 2 13 xxx 3.7 2012 2012 2020 1.0
2 3 56 yyy 1.2 2020 2001 2020 1.0
3 4 99 yyy 5.0 2001 2001 2020 1.0
4 5 99 xxx 1.0 2020 2012 2020 1.0
5 6 13 zzz 3.2 2015 2000 2015 1.0
6 7 12 zzz 1.0 2000 2000 2015 1.0
7 8 88 eee 2.2 2003 2001 2020 0.0
8 9 45 eee 0.2 2004 2001 2020 0.0
%timeit comparison
- Given the 9-row dataframe df from the question
# create a test dataframe of 1,125,000 rows
dfl = pd.concat([df] * 125000).reset_index(drop=True)
# test with transform
def add_rating_transform(df):
    return df.groupby('restaurant_id')['star_rating'].transform(lambda x: int((x>=3).sum()))

%timeit add_rating_transform(dfl)
[out]:
222 ms ± 9.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test with map
def add_rating_map(df):
    filtered_data = df[df['star_rating'] >= 3]
    d = filtered_data.groupby('restaurant_id')['star_rating'].count().to_dict()
    return df['restaurant_id'].map(d).fillna(0).astype(int)

%timeit add_rating_map(dfl)
[out]:
105 ms ± 1.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test with merge
def add_rating_merge(df):
    positive_reviews = df[df.star_rating >= 3.0 ].groupby('restaurant_id', as_index=False).agg({'star_rating': 'count'}).rename(columns={'star_rating': 'nb_fave_rating'})
    return df.merge(positive_reviews, how='left', on='restaurant_id').fillna(0)

%timeit add_rating_merge(dfl)
[out]:
639 ms ± 26.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
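%timeit is an IPython magic, so the benchmark lines above will not run in a plain Python script. A rough sketch of a comparable measurement with the standard-library timeit module follows. It assumes a much smaller test frame (9,000 rows) so that it finishes quickly, and its timings are not comparable to the figures quoted above:

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'restaurant_id': ['xxx', 'xxx', 'yyy', 'yyy', 'xxx', 'zzz', 'zzz', 'eee', 'eee'],
    'star_rating': [2.3, 3.7, 1.2, 5.0, 1.0, 3.2, 1.0, 2.2, 0.2],
})
# Assumption: 9,000 rows instead of 1,125,000 to keep the run short.
dfl = pd.concat([df] * 1000).reset_index(drop=True)

def add_rating_map(df):
    filtered_data = df[df['star_rating'] >= 3]
    d = filtered_data.groupby('restaurant_id')['star_rating'].count().to_dict()
    return df['restaurant_id'].map(d).fillna(0).astype(int)

# 3 repeats of 5 calls each; report the best per-call time in milliseconds.
runs = timeit.repeat(lambda: add_rating_map(dfl), number=5, repeat=3)
best_ms = min(runs) / 5 * 1000
print(f"map: {best_ms:.2f} ms per call")
```

The other two functions can be timed the same way by swapping the function name inside the lambda.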
#filtering the data with >=3 ratings
filtered_data = df[df['star_rating'] >= 3]
#creating a dict containing the counts of the all the favorable reviews
d = filtered_data.groupby('restaurant_id')['star_rating'].count().to_dict()
#mapping the dictionary to the restaurant_id to generate 'nb_fave_rating'
df['nb_fave_rating'] = df['restaurant_id'].map(d)
#taking care of `NaN` values
df.fillna(0,inplace=True)
#making the column integer (just to match the requirements)
df['nb_fave_rating'] = df['nb_fave_rating'].astype(int)
print(df)
Output:
rating_id user_id restaurant_id star_rating rating_year first_year last_year nb_fave_rating
0 1 56 xxx 2.3 2012 2012 2020 1
1 2 13 xxx 3.7 2012 2012 2020 1
2 3 56 yyy 1.2 2020 2001 2020 1
3 4 99 yyy 5.0 2001 2001 2020 1
4 5 99 xxx 1.0 2020 2012 2020 1
5 6 13 zzz 3.2 2015 2000 2015 1
6 7 12 zzz 1.0 2000 2000 2015 1
7 8 88 eee 2.2 2003 2001 2020 0
8 9 45 eee 0.2 2004 2001 2020 0
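One thing to keep in mind with the map approach above is that df.fillna(0, inplace=True) fills missing values in every column, not only in nb_fave_rating. A hedged sketch of a narrower variant, which passes the grouped Series straight to map and fills only the new column; the missing user_id below is a hypothetical value added for illustration and is not part of the original data:

```python
import pandas as pd

df = pd.DataFrame({
    'restaurant_id': ['xxx', 'xxx', 'eee'],
    'user_id': ['56', None, '45'],  # hypothetical missing value
    'star_rating': [2.3, 3.7, 0.2],
})

# Series of favorable-review counts, indexed by restaurant_id.
counts = df[df['star_rating'] >= 3].groupby('restaurant_id')['star_rating'].count()

# map looks up each restaurant_id in the Series index; restaurants with no
# favorable review are absent from counts and become NaN, then 0.
df['nb_fave_rating'] = df['restaurant_id'].map(counts).fillna(0).astype(int)
print(df)
```

Here the missing user_id stays missing, while nb_fave_rating is filled and cast to int.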
Count the cases where the rating is >= 3.0
df['nb_fave_rating'] = df.groupby('restaurant_id')['star_rating'].transform(lambda x: x.ge(3.0).sum()).astype(int)