Python DataFrame: Dropping duplicates based on certain conditions
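I have a dataframe with duplicate Shop IDs, where some Shop IDs occur twice and some occur three times.
I only want to keep unique Shop IDs, choosing for each one the row with the shortest Shop Distance among the areas it is assigned to.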
    Area  Shop Name  Shop Distance           Shop ID
 0   AAA         Ly             86  5d87790c46a77300
 1   AAA         Hi            230  5ce5522012138400
 2   BBB         Hi            780  5ce5522012138400
 3   CCC         Ly            450  5d87790c46a77300
...
91   MMM         Ju             43  4f76d0c0e4b01af7
92   MMM         Hi           1150  5ce5522012138400
...
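Using pandas drop_duplicates removes the duplicate rows, but it decides which row to keep by the first/last occurrence of the Shop ID, which doesn't let me order by distance: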
shops_df = shops_df.drop_duplicates(subset='Shop ID', keep= 'first')
I also tried grouping by Shop ID and then sorting, but the sort returns an error: Duplicates
bbtshops_new['C'] = bbtshops_new.groupby('Shop ID')['Shop ID'].cumcount()
bbtshops_new.sort_values(by=['C'], axis=1)
This is as far as I have got:
# filter all the duplicates into a new df
df_toclean = shops_df[shops_df['Shop ID'].duplicated(keep= False)]
# create a mask for all unique Shop ID
mask = df_toclean['Shop ID'].value_counts()
# create a mask for the Shop ID that occurred 2 times
shop_2 = mask[mask==2].index
# create a mask for the Shop ID that occurred 3 times
shop_3 = mask[mask==3].index
# create a mask for the Shops that are under radius 750
dist_1 = df_toclean['Shop Distance']<=750
# returns results for all the Shop IDs that appeared twice and under radius 750
bbtshops_2 = df_toclean[dist_1 & df_toclean['Shop ID'].isin(shop_2)]
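* If I use df_toclean['Shop Distance'].min() instead of dist_1, it returns 0 results.
I think I've taken the long way around and still haven't figured out how to drop the duplicates. Does anyone know a shorter way to solve this? I'm new to Python, thanks for your help!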
Try sorting the dataframe by distance first, then dropping the duplicate shops.
df = shops_df.sort_values('Shop Distance')
df = df[~df['Shop ID'].duplicated()]  # The tilde (~) inverts the boolean mask.
Or as a single chained expression (per @chmielcode's comment).
df = (
    shops_df
    .sort_values('Shop Distance')
    .drop_duplicates(subset='Shop ID', keep='first')
    .reset_index(drop=True)  # Optional.
)
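To see the sort-then-drop approach end to end, here is a minimal sketch. The six-row `shops_df` is hypothetical sample data built from the rows visible in the question (the elided rows are left out), and the variable name `nearest` is only for illustration.

```python
import pandas as pd

# Hypothetical sample data modelled on the rows shown in the question.
shops_df = pd.DataFrame({
    "Area": ["AAA", "AAA", "BBB", "CCC", "MMM", "MMM"],
    "Shop Name": ["Ly", "Hi", "Hi", "Ly", "Ju", "Hi"],
    "Shop Distance": [86, 230, 780, 450, 43, 1150],
    "Shop ID": [
        "5d87790c46a77300", "5ce5522012138400", "5ce5522012138400",
        "5d87790c46a77300", "4f76d0c0e4b01af7", "5ce5522012138400",
    ],
})

nearest = (
    shops_df
    .sort_values("Shop Distance")                     # shortest distance first
    .drop_duplicates(subset="Shop ID", keep="first")  # keep the nearest row per Shop ID
    .reset_index(drop=True)
)
print(nearest)
```

Because the rows are sorted before deduplication, `keep='first'` now means "keep the closest", which is what plain `drop_duplicates` on the unsorted frame could not express.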
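You can use idxmin. Note that grouping by 'Area', as here, keeps the nearest shop in each area; group by 'Shop ID' instead if you want one row per Shop ID: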
df.loc[df.groupby('Area')['Shop Distance'].idxmin()]
   Area  Shop Name  Shop Distance           Shop ID
0   AAA         Ly             86  5d87790c46a77300
2   BBB         Hi            780  5ce5522012138400
3   CCC         Ly            450  5d87790c46a77300
4   MMM         Ju             43  4f76d0c0e4b01af7
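For the question as asked (one row per Shop ID rather than per Area), the same idxmin idea might look like the sketch below. It groups by 'Shop ID' instead of 'Area'; the sample frame is hypothetical data built from the rows visible in the question, and `closest` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample data modelled on the rows shown in the question.
shops_df = pd.DataFrame({
    "Area": ["AAA", "AAA", "BBB", "CCC", "MMM", "MMM"],
    "Shop Name": ["Ly", "Hi", "Hi", "Ly", "Ju", "Hi"],
    "Shop Distance": [86, 230, 780, 450, 43, 1150],
    "Shop ID": [
        "5d87790c46a77300", "5ce5522012138400", "5ce5522012138400",
        "5d87790c46a77300", "4f76d0c0e4b01af7", "5ce5522012138400",
    ],
})

# For each Shop ID, find the index label of the row with the smallest distance.
idx = shops_df.groupby("Shop ID")["Shop Distance"].idxmin()

# Select those rows and restore the original row order.
closest = shops_df.loc[idx].sort_index()
print(closest)
```

Unlike the sort-based version, this keeps the original index labels, and it assumes the index is unique so that each label selects exactly one row.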