Pandas: remove duplicates based on substring
I have the following 2 columns from a Pandas DataFrame:
antecedents consequents
apple orange
orange apple
apple water
apple pineapple
water lemon
lemon water
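I want to remove the duplicates where the same pair appears in both orders (once as antecedent/consequent and again reversed), keeping only the first occurrence, to obtain: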
antecedents consequents
apple orange
apple water
apple pineapple
water lemon
How can I achieve this with Pandas?
Use frozenset on both columns and test for duplicates with Series.duplicated:
df2 = df[~df[['antecedents','consequents']].apply(frozenset,axis=1).duplicated()]
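To make the intermediate steps of the frozenset approach visible, here is a minimal sketch that rebuilds the sample data from the question and splits the one-liner into parts. The names `keys` and `mask` are only illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "antecedents": ["apple", "orange", "apple", "apple", "water", "lemon"],
    "consequents": ["orange", "apple", "water", "pineapple", "lemon", "water"],
})

# One hashable, order-insensitive key per row:
# ("apple", "orange") and ("orange", "apple") give the same frozenset.
keys = df[["antecedents", "consequents"]].apply(frozenset, axis=1)

# True for every row whose key already appeared earlier
# (keep="first" is the default of Series.duplicated).
mask = keys.duplicated()

# Keep only the first occurrence of each unordered pair.
df2 = df[~mask]
print(df2)
```

A frozenset is used rather than a plain set because Series.duplicated needs hashable values.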
Or sort the values in each row with numpy.sort:
df1 = pd.DataFrame(np.sort(df[['antecedents','consequents']], axis=1), index=df.index)
df2 = df[~df1.duplicated()]
print(df2)
antecedents consequents
0 apple orange
2 apple water
3 apple pineapple
4 water lemon
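If this is needed in more than one place, the numpy.sort variant could be wrapped in a small function. The following is only a sketch: the helper name `drop_unordered_duplicates` and its parameters are hypothetical, and it assumes the values within a row can be compared with each other (for example all strings, with no missing values).

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(frame, cols):
    """Hypothetical helper: drop rows whose values in `cols` repeat
    an earlier row when the order of the values is ignored."""
    # Sort the values inside each row so that reversed pairs become identical.
    sorted_vals = np.sort(frame[cols].to_numpy(), axis=1)
    # Keep the original index so the boolean mask aligns with `frame`.
    key = pd.DataFrame(sorted_vals, index=frame.index)
    return frame[~key.duplicated()]


# Sample data reconstructed from the question.
df = pd.DataFrame({
    "antecedents": ["apple", "orange", "apple", "apple", "water", "lemon"],
    "consequents": ["orange", "apple", "water", "pineapple", "lemon", "water"],
})

result = drop_unordered_duplicates(df, ["antecedents", "consequents"])
print(result)
```

The original rows are returned unchanged; the sorted copy is only used to build the duplicate mask.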