Drop Duplicate Rows Based on Target Class Conditions

Question

I have a dataset with 3 target classes: ‘Yes’, ‘Maybe’ and ‘No’.

Unique_id       target
111              Yes
111             Maybe
111              No
112              No
112             Maybe
113              No

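I want to remove duplicate rows based on unique_id. But ‘drop_duplicates’ generally keeps the first or the last row, and I want to keep rows based on the following conditions: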

1) If unique_id has all the 3 classes (Yes, Maybe and No), we’ll keep only the ‘Yes’ class.
2) If unique_id has the 2 classes (Maybe and No), we’ll keep only the ‘Maybe’ class.
3) If unique_id has only the ‘No’ class, we’ll keep the ‘No’ class.

I tried ‘sort_values’ on the target class (Yes=1, Maybe=2, No=3) and then dropped the duplicates.
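For reference, the attempt described above could be sketched roughly as follows. The sample frame is rebuilt from the table in the question; the `rank` mapping and the temporary `_rank` column are illustrative names assumed here, not code from the original post.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Unique_id": [111, 111, 111, 112, 112, 113],
    "target": ["Yes", "Maybe", "No", "No", "Maybe", "No"],
})

# Assumed priority mapping: a lower number means a higher priority.
rank = {"Yes": 1, "Maybe": 2, "No": 3}

deduped = (
    df.assign(_rank=df["target"].map(rank))      # temporary sort key
      .sort_values(["Unique_id", "_rank"])       # best class first per id
      .drop_duplicates("Unique_id", keep="first")
      .drop(columns="_rank")
      .reset_index(drop=True)
)
print(deduped)
```

Because the best class is sorted to the top of each Unique_id, keeping the first row per id is enough.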

Desired output:

Unique_id       target
111               Yes
112              Maybe
113               No

I am wondering whether there is a better way to do this.

Any suggestions would be appreciated. Thanks!

Answer 1

You can set the target column to a categorical data type via pd.CategoricalDtype, with the order ['Yes' < 'Maybe' < 'No'], as follows:

import pandas as pd

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)

Then group by Unique_id using .groupby() and take the minimum of target within each group of the same Unique_id using GroupBy.min():

df.groupby('Unique_id', as_index=False)['target'].min()

Result:

   Unique_id target
0        111    Yes
1        112  Maybe
2        113     No
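A self-contained version of the steps above might look like the sketch below. The sample frame is rebuilt from the question's table, and the check for unexpected labels is an added assumption rather than part of the answer: values outside the listed categories become NaN after astype, so the sketch fails early instead.

```python
import pandas as pd

# Sample data rebuilt from the table in the question.
df = pd.DataFrame({
    "Unique_id": [111, 111, 111, 112, 112, 113],
    "target": ["Yes", "Maybe", "No", "No", "Maybe", "No"],
})

categories = ["Yes", "Maybe", "No"]  # highest priority first

# Assumed guard (not in the original answer): labels outside `categories`
# would silently turn into NaN after astype, so stop early if any exist.
unexpected = set(df["target"]) - set(categories)
if unexpected:
    raise ValueError(f"Unexpected target labels: {sorted(unexpected)}")

t = pd.CategoricalDtype(categories=categories, ordered=True)
df["target"] = df["target"].astype(t)

# With an ordered categorical, min() follows the category order,
# so 'Yes' < 'Maybe' < 'No'.
result = df.groupby("Unique_id", as_index=False)["target"].min()
print(result)
```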

Edit

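Case 1: If you have 2 or more similar columns (e.g. target and target2) that are sorted in the same order, you just need to apply the code to both columns. For example, if we have the following dataframe: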

   Unique_id target target2
0        111    Yes      No
1        111  Maybe   Maybe
2        111     No     Yes
3        112     No      No
4        112  Maybe   Maybe
5        113     No   Maybe

You can get the minimum of both columns at the same time, as follows:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df[['target', 'target2']] = df[['target', 'target2']].astype(t)

df.groupby('Unique_id', as_index=False)[['target', 'target2']].min()

Result:

   Unique_id target target2
0        111    Yes     Yes
1        112  Maybe   Maybe
2        113     No   Maybe
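One point that may be worth checking with this approach: the minimum is taken for each column independently, so a result row can combine values from different source rows. The sketch below rebuilds the Case 1 frame and tests whether the 111 result row exists as a single row in the input; the variable names are assumptions made for illustration.

```python
import pandas as pd

# Sample data rebuilt from the Case 1 table.
df = pd.DataFrame({
    "Unique_id": [111, 111, 111, 112, 112, 113],
    "target": ["Yes", "Maybe", "No", "No", "Maybe", "No"],
    "target2": ["No", "Maybe", "Yes", "No", "Maybe", "Maybe"],
})

t = pd.CategoricalDtype(categories=["Yes", "Maybe", "No"], ordered=True)
cols = ["target", "target2"]
df = df.astype({c: t for c in cols})

per_column = df.groupby("Unique_id", as_index=False)[cols].min()
print(per_column)

# Does any single input row for 111 hold 'Yes' in both columns?
row_exists = bool(
    (
        (df["Unique_id"] == 111)
        & (df["target"].astype(str) == "Yes")
        & (df["target2"].astype(str) == "Yes")
    ).any()
)
print(row_exists)  # False: the two 'Yes' values come from different rows
```

If the values need to stay together as they appear in one original row, selecting whole rows (as in Case 2) is likely the better fit.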

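Case 2: If you want to show all the columns in the dataframe rather than just the Unique_id and target columns, you can use a simpler syntax, as follows:

Another sample dataframe: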

   Unique_id target  Amount
0        111    Yes     123
1        111  Maybe     456
2        111     No     789
3        112     No    1234
4        112  Maybe    5678
5        113     No      25

Then, to show all columns of the row with the minimum target within each Unique_id, you can use:

t = pd.CategoricalDtype(categories=['Yes', 'Maybe', 'No'], ordered=True)
df['target'] = df['target'].astype(t)

df.loc[df.groupby('Unique_id')['target'].idxmin()]

Result:

   Unique_id target  Amount
0        111    Yes     123
4        112  Maybe    5678
5        113     No      25
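As a variation on the same idea, the row lookup can also be done on the integer category codes, which keeps idxmin working on plain integers. This is a sketch under assumptions, not the answer's own code: the names `codes` and `best_rows` are made up here, and the target column is assumed to have no missing values.

```python
import pandas as pd

# Sample data rebuilt from the Case 2 table.
df = pd.DataFrame({
    "Unique_id": [111, 111, 111, 112, 112, 113],
    "target": ["Yes", "Maybe", "No", "No", "Maybe", "No"],
    "Amount": [123, 456, 789, 1234, 5678, 25],
})

t = pd.CategoricalDtype(categories=["Yes", "Maybe", "No"], ordered=True)
df["target"] = df["target"].astype(t)

# Position of each label in the category order: Yes=0, Maybe=1, No=2.
# Missing values would get code -1 and sort first, hence the assumption
# that `target` has no missing values.
codes = df["target"].cat.codes

# Index label of the best-ranked row within each Unique_id.
best_rows = codes.groupby(df["Unique_id"]).idxmin()

result = df.loc[best_rows]
print(result)
```

When several rows in a group share the best class, idxmin returns the first of them, so only one row per Unique_id is kept.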

Answer 2

Use map and idxmin:

t = {'Yes':0, 'Maybe':1, 'No':2}
df.loc[df.assign(tar=df.target.map(t)).groupby('Unique_id')['tar'].idxmin()]
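Spelled out step by step, the one-liner above might look like this sketch. The sample frame (including the Amount column) is borrowed from the other answer's example, and the names `order`, `keyed` and `winners` are assumptions made for readability.

```python
import pandas as pd

# Sample data borrowed from the Case 2 table in Answer 1.
df = pd.DataFrame({
    "Unique_id": [111, 111, 111, 112, 112, 113],
    "target": ["Yes", "Maybe", "No", "No", "Maybe", "No"],
    "Amount": [123, 456, 789, 1234, 5678, 25],
})

# Lower number = higher priority. Labels missing from this mapping
# would become NaN after map().
order = {"Yes": 0, "Maybe": 1, "No": 2}

# The helper column lives only on the temporary copy made by assign().
keyed = df.assign(tar=df["target"].map(order))

# Index label of the best-ranked row within each Unique_id.
winners = keyed.groupby("Unique_id")["tar"].idxmin()

# Look the rows up in the original frame, so `tar` never appears.
result = df.loc[winners]
print(result)
```

This keeps target as plain strings, which may be convenient when the column should not be converted to a categorical dtype.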
