How to remove a specific repeated line in pandas dataframe?

Question

In this pandas dataframe:

df =

pos    index  data
21      36    a,b,c
21      36    a,b,c
23      36    c,d,e
25      36    f,g,h
27      36    g,h,k
29      39    a,b,c
29      39    a,b,c
31      39    .
35      39    c,k
36      41    g,h
38      41    k,l
39      41    j,k
39      41    j,k

I want to remove repeated rows only when they are in the same index group and occur in the head region of the sub-frame.

So, I did:

 df_grouped = df.groupby(['index'], as_index=True)

Now,

 for i, sub_frame in df_grouped:
     sub_frame.apply(lambda g: ...)  # remove one duplicate line in the head region if pos value is a repeat

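I want to apply this method because some pos values will be repeated in the tail region, and those should not be removed.

Any suggestions?

Expected output: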

 pos    index  data
removed
21      36    a,b,c
23      36    c,d,e
25      36    f,g,h
27      36    g,h,k
removed
29      39    a,b,c
31      39    .
35      39    c,k
36      41    g,h
38      41    k,l
39      41    j,k
39      41    j,k

Answer 1

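If it doesn't have to be done in a single apply statement, then this code will remove duplicates in the head region only (here, the first two rows of each group):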

import pandas as pd

# Sample data from the question (the grouping column is named 'idx' here)
data = {'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
        'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
        'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c', '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k']
}

df = pd.DataFrame(data)

accum = []
for i, sub_frame in df.groupby('idx'):
    accum.append(pd.concat([sub_frame.iloc[:2].drop_duplicates(), sub_frame.iloc[2:]]))

df2 = pd.concat(accum)

print(df2)

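EDIT2: The first version of the chained command I posted was wrong and only worked for the sample data. This version is a more general solution that removes the duplicate rows as the OP requested: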

df.drop(df.groupby('idx')         # group by the index column
          .head(2)                # select the first two rows
          .duplicated()           # create a Series with True for duplicate rows
          .to_frame(name='duped') # make the Series a dataframe
          .query('duped')         # select only the duplicate rows
          .index)                 # provide index of duplicated rows to drop
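
For anyone who needs a head region larger than two rows, the same idea can be sketched as a small helper built on a boolean mask. This is only a sketch under a few assumptions: the function name drop_head_duplicates and the head_size parameter are made up for illustration, a row counts as a duplicate only if every column matches an earlier head row, and the dataframe's index labels are unique (as with the default RangeIndex).

```python
import pandas as pd


def drop_head_duplicates(df, group_col, head_size=2):
    """Drop rows that repeat an earlier row within the first
    `head_size` rows of their group (hypothetical helper)."""
    # Position of each row inside its group: 0, 1, 2, ...
    in_head = df.groupby(group_col).cumcount() < head_size
    # Among head rows only, flag rows identical to an earlier head row.
    # The grouping column is part of the comparison, so rows from
    # different groups never count as duplicates of each other.
    dup_in_head = df[in_head].duplicated()
    # Index labels of the flagged rows; tail rows are never candidates.
    to_drop = dup_in_head[dup_in_head].index
    return df.drop(to_drop)


df = pd.DataFrame({
    'pos': [21, 21, 23, 25, 27, 29, 29, 31, 35, 36, 38, 39, 39],
    'idx': [36, 36, 36, 36, 36, 39, 39, 39, 39, 41, 41, 41, 41],
    'data': ['a,b,c', 'a,b,c', 'c,d,e', 'f,g,h', 'g,h,k', 'a,b,c', 'a,b,c',
             '.', 'c,k', 'g,h', 'k,l', 'j,k', 'j,k'],
})

result = drop_head_duplicates(df, 'idx')
print(result)
```

With the sample data this should drop the second 21 and the second 29 row while keeping both 39 rows, since those sit in the tail of their group. Note that df.drop returns a new dataframe, so the result has to be assigned to a name.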
