How to find duplicate values (not rows) in an entire pandas dataframe?

Consider this dataframe.

import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})
  one two three
0   a   e     a
1   b   f     j
2   c   g     h
3   d   h     a

How can I output all duplicate values together with their respective indexes? The output should look something like this.

  id value
0  2     h
1  3     h
2  0     a
3  0     a
4  3     a

Try .melt + .duplicated:

x = df.reset_index().melt("index")
print(
    x.loc[x.duplicated(["value"], keep=False), ["index", "value"]]
    .reset_index(drop=True)
    .rename(columns={"index": "id"})
)

Prints:

   id value
0   0     a
1   3     h
2   0     a
3   2     h
4   3     a
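
If it also matters which column each duplicate came from, a small variation on the same idea keeps the melted column name. This is only a sketch: the names long_df and dupes and the "column" label are illustrative choices of mine, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})

# Reshape to long form: one row per (index, column, value) cell.
# var_name="column" is an illustrative label for the source column.
long_df = df.reset_index().melt("index", var_name="column")

# keep=False flags every occurrence of a repeated value,
# not only the second and later ones.
dupes = (
    long_df.loc[long_df.duplicated(["value"], keep=False)]
    .rename(columns={"index": "id"})
    .reset_index(drop=True)
)
print(dupes)
```

The rows come out in melt order (column by column), so sort by "id" afterwards if row order matters.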

We can stack the DataFrame, use Series.loc to keep only the values flagged by Series.duplicated, then use Series.reset_index to convert the result back to a DataFrame:

new_df = (
    df.stack()  # Convert to Long Form
        .droplevel(-1).rename_axis('id')  # Handle MultiIndex
        .loc[lambda x: x.duplicated(keep=False)]  # Filter Values
        .reset_index(name='value')  # Make Series a DataFrame
)

new_df:

   id value
0   0     a
1   0     a
2   2     h
3   3     h
4   3     a
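
To see what each step of that chain does, it can help to split it into intermediate variables. The sketch below does the same thing as the chained version; the names stacked, flat and mask are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})

# stack() yields a Series indexed by (row label, column name).
stacked = df.stack()

# Drop the column-name level so only the row label remains, and name it 'id'.
flat = stacked.droplevel(-1).rename_axis('id')

# Boolean mask: True for every value that appears more than once anywhere.
mask = flat.duplicated(keep=False)

# Keep the flagged values and turn the Series back into a DataFrame.
new_df = flat.loc[mask].reset_index(name='value')
print(new_df)
```

Because stack() walks the frame row by row, the result is already ordered by id without an explicit sort.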

Here I use melt to reshape and duplicated(keep=False) to select the duplicates:

(df.rename_axis('id')
   .reset_index()
   .melt(id_vars='id')
   .loc[lambda d: d['value'].duplicated(keep=False), ['id','value']]
   .sort_values(by='id')
   .reset_index(drop=True)
 )

Output:

    id value
0   0     a
1   0     a
2   2     h
3   3     h
4   3     a
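
For reuse, the same chain can be wrapped in a small helper. This is a minimal sketch, not a definitive implementation: find_duplicate_values is a hypothetical name, it assumes the frame has no existing column called 'id' or 'value', and it passes kind='stable' to sort_values so that rows sharing an id keep their melt order.

```python
import pandas as pd


def find_duplicate_values(frame: pd.DataFrame) -> pd.DataFrame:
    """Return every cell value that occurs more than once, with its row label.

    Hypothetical helper; assumes no column is already named 'id' or 'value'.
    """
    return (
        frame.rename_axis('id')
        .reset_index()
        .melt(id_vars='id')
        .loc[lambda d: d['value'].duplicated(keep=False), ['id', 'value']]
        .sort_values(by='id', kind='stable')  # stable: ties keep melt order
        .reset_index(drop=True)
    )


df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})
result = find_duplicate_values(df)
print(result)
```

A frame with no repeated values gives back an empty DataFrame with the same two columns.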