How to find duplicate values (not rows) in an entire pandas dataframe?
Consider this dataframe.
import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})
  one two three
0   a   e     a
1   b   f     j
2   c   g     h
3   d   h     a
How can I output every duplicated value together with its row index? The output should look something like this.
   id value
0   2     h
1   3     h
2   0     a
3   0     a
4   3     a
Try .melt + .duplicated:
x = df.reset_index().melt("index")
print(
    x.loc[x.duplicated(["value"], keep=False), ["index", "value"]]
    .reset_index(drop=True)
    .rename(columns={"index": "id"})
)
Prints:
   id value
0   0     a
1   3     h
2   0     a
3   2     h
4   3     a
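To see why this works, it can help to split the chain into its intermediate steps. The following is only a sketch of the same idea; the names `long_df`, `mask` and `result` are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})

# Long form: one row per cell, with the original row label kept in "index"
long_df = df.reset_index().melt("index")

# keep=False flags every occurrence of a repeated value, including the first
mask = long_df.duplicated(["value"], keep=False)

result = (
    long_df.loc[mask, ["index", "value"]]
    .rename(columns={"index": "id"})
    .reset_index(drop=True)
)
print(result)
```

Because melt walks the frame column by column, the rows come out in column order rather than sorted by `id`, which matches the printed result above.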
We can stack the DataFrame, use Series.loc to keep only the rows where the value is flagged by Series.duplicated, then call Series.reset_index to convert the result to a DataFrame:
new_df = (
    df.stack()  # Convert to Long Form
    .droplevel(-1).rename_axis('id')  # Handle MultiIndex
    .loc[lambda x: x.duplicated(keep=False)]  # Filter Values
    .reset_index(name='value')  # Make Series a DataFrame
)
new_df:
   id value
0   0     a
1   0     a
2   2     h
3   3     h
4   3     a
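If this is needed more than once, the stack-based chain could be wrapped in a small function. This is a minimal sketch; `duplicate_values` is a hypothetical helper name, not a pandas function, and it assumes the row labels are the identifiers you want in the `id` column.

```python
import pandas as pd


def duplicate_values(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: list every cell whose value occurs more than once."""
    stacked = frame.stack()              # Series indexed by (row label, column name)
    stacked = stacked.droplevel(-1)      # drop the column-name level
    stacked = stacked.rename_axis('id')  # name the remaining row-label level
    dupes = stacked[stacked.duplicated(keep=False)]
    return dupes.reset_index(name='value')


df = pd.DataFrame(data={'one': list('abcd'),
                        'two': list('efgh'),
                        'three': list('ajha')})
print(duplicate_values(df))
```

Since stack walks the frame row by row, the result is already ordered by `id` without an explicit sort.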
Here I use melt to reshape and duplicated(keep=False) to select the duplicates:
(df.rename_axis('id')
   .reset_index()
   .melt(id_vars='id')
   .loc[lambda d: d['value'].duplicated(keep=False), ['id', 'value']]
   .sort_values(by='id')
   .reset_index(drop=True)
)
Output:
   id value
0   0     a
1   0     a
2   2     h
3   3     h
4   3     a
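All of these approaches depend on `keep=False`. As a rough illustration of what that argument changes, the sketch below applies `Series.duplicated` to the twelve cell values in the order melt produces them; the variable names are illustrative only.

```python
import pandas as pd

# Cell values in column order: 'one', then 'two', then 'three'
values = pd.Series(list('abcdefghajha'))

# Default keep='first': the first occurrence of each value is not flagged
later_only = values.duplicated()

# keep=False: every occurrence of a repeated value is flagged
all_occurrences = values.duplicated(keep=False)

print(values[later_only].index.tolist())       # [8, 10, 11]
print(values[all_occurrences].index.tolist())  # [0, 7, 8, 10, 11]
```

With the default, the first `a` and the first `h` would be missing from the result, so only three of the five expected rows would be returned.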