How to find duplicated rows of data and output them

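My code currently finds the duplicates, but the result does not show the line number, name, and number, and the output is incorrect (see the expected output below).

This happens because .duplicated returns a boolean Series (True/False), and you are saving that Series directly.

Instead, you should use it to subset the data, like this: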

import pandas as pd


df_state = pd.DataFrame(
    [["3 Liu Yu,876"],
     ["4 Koh chong,123"],
     ["3 Liu Yu,876"]])

# split each string on spaces into separate columns
df_state = df_state[0].str.split(" ", expand=True)
print(df_state, "\n")

duplicated = df_state.duplicated()  # just a boolean Series
print(duplicated, "\n")

# subset with the mask; this is the frame to save with .to_csv
print(df_state[duplicated], "\n")

# as Anders Källmar points out, you can also mark every occurrence:
all_duplicated = df_state.duplicated(keep=False)
print(df_state[all_duplicated])


Output:

   0    1          2
0  3  Liu     Yu,876
1  4  Koh  chong,123
2  3  Liu     Yu,876 

0    False
1    False
2     True
dtype: bool 

   0    1       2
2  3  Liu  Yu,876 

   0    1       2
0  3  Liu  Yu,876
2  3  Liu  Yu,876
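
To make the difference between the two calls above more concrete, here is a small sketch comparing the three values the keep parameter accepts. The sample names are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical sample data: "Liu Yu" appears three times
names = pd.Series(["Liu Yu", "Koh chong", "Liu Yu", "Liu Yu"])

# keep="first" (the default): every occurrence except the first is marked
first = names.duplicated()

# keep="last": every occurrence except the last is marked
last = names.duplicated(keep="last")

# keep=False: every occurrence of a repeated value is marked
every = names.duplicated(keep=False)

# Show the three masks side by side
print(pd.DataFrame({"name": names, "first": first, "last": last, "all": every}))
```

If you want the report to list every line on which a repeated entry occurs, keep=False is probably the one you need; the default only flags the later copies.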

Use df.duplicated with keep=False to get a boolean mask of the duplicated rows, then extract those rows:

# split name / number from your csv file
df = pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
       .str.split('\t', expand=True)

# increment index to match line number
df.index += 1

# keep duplicate entries
out = df[df[0].duplicated(keep=False)]

# export to duplicated_data.csv
out.to_csv('duplicated_data.csv', header=False)

Contents of the output file:

15,ANDREW ZHAO CHONG,83091746
19,ANDREW ZHAO CHONG,83091746
26,ANDREW ZHAO CHONG,83091746
48,ANDREW ZHAO CHONG,83091746
53,KOH KANG RI,89943392
56,KOH KANG RI,89943392
63,ENOS ZHAO KANG SONG,80746554
66,ENOS ZHAO KANG SONG,80746554
80,ENOS ZHAO KANG SONG,80746554

One-liner version:

pd.read_csv('names_dup2.csv', quoting=1, header=None)[0] \
  .str.split('\t', expand=True) \
  .assign(index=lambda x: x.index+1) \
  .set_index('index') \
  [lambda x: x[0].duplicated(keep=False)] \
  .to_csv('duplicated_data.csv', header=False)
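
One detail worth checking: the code above looks for duplicates in column 0 (the name) only, while calling duplicated on the whole DataFrame compares every column. The following sketch, using made-up rows and hypothetical column names, shows how the two can differ when a name repeats with a different number.

```python
import pandas as pd

# Hypothetical data: the same name appears twice, but with different numbers
df = pd.DataFrame(
    {"name": ["Liu Yu", "Koh chong", "Liu Yu"],
     "number": ["876", "123", "999"]}
)

# Compare entire rows: no two rows are identical, so nothing is marked
whole_row = df.duplicated(keep=False)

# Compare only the "name" column: both "Liu Yu" rows are marked
by_name = df.duplicated(subset=["name"], keep=False)

print(df[whole_row])  # empty frame
print(df[by_name])    # rows 0 and 2
```

Which of the two you want depends on whether a repeated name with a different number should count as a duplicate in your data.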