How can I compare two different CSV files using Python?

I want to write code that compares two CSV files.

    import pandas as pd
    import numpy as np

    # Raw strings keep "\f" in the Windows path from being read as a form feed
    df = pd.read_csv(r"E:\Dupfile.csv")
    df1 = pd.read_csv(r"E:\file.csv")

    df['Correct'] = None

    def Result(x):
        if ....:  # the comparison I don't know how to write
            return int(1)
        else:
            return int(0)

    df.loc[:, "Correct"] = df.apply(Result, axis=1)

    print(df["Correct"])

    df.to_csv(r"E:\file.csv")
    print(df.head(20))

For example, file.csv looks like this:

       round     date  first  second  third  fourth  fifth  sixth
    0      1  2021.04      1      14     15      24     40     41
    1      2  2021.04      2       9     10      16     35     37
    2      3  2021.04      4      15     24      35     36     40
    3      4  2021.03     10      11     20      21     25     41
    4      5  2021.03      4       9     23      26     29     33
    5      6  2021.03      1       9     26      28     30     41

Dupfile.csv looks like this:

       round     date  first  second  third  fourth  fifth  sixth
    0      1  2021.04      1      14     15      24     40     41
    0      1  2021.04      1       2      3       4      5      6
    1      2  2021.04      2       9     10      16     35     37
    1      2  2021.04      1       2      3       4      5      6
    2      3  2021.04      4      15     24      35     36     40
    2      3  2021.04      1       2      3       4      5      6
    3      4  2021.03     10      11     20      21     25     41
    3      4  2021.03      1       2      3       4      5      6
    4      5  2021.03      4       9     23      26     29     33
    4      5  2021.03      1       2      3       4      5      6

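Dupfile.csv has a second row for each round, with the same round number but different values.

For each round in Dupfile.csv, I want to look up the same round in file.csv. If the first through sixth values are all equal, put 1 in a new "Correct" column in Dupfile; if they are not, put 0 in the "Correct" column.

I am trying to compare two different CSV files, but I don't know how to do it. Can someone help me?

My expected output: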

       round     date  first  second  third  fourth  fifth  sixth  Correct
    0      1  2021.04      1      14     15      24     40     41        1
    0      1  2021.04      1       2      3       4      5      6        0
    1      2  2021.04      2       9     10      16     35     37        1
    1      2  2021.04      1       2      3       4      5      6        0
    2      3  2021.04      4      15     24      35     36     40        1
    2      3  2021.04      1       2      3       4      5      6        0
    3      4  2021.03     10      11     20      21     25     41        1
    3      4  2021.03      1       2      3       4      5      6        0
    4      5  2021.03      4       9     23      26     29     33        1
    4      5  2021.03      1       2      3       4      5      6        0

You can do this in pandas with df.merge.

See the example:

    import pandas as pd


    # file.csv
    file_df = pd.DataFrame(
        columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
        data=[
            ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
            ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
            ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
            ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
            ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
            ("6", "2021.03", "1", "9", "26", "28", "30", "41"),
        ],
    )

    # Add a control column (we already know these are the right values)
    file_df["correct"] = 1

    # Dupfile.csv
    dup_file_df = pd.DataFrame(
        columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
        data=[
            ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
            ("1", "2021.04", "1", "2", "3", "4", "5", "6"),
            ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
            ("2", "2021.04", "1", "2", "3", "4", "5", "6"),
            ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
            ("3", "2021.04", "1", "2", "3", "4", "5", "6"),
            ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
            ("4", "2021.03", "1", "2", "3", "4", "5", "6"),
            ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
            ("5", "2021.03", "1", "2", "3", "4", "5", "6"),
        ],
    )

    # Column names to match on (round, date and first to sixth)
    cols = list(dup_file_df.columns)

    # Merge the two DataFrames on every one of those columns.
    # "correct" is populated only where all of them match.
    # how="left" keeps exactly the Dupfile rows; how="outer" would also add
    # round 6 from file.csv, which is not in the expected output.
    merged = dup_file_df.merge(file_df, how="left", on=cols)

    # Put 0 where "correct" is NaN and cast to integer (the NaNs made it float)
    merged["correct"] = merged["correct"].fillna(0).astype(int)

    # Done!
    print(merged)
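
As a variant of the same idea, merge can report by itself whether a row matched, so the control column does not have to be added by hand. The sketch below assumes pandas' `indicator=True` option, which adds a `_merge` column; the small three-column frames are made-up stand-ins for the two files, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for file.csv and Dupfile.csv (fewer columns for brevity)
file_df = pd.DataFrame({"round": [1, 2], "first": [1, 2], "second": [14, 9]})
dup_file_df = pd.DataFrame(
    {"round": [1, 1, 2, 2], "first": [1, 1, 2, 1], "second": [14, 2, 9, 2]}
)

cols = list(dup_file_df.columns)

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for rows found only in dup_file_df
merged = dup_file_df.merge(file_df, how="left", on=cols, indicator=True)

# Turn the indicator into the 1/0 flag and drop the helper column
merged["correct"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")

print(merged)
```

This skips the fillna step, because the flag is computed from the indicator rather than from missing values.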

If you are using pandas, it is best to rely on the methods it provides. I suggest using merge to compare the two DataFrames. I have rewritten your code as follows.

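    import pandas as pd

    # Raw strings keep "\f" in the Windows path from being read as a form feed
    df = pd.read_csv(r"E:\Dupfile.csv")
    df1 = pd.read_csv(r"E:\file.csv")

    # Every row of file.csv is a known-correct row
    df1['Correct'] = 1

    # Left join: keep every Dupfile row; unmatched rows get NaN in 'Correct'
    df = df.merge(
        df1,
        how='left',
        on=['round', 'date', 'first', 'second',
            'third', 'fourth', 'fifth', 'sixth'])

    # Replace NaN with 0 and cast back to int (the NaNs made the column float)
    df['Correct'] = df['Correct'].fillna(0).astype(int)

    print(df)

    print(df['Correct'])

    # Note: as in the question, this overwrites file.csv
    df.to_csv(r"E:\file.csv")
    print(df.head(20))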

How does it work?

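The merge method matches the rows of df and df1 on the columns named in the on list. When you pass how='left', no rows from the left side of the merge (df) are dropped (a left join). In other words, the Correct column we created on the file.csv data is attached to the Dupfile.csv data, and rows with no match get NaN. The fillna(0) method then replaces those NaN values with 0.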
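
To make the NaN step concrete, here is a minimal sketch. The two-column frames `left` and `right` are made-up stand-ins for Dupfile.csv and file.csv, not the real data; the intermediate lists are captured only to show the column before and after the fill.

```python
import pandas as pd

# Hypothetical stand-ins: "left" plays Dupfile.csv, "right" plays file.csv
left = pd.DataFrame({"round": [1, 1], "first": [1, 9]})
right = pd.DataFrame({"round": [1], "first": [1]})
right["Correct"] = 1

# Left join: both rows of "left" survive; the unmatched one has no "Correct" value
joined = left.merge(right, how="left", on=["round", "first"])
before = joined["Correct"].tolist()  # matched row is 1.0, unmatched row is NaN

# Fill the gaps with 0 and cast back to int
joined["Correct"] = joined["Correct"].fillna(0).astype(int)
after = joined["Correct"].tolist()

print(before)
print(after)
```

The column is float right after the merge because NaN cannot be stored in a plain integer column, which is why the cast back to int comes after the fill.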

pandas.DataFrame.merge API reference