如何使用 python 比较两个不同的 csv 文件？

Question

我想制作比较两个csv文件的代码！

import pandas as pd
import numpy as np

    df = pd.read_csv("E:\Dupfile.csv")
    df1 = pd.read_csv("E:\file.csv")
    
    df['Correct'] = None
    
    def Result(x):
       if ....:
         return int(1)
       else:
         return int(0)
    
    
    df.loc[:,"Correct"]=df.apply(Result,axis=1)
    
    print(df["Correct"])
    
    df.to_csv("E:\file.csv")
    print(df.head(20))

例如，file.csv格式如下：

     round    date  first  second  third  fourth  fifth  sixth  
0     1  2021.04      1      14     15      24     40     41     
1     2  2021.04      2       9     10      16     35     37      
2     3  2021.04      4      15     24      35     36     40      
3     4  2021.03     10      11     20      21     25     41     
4     5  2021.03      4       9     23      26     29     33     
5     6  2021.03      1       9     26      28     30     41

Dupfile.csv 如下所示：

    round    date  first  second  third  fourth  fifth  sixth  
0     1  2021.04      1      14     15      24     40     41  
0     1  2021.04      1       2      3       4      5      6    
1     2  2021.04      2       9     10      16     35     37   
1     2  2021.04      1       2      3       4      5      6      
2     3  2021.04      4      15     24      35     36     40    
2     3  2021.04      1       2      3       4      5      6     
3     4  2021.03     10      11     20      21     25     41  
3     4  2021.03      1       2      3       4      5      6     
4     5  2021.03      4       9     23      26     29     33  
4     5  2021.03      1       2      3       4      5      6

它还有一个相同的回合，但价值不同。

用 Dupfile 的 round 检查文件的 round 值，如果第一个到第六个值相等，则在 Dupfile 中创建另一个“正确”列并输入 1。如果不正确，将 0 放入“正确”列。

我试图比较两个不同的 csv 文件，但我不知道该怎么做。有人可以帮助我吗？

我的期望答案：

    round    date  first  second  third  fourth  fifth  sixth Correct
0     1  2021.04      1      14     15      24     40     41    1
0     1  2021.04      1       2      3       4      5      6    0
1     2  2021.04      2       9     10      16     35     37    1
1     2  2021.04      1       2      3       4      5      6    0  
2     3  2021.04      4      15     24      35     36     40    1
2     3  2021.04      1       2      3       4      5      6    0 
3     4  2021.03     10      11     20      21     25     41    1
3     4  2021.03      1       2      3       4      5      6    0 
4     5  2021.03      4       9     23      26     29     33    1
4     5  2021.03      1       2      3       4      5      6    0

Answer 1

你可以用 pandas 使用 df.merge.

查看示例：

import pandas as pd


# file.csv
file_df = pd.DataFrame(
    columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
    data=[
        ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
        ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
        ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
        ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
        ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
        ("6", "2021.03", "1", "9", "26", "28", "30", "41"),
    ],
)

# adding control column (we already know that those are the right values)
file_df["correct"] = 1

# Dupfile.csv
dup_file_df = pd.DataFrame(
    columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
    data=[
        ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
        ("1", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
        ("2", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
        ("3", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
        ("4", "2021.03", "1", "2", "3", "4", "5", "6"),
        ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
        ("5", "2021.03", "1", "2", "3", "4", "5", "6"),
    ],
)

# We extract the column names to use in the merging process
cols = [x for x in dup_file_df.columns]

# We merge the 2 dataframes.
# The data frames are to match on every column (round, date and first to sixth). 
# The "correct" column will be populated only if all the columns are matching
merged = dup_file_df.merge(file_df, how="outer", left_on=cols, right_on=cols)

# We put "0" where correct is None and cast to integer (it was float)
merged["correct"] = merged["correct"].fillna(0).astype(int)

# Done!
print(merged)

Answer 2

如果使用pandas模块，最好获取模块中提供的方法。我建议您尝试使用 merge 来比较 2 个不同的数据帧。我重写你的代码如下。

import pandas as pd

df = pd.read_csv("E:\Dupfile.csv")
df1 = pd.read_csv("E:\file.csv")

df1['Correct'] = 1

df = df.merge(
        df1,
        how='left',
        on=['round',
            'date',
            'first',
            'second',
            'third',
            'fourth',
            'fifth',
            'sixth']).fillna(0)

print(df)

print(df['Correct'])

df.to_csv("E:\file.csv")
print(df.head(20))

它是如何工作的？

merge 方法尝试将 df 和 df1 中的列与 on 数组中存在的相同名称相匹配。当您 select left for how 参数时，合并 (df) 左侧的任何值都不会被删除（Left Join）。换句话说，我们在 file.csv 中创建的 correct 列附加到 Dupfil.csv 数据，并且不匹配被分配为 nan 值。 fillna(0) 方法帮助我们将 nan 值替换为 0.

pandas.DataFrame.merge API reference

如何使用 python 比较两个不同的 csv 文件？

How can I compare two different csv file using python?

python

csv

axis

compare

pandas

它是如何工作的？