Trying to compare two csv files and write differences as output

Question

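I'm working on a script that takes the difference between 2 csv files and produces a new csv file as output, but only when the same 2 rows (by row number) in the two input files contain different data. For example, row 3 in file 1 has "mike", "basketball player" and row 3 in file 2 has "mike", "baseball player". The output csv should grab these rows and write them to a csv. It works, but there are some issues (I know this question has been asked several times before, but others have done it differently from me, and since I'm fairly new to coding I don't really understand their code).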

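The output in the new csv file has each letter of the output in its own cell (see image below). I believe it has something to do with the delimiter/quotechar/quoting on line 37. I want the values in their own cells without any full stops, multiple spaces, commas, or "|".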

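The other issue is that it takes a long time to run. I'm working with datasets of up to 50,000 rows, and it can take over an hour to finish. Why is this, and what advice would there be to speed it up? Maybe put something outside of the for loops? I did try the difflib method earlier on, but I was only able to print the entire "input_file1" and could not compare that file with the other one.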

# aim of script is to compare csv files and output difference as a new csv

# import necessary libraries
import csv

# File1 = open(raw_input("path:"),"r") #filename, mode
# File2 = open(raw_input("path:"),"r") #filename, mode

# selects the 2 input files to be compared
input_file1 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book1.csv"
input_file2 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book2.csv"
# creates the blank output csv file
output_path = "G:/savestuffhereqwerty/electorate_meshblocks/outputs/output2.csv"
a = open(input_file1, "r")
output_file = open(output_path,"w")
output_file.close()
count = 0

with open(input_file1) as fp1:


    for row_number1, row_value1 in enumerate(fp1):
        if row_number1 == count:
            print "got to 1st point"
            value1 = row_value1

            with open(input_file2) as fp2:
                for row_number2, row_value2 in enumerate(fp2):
                    if row_number2 == count:
                        print "got to 2nd point"
                        value2 = row_value2

                        if value1 == value2:
                            print value1, value2
                        else:
                            print value1, value2
                            with open(output_path, 'wb') as f:
                                writer = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                                # testing to see if the code writes text to the csv
                                writer.writerow(["test1"])
                                writer.writerow(["test2", "test3", "test4"])
                                writer.writerows([value1, value2])
                                print "code reached writing stage"
        count += 1
        print count
print "done"
# replace(",",".")

Answer 1

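Since you want to compare the two files line by line, you should not loop through the second file for each row of the first file. That nested loop re-reads the second file from the start for every row of the first, so the work grows with the square of the row count. You can simply zip two csv readers and filter the rows: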

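import csv

input_file1 = "foo"
input_file2 = "bar"
output_path = "baz"

# On Python 3, pass newline='' to each open(); on Python 2, use 'rb'/'wb' modes instead.
with open(input_file1) as fin1, open(input_file2) as fin2:
    read1 = csv.reader(fin1)
    read2 = csv.reader(fin2)
    # Pair rows by position and keep the first file's row wherever the two differ.
    diff_rows = (row1 for row1, row2 in zip(read1, read2) if row1 != row2)
    with open(output_path, 'w') as fout:
        writer = csv.writer(fout)
        # Each row is a list of fields, so every field lands in its own cell.
        writer.writerows(diff_rows)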
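
As a side note on the one-letter-per-cell output in the question: the csv writer treats each row as a sequence of fields, and a plain string is a sequence of characters, so passing raw lines to writerow/writerows splits them character by character. A minimal sketch of the difference, assuming Python 3 and using in-memory buffers plus a made-up sample line purely for illustration:

```python
import csv
import io

# Hypothetical raw line, as it would be read from a file without csv.reader.
line = "mike,basketball player"

# Passing the raw string: the writer iterates over it, one character per field.
buf_chars = io.StringIO()
csv.writer(buf_chars).writerow(line)
per_character = buf_chars.getvalue()

# Passing a list of fields (what csv.reader yields): one field per cell.
fields = next(csv.reader([line]))
buf_fields = io.StringIO()
csv.writer(buf_fields).writerow(fields)
per_field = buf_fields.getvalue()

print(repr(per_character))
print(repr(per_field))
```

Reading both inputs through csv.reader, as in the code above, avoids the problem because every row is already a list of fields.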

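This solution assumes that both files have the same number of rows. zip stops at the end of the shorter file, so any extra rows in the longer file are silently ignored.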
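
If the files may differ in length, one possible variation is to pair the rows with itertools.zip_longest, which pads the shorter file instead of stopping early. The sketch below assumes Python 3 (on Python 2 the function is called izip_longest). The helper name write_row_differences and the output layout (row number, a "file1"/"file2" label, then the fields) are illustrative choices, not something taken from the question:

```python
import csv
from itertools import zip_longest  # Python 3 name; izip_longest on Python 2


def write_row_differences(path1, path2, output_path):
    """Write both versions of every row position where the files differ.

    Hypothetical helper: each output row starts with the 1-based row number
    and a "file1"/"file2" label. A row missing from the shorter file is
    treated as an empty row. Returns the number of differing positions.
    """
    differences = 0
    # newline="" lets the csv module handle line endings itself (Python 3).
    with open(path1, newline="") as fin1, \
         open(path2, newline="") as fin2, \
         open(output_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        # fillvalue=[] stands in for rows that only exist in the longer file.
        pairs = zip_longest(csv.reader(fin1), csv.reader(fin2), fillvalue=[])
        for row_number, (row1, row2) in enumerate(pairs, start=1):
            if row1 != row2:
                writer.writerow([row_number, "file1"] + row1)
                writer.writerow([row_number, "file2"] + row2)
                differences += 1
    return differences
```

Calling write_row_differences(input_file1, input_file2, output_path) still reads each file only once, so the run time grows linearly with the number of rows.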


Tags: python, csv, compare, difference