Trying to compare two csv files and write differences as output
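I am working on a script that takes the difference between 2 csv files and produces a new csv file as output, but only when the same 2 rows (by row number) in the two input files contain different data. For example, row 3 in file 1 has "mike", "basketball player" and row 3 in file 2 has "mike", "baseball player". The output csv should take these rows and write them to a csv. It works, but there are some issues (I know this question has been asked a few times before, but others have done it differently from me, and since I am quite new to programming I don't really understand their code).
The output in the new csv file has every letter of the output in its own cell (see the image below), and I believe it has something to do with the delimiter/quotechar/quoting on line 37. I want the values in their own cells without any full stops, multiple spaces, commas, or "|".
The other issue is that it takes a very long time to run. I am working with datasets of up to 50,000 rows, and it can take over an hour to finish running. Why is this, and what would speed it up? Perhaps moving something outside the for loop? I did try the difflib approach earlier, but I could only print the whole of "input_file1" and could not compare that file with the other one.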
# aim of script is to compare csv files and output difference as a new csv

# import necessary libraries
import csv

# File1 = open(raw_input("path:"),"r") #filename, mode
# File2 = open(raw_input("path:"),"r") #filename, mode

# selects the 2 input files to be compared
input_file1 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book1.csv"
input_file2 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book2.csv"

# creates the blank output csv file
output_path = "G:/savestuffhereqwerty/electorate_meshblocks/outputs/output2.csv"

a = open(input_file1, "r")
output_file = open(output_path,"w")
output_file.close()

count = 0

with open(input_file1) as fp1:
    for row_number1, row_value1 in enumerate(fp1):
        if row_number1 == count:
            print "got to 1st point"
            value1 = row_value1

            with open(input_file2) as fp2:
                for row_number2, row_value2 in enumerate(fp2):
                    if row_number2 == count:
                        print "got to 2nd point"
                        value2 = row_value2

                        if value1 == value2:
                            print value1, value2
                        else:
                            print value1, value2
                            with open(output_path, 'wb') as f:
                                writer = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                                # testing to see if the code writes text to the csv
                                writer.writerow(["test1"])
                                writer.writerow(["test2", "test3", "test4"])
                                writer.writerows([value1, value2])
                                print "code reached writing stage"

            count += 1
            print count
print "done"

# replace(",",".")
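Since you want to compare the two files line by line, you should not loop through the second file for every line in the first file. That nested loop reopens and rescans file 2 for each row of file 1, which is why the run time grows so quickly. You can simply zip the two csv readers and filter the rows: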
import csv

input_file1 = "foo"
input_file2 = "bar"
output_path = "baz"

with open(input_file1) as fin1:
    with open(input_file2) as fin2:
        read1 = csv.reader(fin1)
        read2 = csv.reader(fin2)
        # Pair up rows by position and keep file 1's row wherever the pair differs.
        diff_rows = (row1 for row1, row2 in zip(read1, read2) if row1 != row2)
        # Write inside the with blocks: diff_rows is lazy and needs both inputs open.
        with open(output_path, 'w') as fout:
            writer = csv.writer(fout)
            writer.writerows(diff_rows)
This solution assumes that both files have the same number of rows.
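If the files can differ in length, one possible variation is to pad the shorter reader with `itertools.zip_longest`, so that the extra rows are reported as differences instead of being silently dropped. This is only a sketch under some assumptions: it targets Python 3 (the question's code is Python 2, where the function is called `izip_longest`), the helper name `diff_rows` is made up for illustration, and in-memory `io.StringIO` objects with invented data stand in for the real files.

```python
import csv
import io
from itertools import zip_longest


def diff_rows(file1, file2):
    """Yield (row_number, row1, row2) for each position where the rows differ.

    Row numbers start at 1. A row missing from the shorter file is
    reported as None.
    """
    read1 = csv.reader(file1)
    read2 = csv.reader(file2)
    for number, (row1, row2) in enumerate(zip_longest(read1, read2), start=1):
        if row1 != row2:
            yield number, row1, row2


# In-memory stand-ins for the two input files (hypothetical data).
book1 = io.StringIO("name,job\nmike,basketball player\nanna,pilot\n")
book2 = io.StringIO("name,job\nmike,baseball player\n")

differences = list(diff_rows(book1, book2))
for number, row1, row2 in differences:
    print(number, row1, row2)
```

Yielding the row number along with both rows is a design choice made here so the output can show which line changed and what it changed from; the answer's version keeps only the row from file 1.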
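As for every letter ending up in its own cell: in the question's code, `writer.writerows([value1, value2])` is handed two plain strings (raw lines from the files). `csv.writer` treats each row as a sequence of fields, and a string is a sequence of characters, so each character becomes a field. The sketch below illustrates the effect, assuming Python 3 and an in-memory buffer; the helper `write_rows` and the sample line are hypothetical.

```python
import csv
import io


def write_rows(rows):
    """Write rows with csv.writer and return the resulting CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


line = "mike,basketball player\n"  # a raw line, as read straight from the file

# A string is iterated character by character, so each one becomes a field.
as_string = write_rows([line])

# Parsing the line with csv.reader first gives a list of fields instead.
parsed = next(csv.reader([line]))
as_fields = write_rows([parsed])

print(repr(as_string))
print(repr(as_fields))
```

Reading the inputs through `csv.reader`, as the zip-based code does, avoids the problem because each row is already a list of fields by the time it reaches the writer.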