python - Issue in processing files with big size
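Basically, I want to create a Python script for my daily tasks that compares two files of any size and generates two new files from them: one with the matching records and one with the non-matching records.
I have written the Python script below and found that it works properly for files with few records.
But when I run the same script on files with 200,000 and 500,000 records, the resulting files do not contain valid output.
So, can you check the script below and help identify the issue causing the wrong output?
Thanks in advance.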
from sys import argv

script, filePathName1, filePathName2 = argv

def FileDifference(filePathName1, filePathName2):
    fileObject1 = open(filePathName1,'r')
    fileObject2 = open(filePathName2,'r')
    newFilePathName1 = filePathName1 + ' - NonMatchingRecords.txt'
    newFilePathName2 = filePathName1 + ' - MatchingRecords.txt'
    newFileObject1 = open(newFilePathName1,'a')
    newFileObject2 = open(newFilePathName2,'a')
    file1 = fileObject1.readlines()
    file2 = fileObject2.readlines()
    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0,len(Differece)):
        newFileObject1.write(Differece[i])
    Matching = [ match for match in file1 if match in file2 ]
    for j in range(0,len(Matching)):
        newFileObject2.write(Matching[j])
    fileObject1.close()
    fileObject2.close()
    newFileObject1.close()
    newFileObject2.close()

FileDifference(filePathName1, filePathName2)
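Edit-1: Please note that the program above runs without any errors. It's just that the output is incorrect and the program takes much longer on large files.
I'll take a wild guess and assume that "no valid output" means: "runs forever and does nothing useful".
That would be logical because of your list comprehensions: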
Differece = [ diff for diff in file1 if diff not in file2 ]
for i in range(0,len(Differece)):
    newFileObject1.write(Differece[i])
Matching = [ match for match in file1 if match in file2 ]
for i in range(0,len(Matching)):
    newFileObject2.write(Matching[i])
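Each `in` test on a list is an O(n) lookup, and the comprehensions run one such lookup per line of file1. That is fine for a handful of lines, but it effectively never finishes if len(file1) == 100000 and file2 is about the same size: that is 100000*100000 iterations => 10**10 => forever.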
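To make the cost difference concrete, here is a small sketch that times the same membership test against a list and against a set. The data is synthetic and the helper name `count_matches` is made up for illustration. The size is kept small so the list version still finishes quickly, and the exact timings will vary by machine.

```python
import timeit

def count_matches(lines, lookup):
    """Count how many entries of `lines` are present in `lookup`."""
    return sum(1 for line in lines if line in lookup)

# Synthetic stand-ins for the two files (hypothetical record lines).
n = 2000
file1 = [f"record {i}\n" for i in range(n)]
file2 = [f"record {i}\n" for i in range(n // 2, n + n // 2)]

# List lookup: every `in` scans file2 from the start, so the work grows as n * n.
list_seconds = timeit.timeit(lambda: count_matches(file1, file2), number=1)

# Set lookup: every `in` is a hash lookup, so the work grows roughly as n.
set_seconds = timeit.timeit(lambda: count_matches(file1, set(file2)), number=1)

print(f"list lookup: {list_seconds:.4f}s, set lookup: {set_seconds:.4f}s")
```

Because the list version grows quadratically, going from 2,000 to 200,000 lines multiplies its work by roughly 10,000, while the set version grows only about 100 times.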
The fix is simple: create sets and use intersection and difference, which is much faster.
file1 = set(fileObject1.readlines())
file2 = set(fileObject2.readlines())
difference = file1 - file2
for i in difference:
    newFileObject1.write(i)
matching = file1 & file2
for i in matching:
    # write each matching line, not the whole set
    newFileObject2.write(i)
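Putting it together, a minimal sketch of the whole function might look like the following. It is a variation on the set approach rather than a drop-in copy: only the second file is turned into a set, so the lines of the first file keep their original order and duplicates. The name `split_records` is made up for illustration, and the output files are opened with "w" instead of the original "a" on the assumption that results from earlier runs should not pile up.

```python
def split_records(path1, path2):
    """Split the lines of path1 into matching / non-matching output files."""
    # Only the second file needs to be a set; it is used purely for lookups.
    with open(path2, "r") as f2:
        lookup = set(f2)

    non_matching_path = path1 + " - NonMatchingRecords.txt"
    matching_path = path1 + " - MatchingRecords.txt"

    # "w" truncates existing output; "a" would append to it on every run.
    with open(path1, "r") as f1, \
         open(non_matching_path, "w") as non_matching, \
         open(matching_path, "w") as matching:
        for line in f1:  # stream file1 line by line, preserving its order
            if line in lookup:
                matching.write(line)
            else:
                non_matching.write(line)

    return matching_path, non_matching_path

# Usage (hypothetical file names):
# split_records("file1.txt", "file2.txt")
```

One caveat: lines are compared including their line endings, so a final line without a trailing newline will not match the same record followed by a newline in the other file. If that matters for your data, strip the line endings before comparing.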