python - Issue in processing files with big size

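Basically, I want to create a Python script for my daily tasks that compares two files of any size and generates two new files: one with the matching records and one with the non-matching records.

I wrote the Python script below and found that it works fine for files with only a few records.

But when I run the same script on files with 200,000 and 500,000 records, the generated result files do not contain valid output.

So, could you check the script below and help identify what is causing the wrong output?

Thanks in advance.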

from sys import argv

script, filePathName1, filePathName2  = argv

def FileDifference(filePathName1, filePathName2):
    fileObject1 = open(filePathName1,'r')
    fileObject2 = open(filePathName2,'r')
    newFilePathName1 = filePathName1 + ' - NonMatchingRecords.txt'
    newFilePathName2 = filePathName1 + ' - MatchingRecords.txt'
    newFileObject1 = open(newFilePathName1,'a')
    newFileObject2 = open(newFilePathName2,'a')
    file1 = fileObject1.readlines()
    file2 = fileObject2.readlines()
    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0,len(Differece)):
        newFileObject1.write(Differece[i])

    Matching = [ match for match in file1 if match in file2 ]
    for j in range(0,len(Matching)):
        newFileObject2.write(Matching[j])
    fileObject1.close()
    fileObject2.close()
    newFileObject1.close()
    newFileObject2.close()

FileDifference(filePathName1, filePathName2)

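Edit-1: Please note that the program above runs without any errors. It is just that the output is not correct, and the program takes much longer to process large files.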

I'll go out on a limb and assume that "no valid output" means "runs forever and does nothing useful".

That would be logical, given your list comprehensions:

    Differece = [ diff for diff in file1 if diff not in file2 ]
    for i in range(0,len(Differece)):
        newFileObject1.write(Differece[i])

    Matching = [ match for match in file1 if match in file2 ]
    for i in range(0,len(Matching)):
        newFileObject2.write(Matching[i])

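Each `in` test performs an O(n) lookup on a list, which is fine for a small number of lines but never finishes if, say, len(file1) == 100000 and file2 is the same size. You then perform 100000*100000 iterations => 10**10 => forever.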
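To make the cost concrete, here is a minimal sketch that runs the same filter once against a list and once against a set. The function names `split_with_list` and `split_with_set` and the generated sample records are made up for illustration. The input size is kept small so the list version still finishes quickly.

```python
import time


def split_with_list(lines1, lines2):
    """Quadratic version: every `not in` scans the list lines2 from the start."""
    return [line for line in lines1 if line not in lines2]


def split_with_set(lines1, lines2):
    """Same result, but each membership test is a hash lookup (O(1) on average)."""
    lookup = set(lines2)
    return [line for line in lines1 if line not in lookup]


if __name__ == "__main__":
    n = 2000  # illustrative size; the list version slows down quadratically as n grows
    lines1 = ["record %d\n" % i for i in range(n)]
    lines2 = ["record %d\n" % i for i in range(n // 2, n + n // 2)]

    for func in (split_with_list, split_with_set):
        start = time.perf_counter()
        result = func(lines1, lines2)
        elapsed = time.perf_counter() - start
        print(func.__name__, len(result), "non-matching lines in", round(elapsed, 4), "s")
```

Both functions return the same list. Only the time spent on each membership test differs, and that gap widens quickly as the files grow.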

The fix is simple: create sets and use intersection and difference, which is much faster.

    file1 = set(fileObject1.readlines())
    file2 = set(fileObject2.readlines())
    difference = file1 - file2
    for i in difference:
        newFileObject1.write(i)

    matching = file1 & file2
    for i in matching:
        newFileObject2.write(i)
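One thing to keep in mind is that sets drop duplicate lines and do not preserve the original order. If that matters for your records, a possible variant (a sketch, not the only way to do it) is to turn only the second file into a set and stream through the first file line by line. The function name `file_difference` is hypothetical, and opening the output files with "w" instead of "a" is an assumption, made so that a rerun does not append to old results.

```python
def file_difference(path1, path2):
    """Split the lines of path1 into those absent from and those present in path2.

    Keeps the order and duplicates of path1. Only path2 is held in memory.
    """
    with open(path2, "r") as f2:
        lines2 = set(f2)  # one pass over the second file builds the lookup set

    non_matching_path = path1 + " - NonMatchingRecords.txt"
    matching_path = path1 + " - MatchingRecords.txt"

    # "w" rather than "a" (assumption): start from empty result files on each run
    with open(path1, "r") as f1, \
            open(non_matching_path, "w") as non_matching, \
            open(matching_path, "w") as matching:
        for line in f1:
            if line in lines2:
                matching.write(line)
            else:
                non_matching.write(line)

    return non_matching_path, matching_path
```

It can be called the same way as the original function, for example `file_difference(filePathName1, filePathName2)`. Note that lines are compared including their trailing newline, so a final line without one will not match the same record followed by a newline in the other file.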