Python: Compare two large files

Question

This is a follow-up to "Compare two large files", which was answered by phihag.

I want to display the number of lines that differ after comparing the two files, shown as a message once the program finishes.

My attempt:

with open(file2) as b:
  blines = set(b)
with open(file1) as a:
  with open(file3, 'w') as result:
    for line in a:
      if line not in blines:
        result.write(line)

lines_to_write = []
with open(file2) as b:
  blines = set(b)
with open(file1) as a:
  lines_to_write = [l for l in a if l not in blines]

print('count of lines are in difference:', len(lines_to_write))

Answer 1

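Edit: This answer assumes you want to compare corresponding lines in the two files. If that is not what you want, ignore this answer; I'll leave it here for future readers.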

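If you only want to count lines, avoid building a large list. File objects are memory-efficient iterators, and this task needs no more memory than it takes to look at two lines at a time.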

Demo (with two fake files):

>>> fake_file_1 = '''1
... 2
... 3'''.splitlines()
>>> 
>>> fake_file_2 = '''1
... 1
... 3
... 4'''.splitlines()

I assume the answer you want here is 2, because the second line differs and fake_file_2 has an extra fourth line.

>>> from itertools import zip_longest # izip_longest in Python2
>>> sum(1 for line1, line2 in zip_longest(fake_file_1, fake_file_2, fillvalue=float('nan')) 
...     if line1 != line2)
2

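zip_longest works like zip and yields pairs of corresponding lines from the two files. In addition, if one file is longer, the fill value float('nan') is inserted, which always compares unequal to anything (of course, you could use any other dummy value that can never equal a line, such as 0, but I like it this way).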
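A small check of the fill-value behaviour described above. This is only a sketch: the names `short`, `long_` and `_MISSING` are made up for illustration, and the `object()` sentinel is one possible alternative to NaN, not something the answer itself uses.

```python
from itertools import zip_longest

nan = float('nan')

# NaN compares unequal to everything, including itself,
# so a padded position always counts as a difference.
print(nan != nan)   # True
print(nan != '3')   # True

# Hypothetical alternative: a unique sentinel object never equals a str line either.
_MISSING = object()

short = ['1', '2']
long_ = ['1', '2', '3']

count_nan = sum(1 for a, b in zip_longest(short, long_, fillvalue=nan) if a != b)
count_obj = sum(1 for a, b in zip_longest(short, long_, fillvalue=_MISSING) if a != b)
print(count_nan, count_obj)   # 1 1
```

A dummy value such as 0 works for the same reason: lines read from a text file are strings, and a string never compares equal to an integer.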

Instead of the fake files, just use the handles of the files you actually opened.
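With real files, that could look roughly like the following. It is a minimal sketch: the function name `count_differing_lines` and its parameters are assumptions for illustration, and both files are assumed to be text files readable with the default encoding.

```python
from itertools import zip_longest


def count_differing_lines(path_a, path_b):
    """Count the line positions at which the two files differ."""
    missing = float('nan')  # fill value for the shorter file; never equal to a line
    with open(path_a) as a, open(path_b) as b:
        # Both file objects are consumed lazily, one pair of lines at a time.
        return sum(
            1
            for line_a, line_b in zip_longest(a, b, fillvalue=missing)
            if line_a != line_b
        )
```

Usage would be something like `print('count of lines are in difference:', count_differing_lines(file1, file2))`. Note that lines keep their trailing newline here, so a final line without a newline in one file will compare unequal to the same text followed by a newline in the other.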

Answer 2

If everything fits in memory, you can do the following with sets:

union = set(alines).union(blines)
intersection = set(alines).intersection(blines)
unique = union - intersection

Edit: Simpler (and faster) is:

set(alines).symmetric_difference(blines)
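Put together with the file reading, the set-based count might be sketched like this. The function name `count_unique_lines` is hypothetical, and both files are assumed to fit in memory.

```python
def count_unique_lines(path_a, path_b):
    """Count distinct lines that appear in exactly one of the two files."""
    with open(path_a) as a:
        alines = set(a)  # iterating a file yields its lines
    with open(path_b) as b:
        blines = set(b)
    return len(alines.symmetric_difference(blines))
```

This counts in both directions and ignores repeated lines, so the result can differ from the count in the question, which only looks at lines of file1 that are missing from file2.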

Answer 3

I propose a pandas-based solution.

import pandas as pd

1. Create two pandas DataFrames

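# header=None keeps the first line as data instead of using it as column names
df1 = pd.read_csv(filepath_1, header=None)
df2 = pd.read_csv(filepath_2, header=None)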

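2. If your lines contain any potential delimiters, join all columns back into one

# axis=1 joins across the columns of each row; the default axis=0 would join down each column
df1 = df1.astype(str).apply(''.join, axis=1)
df2 = df2.astype(str).apply(''.join, axis=1)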

3. Combine the two frames into one

frames = [df1, df2]
df_merged = pd.concat(frames)

4. Drop both copies of every duplicate

df_unique = df_merged.drop_duplicates(keep=False)

5. Count and print the result

print('count of lines are in difference:', len(df_unique))
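For comparison, the same "keep only lines that occur exactly once overall" logic can be sketched without pandas, using `collections.Counter` from the standard library. This swaps in a different technique rather than reproducing the pandas steps, and the function name `count_lines_occurring_once` is made up for illustration.

```python
from collections import Counter
from itertools import chain


def count_lines_occurring_once(path_a, path_b):
    """Count lines that occur exactly once across both files combined."""
    with open(path_a) as a, open(path_b) as b:
        # Strip the trailing newline so a missing final newline does not matter.
        counts = Counter(line.rstrip('\n') for line in chain(a, b))
    return sum(1 for n in counts.values() if n == 1)
```

As with `drop_duplicates(keep=False)`, a line that is repeated within a single file is dropped as well, even if it never appears in the other file.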
