Two files comparison

Question

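I have a very strange problem. I have three files, each containing a single column of numbers. I need to get only the values from the first file that do not appear in the second or the third file.

I tried Python, like this: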

for e in firstfile:
    if e not in secondfile:
        resultfile.append(e)
return resultfile

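And the same for the third file.

In the Linux shell I tried uniq, sort, diff, some awk scripts, and comm, as described here: Fast way of finding lines in one file that are not in another?

But every time, the only result I get has the same number of lines as the original first file. I don't understand it at all!

Maybe I'm missing something? Maybe it's the format? But I have checked it many times. Here are the files: http://dropmefiles.com/BaKGj

P.S. Later I thought there were no unique lines at all, but I checked manually and some numbers in the first file are unique.

P.P.S. The file format is as follows: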

380500100000 
380500100001 
380500100002 
380500100003 
380500100004    
380500100005 
380500100008 
380500100020 
380500100022 
380500100050    
380500100070 
380500100080

Answer 1

The simplest approach is to read each file into a set and then use Python's (very efficient) set operations to do the comparison.

file1 = set()
file2 = set()

for element in firstfile:
    file1.add(element)

for element in secondfile:
    file2.add(element)

unique = file1 - file2

Answer 2

What's going wrong

And same for third file

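If you really do the same thing for the third file, i.e. compare the original contents of the first file against the third file, you can introduce duplicates, and you also keep items that are missing from only one of the other two files. For example: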

file 1:
1
2
3

file 2:
1

file 3:
2

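After processing file 2, resultfile will contain 2 and 3. Then, after processing file 3, resultfile will contain 2 and 3 (from the first run) plus 1 and 3, i.e. 2, 3, 1, 3. However, the result should be just 3.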

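It isn't clear from your code whether you actually write the output of each run to a file resultfile. If you do, you should use that output as the input for the second and subsequent runs instead of processing the first file again.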
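To illustrate that chaining, here is a minimal sketch using the small example above. The helper name remove_seen and the in-memory lists are my own stand-ins for the files, not something from the question:

```python
def remove_seen(candidates, other_lines):
    """Return the stripped candidates that do not occur in other_lines, in order."""
    seen = {line.strip() for line in other_lines}
    return [c.strip() for c in candidates if c.strip() not in seen]


# Stand-ins for the three files in the example above.
first = ["1", "2", "3"]
second = ["1"]
third = ["2"]

result = remove_seen(first, second)  # pass 1: first file against second file
result = remove_seen(result, third)  # pass 2: previous result against third file
print(result)  # ['3']
```

The key point is that the second pass starts from the output of the first pass, not from the first file.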

A better fix

If you don't need to preserve the order of the lines from the first file, you can use set.difference() like this:

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(f1).difference(f2, f3)

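Note that this includes any whitespace (including newlines) present in the files. If you want to ignore leading and trailing whitespace on each line: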

from itertools import chain

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(map(str.strip, f1)).difference(map(str.strip, chain(f2, f3)))

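The above assumes Python 3. If you are using Python 2, you can optionally import itertools.imap and use it in place of map() for efficiency.

Or you may prefer to treat the data as numbers (I assume float here, but you could use int instead):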

from itertools import chain

with open('file1') as f1, open('file2') as f2, open('file3') as f3:
    unique_f1 = set(map(float, f1)).difference(map(float, chain(f2, f3)))
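If you do need to keep the order of the first file, something along these lines might work. This is only a sketch: the function name unique_in_order is hypothetical, io.StringIO objects stand in for the real files, and I am assuming that blank lines should be skipped and that repeated values in the first file should be reported only once.

```python
import io


def unique_in_order(first, *others):
    """Yield stripped lines of `first` that appear in none of `others`."""
    excluded = set()
    for lines in others:
        excluded.update(line.strip() for line in lines)

    emitted = set()  # values already yielded, so each one is reported once
    for line in first:
        value = line.strip()
        if value and value not in excluded and value not in emitted:
            emitted.add(value)
            yield value


# In-memory stand-ins for the three files (note the stray space and CRLF endings).
f1 = io.StringIO("380500100000 \n380500100001\n380500100002\n380500100001\n")
f2 = io.StringIO("380500100000\r\n")
f3 = io.StringIO("380500100002\r\n")

print(list(unique_in_order(f1, f2, f3)))  # ['380500100001']
```

With real files you would pass the open file objects instead of the StringIO objects.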

Answer 3

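The problem may be that first.csv is plain ASCII text, while second.csv and third.csv are ASCII text with CRLF line terminators. I suggest converting them all to the same format (plain ASCII text will probably work best).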

$ file first.csv
first.csv: ASCII text 

$ file second.csv
second.csv: ASCII text, with CRLF line terminators

$ file third.csv
third.csv: ASCII text, with CRLF line terminators
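To show why the line terminators matter, here is a small sketch comparing the same number as raw bytes with LF and with CRLF endings. The helper normalize is a name I made up for illustration. Tools such as comm compare the raw lines, so the trailing carriage return makes otherwise identical lines differ. (Python 3 opened in default text mode translates CRLF to LF on reading, so there the stray trailing spaces are the more likely culprit.)

```python
def normalize(raw_line: bytes) -> str:
    """Decode one raw line and drop trailing whitespace (spaces, CR, LF)."""
    return raw_line.decode("ascii").rstrip()


lf = b"380500100000\n"      # line as stored in first.csv
crlf = b"380500100000\r\n"  # same number as stored in second.csv

print(lf == crlf)                        # False: the raw lines differ
print(normalize(lf) == normalize(crlf))  # True: equal once endings are stripped
```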
