Python script to remove duplicate lines from multiple files

Question

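I have a large number of .txt files, each containing a list of URLs. Every file contains duplicate URLs, but there are no duplicates between files. I want to remove the duplicate URLs from each file.

I wrote a script that works correctly on a single file. Now I want to run it against a large number of files.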

path = "/users/mypath"
myfiles = os.listdir(path)
for f in myfiles:
       open(f, 'r')
       lines = f.readlines()
       seen_lines = set()
       open(f, 'w')
       for line in lines:
              if line not in seen_lines:
                   seen_lines.add(line)
                   f.write(line)
       f.close()

This produces the following error message:

File "C:\Users\myscripts\myscript.py", line 66, in <module>
    lines=open(f,'r').readlines()
FileNotFoundError: [Errno 2] No such file or directory: 'myfile.txt'

I suspect I have not defined the path correctly. Any suggestions?

Answer 1

Yes, you should prepend the directory path to the file name when opening the file, like this:

cur_file = open(os.path.join(path, f), 'r')

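The same applies when opening the file for writing. Note that f is a string and has no readlines method; you should call readlines on the object returned by open, and the same goes for write.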
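To see why the join is needed, here is a minimal sketch. It assumes a throwaway directory created with tempfile, and the file name urls.txt and the helper list_full_paths are made up for illustration. os.listdir returns bare file names, so open only finds them once they are joined with the directory (unless the script happens to run from inside that directory).

```python
import os
import tempfile

def list_full_paths(directory):
    """Return the full path of every entry in `directory` (hypothetical helper)."""
    return [os.path.join(directory, name) for name in os.listdir(directory)]

# Illustration only: a temporary directory holding one made-up file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "urls.txt"), "w") as fh:
        fh.write("http://example.com\n")

    names = os.listdir(tmp)        # bare names without the directory part
    paths = list_full_paths(tmp)   # names joined with the directory
    readable = [os.path.isfile(p) for p in paths]
```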

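By the way, if you use a set you do not need to check whether a line has already been written, because a set cannot contain duplicates. You can simply add all the lines to a set and then write the set to the output file. Note that a set does not preserve the original order of the lines, and that the lines returned by readlines already end with a newline, so they can be written back as they are:

output_file.writelines(seen_lines)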

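And I think the most efficient way would be:

for f in myfiles:
    full_path = os.path.join(path, f)
    # Read all lines; the set drops duplicates (line order is not preserved)
    with open(full_path, 'r') as cur_file:
        lines = set(cur_file.readlines())
    # Each line still ends with its own newline, so write them back unchanged
    with open(full_path, 'w') as of:
        of.writelines(lines)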
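One caveat with the set-based version is that a set does not keep the URLs in their original order. If order matters, something along these lines may help. It is only a sketch: it relies on dict preserving insertion order (guaranteed since Python 3.7), and the helper name dedupe_keep_order and the sample URLs are assumptions for illustration, not part of the original answer.

```python
def dedupe_keep_order(lines):
    """Drop repeated lines while keeping the first occurrence of each."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))

# Made-up sample data for illustration
sample = ["http://a.example\n", "http://b.example\n", "http://a.example\n"]
unique = dedupe_keep_order(sample)
```

In the loop above, lines = dedupe_keep_order(cur_file.readlines()) could take the place of the set(...) call.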

Answer 2

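You have to join path with the file name to build the full path, and assign the opened file to a variable. Assuming the file name is stored in the variable filename, you can open the file with: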

f = open(os.path.join(path, filename), 'r')

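You should apply the same approach when you open the file in w mode. Remember to import os at the beginning of the script.

An additional comment on your code: you should not keep accumulating the lines themselves in seen_lines, because if your files have many lines this will need a lot of RAM. Instead, compute the hash of each line and accumulate the hashes in a set. This may also be faster than your current code. Be aware that two different lines can, in rare cases, share the same hash, in which case the second one would be dropped.

To sum up, I would use the following code: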

import os

path = "/users/mypath"
myfiles = os.listdir(path)
for filename in myfiles:
    full_path = os.path.join(path, filename)
    # Read the whole file before reopening it for writing
    with open(full_path, 'r') as f:
        lines = f.readlines()
    seen_lines = set()  # hashes of the lines already written to this file
    with open(full_path, 'w') as f:
        for line in lines:
            h = hash(line)
            if h not in seen_lines:
                seen_lines.add(h)
                f.write(line)
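The code above still reads the whole file with readlines, so the memory saved by storing hashes is limited. If memory is a real concern, one possible variant is sketched below. It streams the file line by line, stores a fixed-size digest per line, and writes to a temporary file that then replaces the original. The function name dedupe_file_streaming and the choice of SHA-256 from hashlib are assumptions for illustration. As with any hash-based approach, two different lines with the same digest would be treated as duplicates, although that is extremely unlikely with a cryptographic digest.

```python
import hashlib
import os
import tempfile

def dedupe_file_streaming(file_path):
    """Rewrite `file_path` without repeated lines, reading one line at a time."""
    seen = set()  # 32-byte SHA-256 digests of the lines written so far
    directory = os.path.dirname(os.path.abspath(file_path))
    # Write to a temporary file in the same directory, then swap it in
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    # Binary mode avoids guessing the text encoding of the input files
    with os.fdopen(fd, "wb") as out, open(file_path, "rb") as src:
        for line in src:
            digest = hashlib.sha256(line).digest()
            if digest not in seen:
                seen.add(digest)
                out.write(line)
    os.replace(tmp_path, file_path)
```

It could be called once per file, for example dedupe_file_streaming(os.path.join(path, filename)) inside the loop over myfiles.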

Answer 3

Try replacing

   open(f, 'r')

with

   f = open(os.path.join(path, f), 'r')
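Since os.listdir also returns subdirectories and files that are not .txt, it may be worth filtering the entries before rewriting anything. The following is a minimal sketch using pathlib; the function name dedupe_txt_files is hypothetical, and it assumes each file is small enough to read into memory.

```python
from pathlib import Path

def dedupe_txt_files(directory):
    """Remove repeated lines from every .txt file directly inside `directory`."""
    for txt_path in sorted(Path(directory).glob("*.txt")):
        if not txt_path.is_file():
            continue  # skip directories whose name happens to end in .txt
        with txt_path.open("r") as src:
            lines = src.readlines()
        seen = set()  # lines already written back to this file
        with txt_path.open("w") as dst:
            for line in lines:
                if line not in seen:
                    seen.add(line)
                    dst.write(line)
```

With the path from the question, this would be called as dedupe_txt_files("/users/mypath").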

python

file-handling