Python - Find duplicate files and move to another folder

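I'm writing a small script that lets the user pick a folder to search for duplicate files, whether they are images, text files, and so on. It should then move those duplicates into another folder of the user's choosing.

Here is my code so far: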

from tkinter import Tk
from tkinter.filedialog import askdirectory

import os
import shutil

import hashlib

Tk().withdraw()

source = askdirectory(title="Select the source folder")

walker = os.walk(source)
uniqueFiles = dict()
total = 0

for folder, sub_folder, files in walker:
    for file in files:
        filepath = os.path.join(folder, file)
        filehash = hashlib.md5((open(filepath, "rb").read())).hexdigest()

        if filehash in uniqueFiles:
            print(f"{filepath} is a duplicate")
            total += 1
        else:
            uniqueFiles[filehash] = source

    print("\n# of duplicate files found: {} ".format(total))

    # destination = askdirectory(title="Select the target folder")
    # shutil.move(filepath, destination, copy_function=shutil.copytree)

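It works fine as it stands: it finds all the duplicate files in the folder and its subfolders and prints them out. The part I'm stuck on is moving them. The commented-out code at the bottom seems to work, but it prompts the user for a folder for every duplicate it finds. I just want it to list all the duplicates and then move them all at once.

Any ideas on how I should structure my code?

Thanks!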

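So you have two options here (as mentioned in my comment on your question):

  1. Prompt for the destination directory beforehand
  2. Prompt for the destination directory afterwards

The first option is probably the simplest and most efficient, and requires the least refactoring. However, the user has to pick a destination directory whether or not any duplicates exist, and even if the search later fails with an error, so it may be worse from the user's perspective: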

# prompt for directory beforehand
destination = askdirectory(title="Select the target folder")

for folder, sub_folder, files in walker:
    for file in files:
        filepath = os.path.join(folder, file)
        filehash = hashlib.md5(open(filepath, "rb").read()).hexdigest()

        if filehash in uniqueFiles:
            # the default copy function handles single files;
            # shutil.copytree is only meant for directories
            shutil.move(filepath, destination)
        else:
            # remember the first file seen with this content
            uniqueFiles[filehash] = filepath

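The second option lets you do all the necessary checks and error handling before prompting, but it is more complex and requires more refactoring: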

# dictionary of hashes to all files
hashes = {}

for folder, sub_folder, files in walker:
    for file in files:
        filepath = os.path.join(folder, file)
        filehash = hashlib.md5(open(filepath, "rb").read()).hexdigest()

        if filehash in hashes:
            hashes[filehash].append(filepath)
        else:
            hashes[filehash] = [filepath]

# prompt for the destination directory afterwards
destination = askdirectory(title="Select the target folder")

for duplicates in hashes.values():
    if len(duplicates) < 2:
        continue

    # keep the first file in place and move the remaining copies
    for duplicate in duplicates[1:]:
        shutil.move(duplicate, destination)
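
The second option can also be split into two small functions, which keeps the search separate from the move and makes each step easy to try on its own. This is only a sketch under a few assumptions: the names `find_duplicates` and `move_duplicates` are made up for illustration, the first file found in each group is treated as the original and left in place, and a numeric suffix is added when a file with the same name already exists in the destination.

```python
import hashlib
import os
import shutil


def find_duplicates(root):
    """Return lists of paths under root whose contents hash identically."""
    hashes = {}
    for folder, _sub_folders, files in os.walk(root):
        for name in sorted(files):
            filepath = os.path.join(folder, name)
            with open(filepath, "rb") as f:
                filehash = hashlib.md5(f.read()).hexdigest()
            hashes.setdefault(filehash, []).append(filepath)
    # keep only the hashes that were seen more than once
    return [paths for paths in hashes.values() if len(paths) > 1]


def move_duplicates(groups, destination):
    """Move every file except the first of each group into destination."""
    moved = []
    for paths in groups:
        for filepath in paths[1:]:
            base, ext = os.path.splitext(os.path.basename(filepath))
            target = os.path.join(destination, base + ext)
            counter = 1
            # assumption: rename on a name collision rather than overwrite
            while os.path.exists(target):
                target = os.path.join(destination, f"{base}_{counter}{ext}")
                counter += 1
            shutil.move(filepath, target)
            moved.append(target)
    return moved
```

With the dialogs from the question, the search would run first as `groups = find_duplicates(source)`, and only then would `askdirectory` be called once for the destination passed to `move_duplicates`. It is probably worth checking that the dialog was not cancelled before moving anything.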

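As a side note, I'm not familiar with hashlib, but I suspect you'll want to close the files you're hashing, especially when checking a large file tree: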

with open(filepath, "rb") as file:
    filehash = hashlib.md5(file.read()).hexdigest()
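
For very large files, reading the whole file into memory with `file.read()` can itself be costly. One possible variation, sketched here with a hypothetical helper named `hash_file` and an arbitrarily chosen chunk size, feeds the file to the hash object in blocks while still closing it when done:

```python
import hashlib


def hash_file(filepath, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

`filehash = hash_file(filepath)` could then replace the hashing line in either loop.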