Python's os.walk() visits all folders instead of only the given folder

Question

I want to use a simple script to collect all images below a given folder and compare them to find duplicates.

Why reinvent the wheel when the first step of a solution already exists: Finding duplicate files and removing them

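But it already fails at the first step: it visits every folder on the USB flash drive, not just the given one. I stripped out all the hashing code and now only try to build the list of files, but even that runs forever and visits every file on the USB drive.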

from __future__ import print_function   # py2 compatibility
from collections import defaultdict
import hashlib
import os
import sys


folder_to_check = r"D:\FileCompareTest"  # raw string, so the backslash is taken literally

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, full_path_to_file2, ]
    hashes_full = {}   # dict of full_file_hash: full_path_to_file_string

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will 
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except (OSError,):
                    # not accessible (permissions, etc) - pass on
                    continue




check_for_duplicates(folder_to_check)

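Instead of getting the hashes_by_size list within milliseconds, I am stuck in what looks like an endless loop, or the program exits hours later after having listed every file on the USB drive.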

What am I not understanding about os.walk()?

Answer 1

You should call it like this:

paths_to_check = []
paths_to_check.append(folder_to_check)
check_for_duplicates(paths_to_check)

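The way you call it, the function receives a single string, so the loop "for path in paths" iterates over its characters, and os.walk() is started on each character instead of on the intended path. One of those characters is the backslash; on Windows, os.walk("\\") walks the root of the current drive, which is presumably why every folder on the USB drive is visited if the script runs from that drive.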
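To illustrate the point above, here is a minimal sketch of a defensive variant. The helper name iter_files is hypothetical (it is not part of the original script); it wraps a bare string in a list so that a single path and a list of paths behave the same. The temporary directory and the file a.jpg exist only for the demonstration.

```python
import os
import tempfile


def iter_files(paths):
    """Yield the full path of every file below each root in `paths`.

    Hypothetical helper, not from the original script.
    """
    # A bare string is itself iterable, so wrap it in a list;
    # otherwise the loop below would walk 'D', ':', '\\', ... one by one.
    if isinstance(paths, (str, bytes)):
        paths = [paths]
    for root in paths:
        for dirpath, dirnames, filenames in os.walk(root):
            for filename in filenames:
                yield os.path.join(dirpath, filename)


# What "for path in paths" sees when paths is a plain string:
print(list("D:\\FileCompareTest")[:3])  # ['D', ':', '\\']

# Demonstration on a throwaway directory containing one file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "a.jpg"), "w").close()
    found_from_string = sorted(iter_files(tmp))
    found_from_list = sorted(iter_files([tmp]))
```

With this guard, passing either folder_to_check or [folder_to_check] walks only the intended folder. Whether to accept both forms or to require a list is a design choice; the fix in the answer (always pass a list) is the simpler of the two.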

Tags: python, os.walk, python-3.x