Python os.walk complex directory criteria

Question

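I need to scan a directory containing hundreds of GB of data. It has structured parts (which I want to scan) and unstructured parts (which I don't want to scan).

Reading up on the os.walk function, I see that I can use a set of criteria to exclude or include certain directory names or patterns.

For this particular scan I need specific include/exclude criteria at each level of the directory tree, for example: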

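In the root directory, say there are two useful directories, 'Dir A' and 'Dir B', and one useless trash directory, 'Trash'. Inside Dir A there are two useful subdirectories, 'Subdir A1' and 'Subdir A2', plus a useless 'SubdirA Trash' directory. Inside Dir B there are two useful subdirectories, 'Subdir B1' and 'Subdir B2', plus a useless 'SubdirB Trash' subdirectory. It would look like this:

    Root
        Dir A
            Subdir A1
            Subdir A2
            SubdirA Trash
        Dir B
            Subdir B1
            Subdir B2
            SubdirB Trash
        Trash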

I need a specific list of criteria for each level, something like this:

level1DirectoryCriteria = {"Dir A", "Dir B"}

level2DirectoryCriteria = {"Subdir A1", "Subdir A2", "Subdir B1", "Subdir B2"}

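The only approaches I can think of are clearly un-Pythonic: complex, lengthy code with lots of variables and a high risk of instability. Does anyone have any ideas on how to solve this? If it works, it could save several hours of runtime on each scan.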

Answer 1

You could try something like this:

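import os

root = "."  # placeholder: top-level directory to scan
to_scan = {'set', 'of', 'good', 'directories'}
for dirpath, dirnames, filenames in os.walk(root):
    # editing dirnames in place tells os.walk which subdirectories to descend into
    dirnames[:] = [d for d in dirnames if d in to_scan]
    # whatever you wanted to do in this directory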

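This solution is simple, but it fails if you want to scan directories with a certain name when they appear under one directory and not when they appear under another. Another option is a dictionary that maps directory names to lists or sets of whitelisted or blacklisted directories.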
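The dictionary idea could look roughly like the sketch below. The function name walk_with_whitelist and the choice to key the dictionary by path relative to the root (with "." for the root itself) are my own assumptions for illustration, not part of os.walk. Directories with no entry in the dictionary are visited but not descended into.

```python
import os

def walk_with_whitelist(root, allowed_children):
    """Yield (dirpath, filenames), descending only into whitelisted children.

    allowed_children (hypothetical structure) maps a directory's path relative
    to root ("." for root itself) to the set of child names to descend into.
    Nested keys must be built with os.path.join so they match os.path.relpath.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        allowed = allowed_children.get(rel, set())
        # In-place edit: os.walk only descends into what is left in dirnames
        dirnames[:] = [d for d in dirnames if d in allowed]
        yield dirpath, filenames
```

With something like allowed_children = {".": {"Dir A", "Dir B"}, "Dir A": {"Subdir A1", "Subdir A2"}}, a 'Subdir A1' that happens to sit under Dir B would be skipped, which the single to_scan set cannot express.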

Edit: we can use dirpath.count(os.path.sep) to determine the depth.

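import os

# root: the top-level directory to scan, as in the snippet above
root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0
sets_by_level = [{'root', 'level'}, {'one', 'deep'}]
for dirpath, dirnames, filenames in os.walk(root):
    depth = dirpath.count(os.path.sep) - root_depth
    # past the last configured level, stop descending instead of raising IndexError
    allowed = sets_by_level[depth] if depth < len(sets_by_level) else set()
    dirnames[:] = [d for d in dirnames if d in allowed]
    # process this directory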
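Here is a minimal sketch of the same depth idea wrapped in a generator, in case that is easier to reuse. The name walk_by_level is hypothetical. I am also assuming you want to stop descending once you run out of configured levels, and that the root may be passed with a trailing separator (which would otherwise throw the depth count off by one), so it is normalized first.

```python
import os

def walk_by_level(root, sets_by_level):
    """Yield os.walk tuples, keeping only whitelisted directory names per depth.

    sets_by_level[0] filters the directories directly under root,
    sets_by_level[1] filters their children, and so on.
    """
    root = os.path.normpath(root)  # drop a trailing separator so depths line up
    root_depth = root.count(os.path.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.count(os.path.sep) - root_depth
        # Assumption: no entry for this depth means "do not go any deeper"
        allowed = sets_by_level[depth] if depth < len(sets_by_level) else set()
        dirnames[:] = [d for d in dirnames if d in allowed]
        yield dirpath, dirnames, filenames
```

For the layout in the question, sets_by_level would be [{"Dir A", "Dir B"}, {"Subdir A1", "Subdir A2", "Subdir B1", "Subdir B2"}], so 'Trash', 'SubdirA Trash' and 'SubdirB Trash' are never entered.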

Answer 2

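Not a direct answer about os.walk, just a suggestion: since you are scanning the directories anyway, and you evidently know which directories are trash, you could also place a dummy file named something like skip_this_dir in each trash directory. As you iterate over the directories and build your file list, you check whether the skip_this_dir file exists, e.g. if 'skip_this_dir' in filenames: continue, and move on to the next iteration.

This may not involve using os.walk parameters, but it does make the programming task easier to manage, without having to write a lot of 'messy' code full of conditions and include/exclude lists. It also makes the script easier to reuse, because you don't need to change any code; you just drop the dummy file into the directories that need to be skipped.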
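A rough sketch of the marker-file idea, assuming the marker is literally named skip_this_dir (the function name walk_skipping_marked is made up for this example). One thing to watch: a bare continue only skips the marked directory itself, and os.walk would still descend into its subdirectories, so the sketch also empties dirnames before continuing.

```python
import os

SKIP_MARKER = "skip_this_dir"  # dummy file name suggested in this answer

def walk_skipping_marked(root, marker=SKIP_MARKER):
    """Yield (dirpath, filenames) for every directory not marked as skippable."""
    for dirpath, dirnames, filenames in os.walk(root):
        if marker in filenames:
            dirnames[:] = []  # prune everything below the marked directory too
            continue
        yield dirpath, filenames
```

Note that os.walk still has to list the marked directory once to see the marker, so a trash directory holding a huge number of entries directly inside it is still listed a single time.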

Answer 3

By using root.count(os.path.sep), I was able to create specific instructions for what to include/exclude at each level of the structure. It looks like this:

import os

root = "."  # placeholder: top-level directory to scan
root_depth = root.count(os.path.sep)  # subtract this from all depths to normalize root to 0

# Directory names to descend into, indexed by depth
directoriesToIncludedByLevel = [{"criteriaString", "criteriaString", "criteriaString", "criteriaString"},  # Level 0
                                {"criteriaString", "criteriaString", "criteriaString"},  # Level 1
                                set(),  # Level 2 (set() rather than {}, which would be an empty dict)
                                ]

# Directory names to skip, indexed by depth
directoriesToExcludedByLevel = [set(),  # Level 0
                                set(),  # Level 1
                                {"criteriaString"},  # Level 2
                                ]


for dirpath, dirnames, filenames in os.walk(root):

    depth = dirpath.count(os.path.sep) - root_depth

    # Filter dirnames in place, using either directoriesToIncludedByLevel or directoriesToExcludedByLevel depending on depth
    if depth == 2:  # Where we define which directories to exclude
        dirnames[:] = [d for d in dirnames if d not in directoriesToExcludedByLevel[depth]]
    elif depth < 2:  # Where we define which directories to include
        dirnames[:] = [d for d in dirnames if d in directoriesToIncludedByLevel[depth]]
    # Deeper than level 2, dirnames is left untouched, so everything below is scanned

Answer 4

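I was looking for a solution similar to the OP's. I needed to scan subfolders and exclude any folder labeled 'trash'. My solution was to use the string find() method. Here is how I used it: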

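import os

your_path = "."  # placeholder: directory to scan
for dirpath, dirnames, filenames in os.walk(your_path):
    if dirpath.find('trash') >= 0:  # 'trash' appears somewhere in the path
        continue  # skip this directory
    print(dirpath)  # stand-in for do_stuff: process this directory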

If 'trash' is found, find() returns the index at which it starts (which can be 0). Otherwise find() returns -1.

You can find more information about the find() method here: https://www.tutorialspoint.com/python/string_find.htm
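A possible variation on this, not the exact method above: checking dirpath after the fact still makes os.walk descend into every trash folder, and find('trash') is case-sensitive, so it would miss names like 'Trash' or 'SubdirA Trash' from the question. The sketch below prunes matching names from dirnames instead, with a case-insensitive substring test. The function name walk_without_trash and the needle parameter are assumptions of mine.

```python
import os

def walk_without_trash(root, needle="trash"):
    """Yield (dirpath, filenames), never descending into directories whose
    name contains `needle` (compared case-insensitively)."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning here means the trash folders are never listed at all
        dirnames[:] = [d for d in dirnames if needle not in d.lower()]
        yield dirpath, filenames
```

Because the test is a plain substring match, it would also skip a folder named something like 'trashcan_designs', so a stricter rule may be needed depending on your naming scheme.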
