os.walk/scandir slow on network drive

Question

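I'm trying to find all .xlsm files (and get their stat info) on the network drive O:\, provided they are not in a folder named Test. I was using os.walk and switched to scandir.walk because it is faster. Now I'm just limited by network speed. This code seems to involve a lot of interaction between the script and the network drive. My code is below. Is there a way to speed this up, perhaps with a batch file? I'm on Windows.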

from scandir import scandir, walk
import sys

def subdirs(path):
    for path, folders, files in walk(path):
        if 'Test' not in path:
            for sub_files in scandir(path):
                if '.xlsm' in sub_files.path:
                    yield sub_files.stat()

# The backslash must be escaped; 'O:\' is an unterminated string literal
for i in subdirs('O:\\'):
    print(i)

Answer 1

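You're scanning every path twice: once implicitly through walk, and then again by explicitly calling scandir on the path that walk returned, for no reason. walk has already returned the files, so the inner loop can avoid the double scan by using what it was given: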

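import os
from scandir import walk

def subdirs(path):
    for path, folders, files in walk(path):
        # files is the list of names walk already collected for this directory
        for file in files:
            if '.xlsm' in file:
                yield os.path.join(path, file)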

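To address the updated question, you'd probably want to either copy the existing scandir.walk code and modify it to return lists of DirEntry objects instead of lists of names, or write similar special-purpose code for your specific needs. Either way, this lets you avoid the double scan while keeping scandir's low-overhead behavior. For example: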

from scandir import scandir

def scanwalk(path, followlinks=False):
    '''Simplified scandir.walk; yields lists of DirEntries instead of lists of str'''
    dirs, nondirs = [], []
    # scandir is imported as a function above, so call it directly
    for entry in scandir(path):
        if entry.is_dir(follow_symlinks=followlinks):
            dirs.append(entry)
        else:
            nondirs.append(entry)
    yield path, dirs, nondirs
    # dirs is walked after the yield, so the caller can prune it in place
    for dir in dirs:
        for res in scanwalk(dir.path, followlinks=followlinks):
            yield res

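You can then replace your use of walk with it like this (I also added code that prunes directories with Test in their name, since every directory and file below them would be rejected by your original code, yet you would still traverse them unnecessarily):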

def subdirs(path):
    # Full prune if the path already contains Test
    if 'Test' in path:
        return
    for path, folders, files in scanwalk(path):
        # Remove any directory with Test to prevent traversal
        folders[:] = [d for d in folders if 'Test' not in d.name]
        for file in files:
            if '.xlsm' in file.path:
                yield file.stat()  # Maybe just yield file to get raw DirEntry?

# The backslash must be escaped; 'O:\' is an unterminated string literal
for i in subdirs('O:\\'):
    print(i)
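Note that the substring checks above also match names like Testing or report.xlsm.bak. If the intent is a folder named exactly Test and files whose names end in .xlsm, a stricter variant could look like the sketch below. The function name find_xlsm and its parameters are made up for illustration, and the fallback to os.scandir assumes Python 3.5+ when the scandir package is not installed. It does not check whether the root itself is named Test.

```python
try:
    from scandir import scandir  # backport package used in the question
except ImportError:
    from os import scandir  # assumption: Python 3.5+, where scandir is built in


def find_xlsm(root, skip_name='Test', ext='.xlsm'):
    """Yield (path, stat_result) for files whose name ends in ext.

    Directories named exactly skip_name are never entered, and every
    directory is listed only once. Hypothetical helper, not part of scandir.
    """
    pending = [root]
    while pending:
        current = pending.pop()
        for entry in scandir(current):
            if entry.is_dir(follow_symlinks=False):
                # Prune by exact name so 'Testing' is still visited
                if entry.name != skip_name:
                    pending.append(entry.path)
            elif entry.name.lower().endswith(ext):
                # Reuse the DirEntry for stat instead of a separate os.stat(path)
                yield entry.path, entry.stat()
```

It would be used the same way as subdirs, for example: for path, st in find_xlsm('O:\\'): print(path, st.st_size).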

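As an aside, you may want to double-check that you properly installed/built _scandir, the C accelerator for scandir. If _scandir isn't built, the scandir module provides fallback implementations using ctypes, but they are significantly slower, which could explain the performance problems. Try running import _scandir in an interactive Python session; if it raises ImportError, you don't have the accelerator, so you're using the slow fallback implementation.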
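The same check can be scripted rather than typed interactively. This is only a small sketch; the helper name has_module is made up here, and the check is only meaningful when the scandir backport package is in use (with the built-in os.scandir of Python 3.5+ there is no _scandir module to look for).

```python
import importlib


def has_module(name):
    """Return True if the named module can be imported, False on ImportError."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == '__main__':
    if has_module('_scandir'):
        print('C accelerator _scandir is available')
    else:
        print('_scandir is missing; scandir is likely using the slower fallback')
```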


python

batch-file

os.walk

scandir

python-2.7