How can I improve performance of finding all files in a folder created at a certain date?

Question

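A folder contains 10,000 files. A few of them were created on 2018-06-01, a few on 2018-06-09, and so on.

I need to find all the files that were created on 2018-06-09. But reading every file, getting its creation date, and then picking out the ones created on 2018-06-09 takes a lot of time (almost 2 hours).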

import os
from datetime import datetime

for file in os.scandir(Path):
    if file.is_file():
        file_ctime = datetime.fromtimestamp(os.path.getctime(file)).strftime('%Y-%m-%d %H:%M:%S')
        if file_ctime[0:10] == '2018-06-09':  # the first 10 characters are the date part
            pass  # ...

Answer 1

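You can try using os.listdir(path) to get all the files and directories in the given path.

Once you have all the files and directories, you can use filter with a lambda function to create a new list containing only the files with the desired timestamp.

You can then iterate over that list to do what you need with the right files.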
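A minimal sketch of that idea follows. The helper name files_created_between and the choice to express the day as a pair of POSIX timestamps are assumptions for illustration and do not come from the answer itself.

```python
import os


def files_created_between(folder, start_ts, end_ts):
    """Return paths of regular files in folder with start_ts <= st_ctime < end_ts."""
    # os.listdir() yields bare names, so build full paths before calling stat
    paths = (os.path.join(folder, name) for name in os.listdir(folder))
    return list(filter(
        lambda p: os.path.isfile(p) and start_ts <= os.stat(p).st_ctime < end_ts,
        paths,
    ))
```

This version stats each path twice (once in os.path.isfile() and once in os.stat()), so it is probably not the fastest variant, but it keeps the filter in a single expression.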

Answer 2

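Let's start with the most basic thing: why are you building a datetime only to re-format it as a string and then do a string comparison?

Then there's the whole point of using os.scandir() over os.listdir(): os.scandir() returns an os.DirEntry, which caches file stats through the os.DirEntry.stat() call.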
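To illustrate the caching, here is a small sketch (the helper name ctime_by_name is hypothetical). Calling os.path.getctime(entry), as the question does, runs a fresh stat on the path and bypasses the cache, whereas entry.stat() reuses the result stored on the os.DirEntry.

```python
import os


def ctime_by_name(folder):
    """Map each regular file's name to its st_ctime using DirEntry stats."""
    result = {}
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():
                # entry.stat() is cached on the DirEntry after the first call
                result[entry.name] = entry.stat().st_ctime
    return result
```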

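Depending on the checks you need to perform, os.listdir() might even perform better if you expect to do a lot of filtering on the filename, as then you won't need to build up a whole os.DirEntry just to discard it.

So, to optimize your loop, if you don't expect a lot of filtering on the name: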

# 1528495200 and 1528581600 are 2018-06-09 00:00 and 2018-06-10 00:00 at UTC+2
for entry in os.scandir(Path):
    if entry.is_file() and 1528495200 <= entry.stat().st_ctime < 1528581600:
        pass  # do whatever you need with it

If you do, then it's better to stick with os.listdir(), as in:

import os
import stat

for entry in os.listdir(Path):
    # do your filtering on the entry name first...
    path = os.path.join(Path, entry)  # build path to the listed entry...
    stats = os.stat(path)  # cache the file entry statistics
    if stat.S_ISREG(stats.st_mode) and 1528495200 <= stats.st_ctime < 1528581600:
        pass  # do whatever you need with it

If you want to be flexible with the timestamps, use datetime.datetime.timestamp() beforehand to get the POSIX timestamps; you can then compare them directly against what stat_result.st_ctime returns, without any conversion.
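As a rough sketch of that approach, the bounds can be computed once from a date and then reused inside the loop. The function name files_created_on and its day parameter (a datetime.date) are assumptions for illustration, and the day is interpreted in the machine's local time zone. Keep in mind that st_ctime is the creation time on Windows but the last metadata change time on Unix.

```python
import os
from datetime import datetime, timedelta


def files_created_on(folder, day):
    """Return names of regular files in folder whose st_ctime falls on the local date day."""
    # A naive datetime is treated as local time by timestamp()
    start = datetime(day.year, day.month, day.day)
    start_ts = start.timestamp()
    end_ts = (start + timedelta(days=1)).timestamp()
    matches = []
    with os.scandir(folder) as entries:
        for entry in entries:
            # Plain float comparison, no per-file datetime or string formatting
            if entry.is_file() and start_ts <= entry.stat().st_ctime < end_ts:
                matches.append(entry.name)
    return matches
```

With from datetime import date in scope, a call such as files_created_on(Path, date(2018, 6, 9)) would cover the case in the question.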

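However, for a mere 10k entries, even your original, unoptimized approach should be much faster than 2 hours. I'd check the underlying filesystem as well; something seems off there.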

Tags: python, performance, scandir