Fastest method to read many files line by line in Python

Question

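I have a conceptual question. I am new to Python and I am looking to take on tasks that involve processing larger log files. Some of these can reach 5 or 6 GB.

I need to parse many files in one location. These are text files.

I know about the with open() approach, and I only recently ran into pathlib. So not only do I need to read each file line by line to extract values to upload into a database, I also need to get the file attributes that pathlib gives you and upload those as well.

Is it faster to use open and create a Path object beneath it to read the file, like this: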

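from glob import glob
from pathlib import Path

# Note: with recursive=False, '**' behaves like '*', so this matches one directory level only
for filename in glob('**/*.*', recursive=False):
    fpath = Path(filename)  # Path object for file attributes, e.g. fpath.stat()
    with open(filename, 'rb', buffering=102400) as logfile:
        for line in logfile:
            # regex operation
            print(line)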

Or would it be better to use pathlib:

from pathlib import Path

# Path.open() takes the same arguments as the built-in open()
with Path("src/module.py").open("r") as contents:
    for line in contents:
        # regex operation
        print(line)

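Also, I have never used pathlib to open a file for reading. The signature is: Path.open(mode='r', buffering=-1, encoding=None, errors=None, newline=None)

What do newline and errors mean? I assume buffering here is the same as buffering in the built-in open function?

I have also seen this construct, which uses open together with a Path object, though I don't know how it works: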

from pathlib import Path

path = Path('.editorconfig')
with open(path, mode='wt') as config:
    config.write('# config goes here')

Answer 1

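pathlib is intended to be a more elegant way of interacting with the file system, but it is not required. It adds a small amount of fixed overhead (because it wraps other lower-level APIs), but it should not change how performance scales in any meaningful way.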
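To illustrate why that overhead is paid per file rather than per line, here is a minimal sketch that walks a directory with pathlib, reads the attributes once per file, and then streams the lines. The names scan_logs and ERROR_RE, the "*.log" extension, and the "ERROR <number>" pattern are all hypothetical stand-ins, not taken from the question.

```python
import re
from pathlib import Path

# Hypothetical pattern; replace with whatever the logs actually contain.
ERROR_RE = re.compile(rb"ERROR\s+(\d+)")


def scan_logs(root, pattern="*.log"):
    """Yield (path, size_bytes, mtime, matches) for each matching file under root."""
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        info = path.stat()  # one stat call per file gives size, mtime, etc.
        matches = []
        # Iterating the file object reads one line at a time, so a
        # multi-gigabyte file never has to fit in memory.
        with path.open("rb") as logfile:
            for line in logfile:
                m = ERROR_RE.search(line)
                if m:
                    matches.append(int(m.group(1)))
        yield path, info.st_size, info.st_mtime, matches
```

The Path calls happen once per file; the inner loop is the same file-object iteration you would get from the built-in open, which is where nearly all the time goes for large files.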

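Because, as noted, pathlib is mostly a wrapper around lower-level APIs, you should know that Path.open is implemented in terms of open, and the arguments all mean the same thing in both; the documentation for the built-in open describes each argument.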
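As a small illustration of the two arguments asked about, the sketch below writes a throwaway file (the file name and contents are made up for the example) and reads it back twice: once through Path.open and once through the built-in open. errors controls what happens when bytes cannot be decoded, and newline controls how line endings are translated.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.log"
    # Two lines with Windows line endings, plus one byte (0xff) that is not valid UTF-8.
    sample.write_bytes(b"first\r\nbad \xff byte\r\n")

    # Default newline=None: "\r\n" is translated to "\n" when reading.
    # errors="replace": undecodable bytes become U+FFFD instead of raising.
    with sample.open("r", encoding="utf-8", errors="replace") as f:
        translated = f.readlines()

    # newline="": lines are still split, but the endings are left untranslated.
    with open(sample, "r", encoding="utf-8", errors="replace", newline="") as f:
        untranslated = f.readlines()
```

With the default errors=None (strict handling), the same read would raise UnicodeDecodeError on the 0xff byte. The question's first snippet opens files in 'rb' mode, where neither argument applies because no decoding or newline translation takes place.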

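As for the last part of your question (passing a Path object to the built-in open), that works because most file-related APIs have been updated to support any object that implements the os.PathLike ABC.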
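A minimal sketch of what implementing os.PathLike means in practice: any object with an __fspath__ method can be converted to a path with os.fspath, which is the conversion that path-accepting functions such as open rely on. LogLocation is a hypothetical class invented for this example.

```python
import os
from pathlib import Path


class LogLocation:
    """Hypothetical class that is path-like because it defines __fspath__."""

    def __init__(self, directory, name):
        self.directory = directory
        self.name = name

    def __fspath__(self):
        return os.path.join(self.directory, self.name)


loc = LogLocation("logs", "app.log")

# os.fspath() asks the object for its file system path.
as_str = os.fspath(loc)
path_str = os.fspath(Path("logs") / "app.log")

# os.PathLike recognizes any class that defines __fspath__.
is_pathlike = isinstance(loc, os.PathLike)
```

Path defines __fspath__ in the same way, which is why open(path, mode='wt') in the question works without converting the Path to a string first.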

python

python-3.x

pathlib