Fastest way to read many inputs in PyPy3 and what is BytesIO doing here?

Question

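I was recently working on a problem that required me to read many lines of numbers (around 500,000).

Early on I found that input() was too slow. Using stdin.readline() was much better, but still not fast enough. I then found that using the following code: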

import io, os
input = io.BytesIO(os.read(0,os.fstat(0).st_size)).readline

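and calling input() this way improved the running time. However, I don't actually understand how this code works. According to the documentation for os.read, the 0 in os.read(0, os.fstat(0).st_size) describes the file we are reading from. What file does 0 describe? Also, fstat describes the status of the file we are reading from, but apparently its result is passed in as the maximum number of bytes to read?

The code works, but I would like to understand what it is doing and why it is faster. Any help is appreciated.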

Answer 1

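0 is the file descriptor for standard input. os.fstat(0).st_size tells Python how many bytes are currently waiting on standard input (when input is redirected from a file, this is the file's size). os.read(0, ...) then reads that many bytes in bulk from standard input, producing a bytestring.

(As a side note, 1 is the file descriptor for standard output, and 2 is standard error.)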
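To make the "a file descriptor is just an integer you hand to os.read" point concrete, here is a minimal sketch that does not touch the real standard input. It creates a pipe with os.pipe(), writes a few bytes into one end, and reads them back from the other end by descriptor number. The helper name read_through_pipe is made up for this illustration.

```python
import os


def read_through_pipe(payload: bytes) -> bytes:
    """Write a small payload into a fresh pipe and read it back by descriptor.

    Hypothetical helper for illustration. Keep the payload small: a write
    larger than the pipe's buffer would block, since nothing is reading yet.
    """
    read_fd, write_fd = os.pipe()  # two plain integers, like 0, 1 and 2
    try:
        os.write(write_fd, payload)
    finally:
        os.close(write_fd)  # closing the write end lets the reader reach EOF

    try:
        chunks = []
        while True:
            # Same call shape as os.read(0, n): descriptor, then max byte count.
            chunk = os.read(read_fd, 4096)
            if not chunk:  # an empty bytes object signals end of file
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(read_fd)


if __name__ == "__main__":
    print(read_through_pipe(b"five\n"))
```

Passing 0 instead of read_fd would make the same os.read call pull from standard input.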

Here is a demonstration:

echo "five" | python3 -c "import os; print(os.stat(0).st_size)"
# => 5

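Python found four single-byte characters and a newline waiting on standard input, and reported five bytes waiting to be read. (What st_size reports for a pipe is platform-dependent: POSIX does not specify it for pipes, and some systems, Linux among them, report 0 there. For input redirected from a regular file, st_size is the file's size.)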
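The regular-file case, which is what you get with something like `python3 solution.py < input.txt`, can be checked without touching standard input. The sketch below writes some bytes to a temporary file and then uses the same fstat-then-read pair on that file's descriptor. The function name size_and_contents is an assumption made for this example.

```python
import os
import tempfile


def size_and_contents(data: bytes):
    """Return (st_size, bytes read) for a regular file that holds `data`.

    Hypothetical helper: it stands in for stdin redirected from a file.
    """
    with tempfile.TemporaryFile() as tmp:
        tmp.write(data)
        tmp.flush()  # push Python's buffered bytes down to the OS
        fd = tmp.fileno()
        os.lseek(fd, 0, os.SEEK_SET)  # rewind the descriptor to the start
        size = os.fstat(fd).st_size  # for a regular file: its size in bytes
        return size, os.read(fd, size)  # one bulk read of up to `size` bytes


if __name__ == "__main__":
    print(size_and_contents(b"five\n"))
```

For a small regular file like this, the single os.read returns the whole content, which is why the one-liner in the question works when input is redirected from a file.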

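Bytestrings are not very convenient to work with if you want text; for one thing, they don't really understand the concept of "lines". So BytesIO wraps the bytestring in a fake input stream that you can call readline on. I'm not 100% sure why this is faster, but my guesses are:

Normal reads are most likely done character by character, so that a newline can be detected and the read can stop without consuming too much; a bulk read is more efficient (and finding newlines in memory after the fact is very fast).
No text decoding is done this way; the lines stay as bytes.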
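As a small illustration of the BytesIO part on its own, the sketch below wraps an in-memory bytestring and pulls lines out of it with readline. It relies on the fact that int() accepts bytes and ignores surrounding whitespace, so the numbers never need to be decoded to str. The function name parse_ints is made up for the example.

```python
import io


def parse_ints(raw: bytes):
    """Parse one integer per line from an in-memory bytestring."""
    stream = io.BytesIO(raw)  # file-like view over bytes already in memory
    values = []
    # readline() returns b"" at end of data, which stops the iteration.
    for line in iter(stream.readline, b""):
        if line.strip():  # skip blank lines
            values.append(int(line))  # int() accepts bytes such as b"10\n"
    return values


if __name__ == "__main__":
    print(parse_ints(b"3\n10\n-7\n"))
```

If you do need text rather than numbers, call line.decode() on the individual lines you care about.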

Answer 2

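os.read has a signature I'll call fd, size. Setting size to the number of bytes left in fd makes everything that remains come rushing at you like a tsunami. There are also the "standard file descriptors": 0=stdin, 1=stdout, 2=stderr.

The code, deconstructed: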

import io, os  # Utilities
input = (  # Replace the input built-in
    io.BytesIO(  # Create a fake file
        os.read(  # Read data from a file descriptor
            0,  # stdin
            os.fstat(0)  # Information about stdin
            .st_size,  # Bytes left in the file
        )
    )
    .readline  # When called, gets a line of the file
)
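Putting the pieces together, here is a hedged variation on the one-liner rather than a replacement for it. It keeps calling os.read until end of file, so it does not depend on st_size being accurate (for example when input arrives through a pipe) or on one read returning everything. The names make_fast_input and sum_numbers, the 65536-byte minimum read size, and the "count followed by that many integers" input format are all assumptions made for this sketch.

```python
import io
import os


def make_fast_input(fd: int = 0):
    """Return a readline callable over everything readable from `fd`.

    Sketch only: st_size is used as a hint for the read size, and the loop
    continues until os.read reports end of file with an empty result.
    """
    size_hint = max(os.fstat(fd).st_size, 65536)
    chunks = []
    while True:
        chunk = os.read(fd, size_hint)
        if not chunk:  # b"" means end of file
            break
        chunks.append(chunk)
    return io.BytesIO(b"".join(chunks)).readline


def sum_numbers(fd: int = 0) -> int:
    """Hypothetical task: the first line is a count n, then n integers follow."""
    readline = make_fast_input(fd)
    n = int(readline())  # int() accepts the bytes line, newline included
    return sum(int(readline()) for _ in range(n))
```

In a contest script you would write `input = make_fast_input()` near the top and keep calling input() as before, remembering that each call now returns bytes with the trailing newline still attached.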

pypy

python-3.x