How to set the "chunk size" of lines read from a file with Python subprocess.Popen() or open()?

Question

I have a fairly large text file that I would like to process in chunks. To do this with the subprocess library, one would execute the following shell command:

"cat hugefile.log"

using the code:

import subprocess
task = subprocess.Popen("cat hugefile.log", shell=True,  stdout=subprocess.PIPE)
data = task.stdout.read()

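Using print(data) spits out the entire contents of the file at once. How can I get the number of chunks, and then access the contents of this file one chunk at a time (e.g. one chunk = three lines)?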

It should look something like this:

chunksize = 1000   # break up hugefile.log into 1000 chunks

for chunk in data:
    print(chunk)

The equivalent question for Python's open() would of course use the code:

with open('hugefile.log', 'r') as f:
    read_data = f.read()

How would you read read_data in chunks?

Answer 1

With a file, you can iterate over the file handle directly (no need for a subprocess running cat):

with open('hugefile.log', 'r') as f:
    for read_line in f:
        print(read_line)

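Python reads a line by reading all characters up to and including \n. To emulate line-by-line I/O in groups of three, just call it 3 times. You could instead read raw data and count 3 \n characters, but then you have to handle the end of the file and so on. That is not very useful, and you would not gain any speed by doing it.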

with open('hugefile.log', 'r') as f:
    while True:
        read_3_lines = ""
        try:
            for i in range(3):
                read_3_lines += next(f)
            # process read_3_lines (a full group of 3 lines)
        except StopIteration:  # end of file
            # process read_3_lines here if it is non-empty
            # (number of lines not divisible by 3)
            break
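
The same grouping can be written without the try/except by letting itertools.islice pull up to three lines at a time. This is only a sketch: the helper name iter_line_chunks and the in-memory sample data are made up for illustration and are not part of the original answer.

```python
import io
from itertools import islice

def iter_line_chunks(lines, lines_per_chunk=3):
    """Yield lists of up to lines_per_chunk lines from any iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, lines_per_chunk))
        if not chunk:      # nothing left to read
            return
        yield chunk        # the last chunk may be shorter than lines_per_chunk

# Demo on an in-memory file (hypothetical data); a file object returned by
# open() can be passed in exactly the same way.
sample = io.StringIO("a\nb\nc\nd\ne\n")
chunks = list(iter_line_chunks(sample, 3))
```

With a real file this would be used as: with open('hugefile.log') as f, then loop over iter_line_chunks(f). The trailing partial chunk is yielded like any other, so no special end-of-file handling is needed.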

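With Popen you can do exactly the same thing and, as a bonus, add poll to monitor the process (cat is not needed here, but I assume your real process is different and cat only serves the purpose of the question):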

import subprocess
task = subprocess.Popen("cat hugefile.log", shell=True,  stdout=subprocess.PIPE)
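while True:
    line = task.stdout.readline()   # str on Python 2, bytes on Python 3
    # an empty read means end of output; also check that the process has exited
    if not line and task.poll() is not None: break
    # process line here

rc = task.wait()   # wait for completion and get return code of the command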

Python 3 compatible code, with decoding of the output:

    line = task.stdout.readline().decode("latin-1")
    if len(line) == 0 and task.poll() is not None: break
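
To get three-line chunks from a process rather than single lines, the pipe can be consumed in groups. The following is a minimal sketch, assuming Python 3.7+ (for text=True) and using a small stand-in command instead of cat hugefile.log so that it runs anywhere; the helper name popen_line_chunks is hypothetical.

```python
import subprocess
import sys
from itertools import islice

def popen_line_chunks(cmd, lines_per_chunk=3):
    """Run cmd and yield lists of up to lines_per_chunk decoded output lines."""
    # text=True makes task.stdout yield str instead of bytes
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as task:
        while True:
            chunk = list(islice(task.stdout, lines_per_chunk))
            if not chunk:          # end of the process output
                break
            yield chunk
    # leaving the with-block closes the pipe and waits for the process

# Stand-in command for the demo (assumption): prints the numbers 0..6, one per line
demo_cmd = [sys.executable, "-c", "for i in range(7): print(i)"]
demo_chunks = list(popen_line_chunks(demo_cmd))
```

Passing the command as a list avoids shell=True. The total number of chunks is still unknown until the process has finished, which is why only the chunk size can be chosen in advance here.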

Now, if you want to split the file into a given number of chunks:

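You cannot do it with Popen, for an obvious reason: you would have to know the size of the output first.
If you have a file as input, you can do the following: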

Code:

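import os, sys
filename = "hugefile.log"
filesize = os.path.getsize(filename)   # size in bytes
nb_chunks = 1000
# ceiling division, so the remainder does not create an extra chunk;
# at least 1, otherwise read(0) returns "" and the loop stops immediately
chunksize = max(1, -(-filesize // nb_chunks))

with open(filename, "r") as f:
    while True:
        # in text mode on Python 3 this counts characters, not bytes
        chunk = f.read(chunksize)
        if chunk == "":
            break
        # do something useful with the chunk
        sys.stdout.write(chunk)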
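
If the chunk count has to match the byte size reported by os.path.getsize exactly, reading in binary mode avoids the character/byte mismatch. Below is a sketch under that assumption; the helper name iter_byte_chunks and the temporary demo file are made up for illustration.

```python
import os
import tempfile
from functools import partial

def iter_byte_chunks(filename, nb_chunks):
    """Yield at most nb_chunks byte blocks that together cover the whole file."""
    filesize = os.path.getsize(filename)
    # ceiling division, and never less than 1 byte per chunk
    chunksize = max(1, -(-filesize // nb_chunks))
    with open(filename, "rb") as f:
        # iter(callable, sentinel) calls f.read(chunksize) until it returns b""
        for chunk in iter(partial(f.read, chunksize), b""):
            yield chunk

# Demo on a small temporary file (hypothetical data, not hugefile.log)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
demo = list(iter_byte_chunks(tmp.name, 4))
os.remove(tmp.name)
```

Note that byte chunks can cut a line (or a multi-byte character) in half, so this suits cases where chunk boundaries do not need to fall on line ends.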

Tags: python, bash, shell, subprocess, chunking