In Python (SageMath 9.0) - text file of 1B lines - optimal way to read from a specific line

Question

I'm running SageMath 9.0 on Windows 10.

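I've read several similar questions (and answers) on this site, mainly this one on reading from the 7th line, and this one on optimization. But I have some specific questions: I need to understand how to read optimally from a specific (possibly very distant) line, and whether I should read line by line or whether reading in blocks would be "more optimal" in my case.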

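I have a 12 GB text file made of roughly 1 billion short lines, all consisting of printable ASCII characters. Every line has the same fixed number of characters. Here are the actual first 5 lines: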

J??????????
J???????C??
J???????E??
J??????_A??
J???????F??
...

For context, this file is a list of all non-isomorphic graphs on 11 vertices, encoded in graph6 format. The file was computed and made available by Brendan McKay on his webpage here.

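I need to check some property of every graph. I could use the generator for G in graphs(11), but that can take a very long time (at least a few days on my laptop). I want to use the complete database in the file so that I can stop and restart from a given point.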

My current code reads the file line by line from the beginning and does some computation after reading each line:

with open(filename,'r') as file:
    while True: 
        # Get next line from file 
        line = file.readline() 

        # if line is empty, end of file is reached 
        if not line: 
            print("End of Database Reached")
            break  
        
        G = Graph()
        from_graph6(G,line.strip())

        run_some_code(G)

To be able to stop the code, or to save progress in case of a crash, I was thinking of:

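Saving the progress to a dedicated file every million lines read (or so)
When restarting the code, reading the last saved value and, instead of using line = file.readline(), using the itertools option: for line in islice(file, start_line, None).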

So my new code is

 from itertools import islice
 start_line = load('foo')
 count = start_line 
 save_every_n_lines = 1000000


 with open(filename,'r') as file:
     for line in islice(file, start_line, None):
         G = Graph()
         from_graph6(G,line.strip())

         run_some_code(G)
         count +=1

         if (count % save_every_n_lines )==0:
             save(count,'foo')

The code does work, but I would like to understand whether it can be optimized. I'm not a big fan of the if statement inside my for loop.

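Is itertools.islice() a good option here? The documentation states "If start is non-zero, then elements from the iterable are skipped until start is reached". Since "start" could be very large, and given that I'm working with a simple text file, is there a faster option that would "jump" directly to the start line?
Knowing that the text file is fixed, would it be more optimal to split the actual file into 100 or 1000 smaller files and read them one by one? This would get rid of the if statement in my for loop.
I also have the option of reading a block of lines in one go instead of line by line, and then working on a list of graphs. Could that be a good option?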

Every line has a fixed number of characters, so "jumping" might be feasible.

Answer 1

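Assuming each line is the same size, you can use a memory-mapped file to read it by index without having to use seek and tell. The memory-mapped file emulates a bytearray, and you can take record-sized slices from it to get the data you want. If you want to pause processing, you only have to save the current record index into the array, and you can start up again from that index later.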

This example is on Linux. Opening an mmap on Windows is a bit different, but once it is set up, access should be the same.

import os
import mmap

# I think this is the record plus newline
LINE_SZ = 12
RECORD_SZ = LINE_SZ - 1 

# generate test file
testdata = "testdata.txt"
with open(testdata, 'wb') as f:
    for i in range(100):
        f.write("R{: 10}\n".format(i).encode('ascii'))

f = open(testdata, 'rb')
data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# the i-th record is
i = 20
record = data[i*LINE_SZ:i*LINE_SZ+RECORD_SZ] 
print("record 20", record)

# you can stick it in a function. this is a bit slower, but encapsulated
def get_record(mmapped_file, index):
    # slice by the index argument, not by the global i
    return mmapped_file[index*LINE_SZ:index*LINE_SZ+RECORD_SZ]

print("get record 20", get_record(data, 20))

# to enumerate
def enum_records(mmapped_file, start, stop=None, step=1):
    if stop is None:
        # integer division: range() only accepts ints
        stop = mmapped_file.size()//LINE_SZ
    for pos in range(start*LINE_SZ, stop*LINE_SZ, step*LINE_SZ):
        yield mmapped_file[pos:pos+RECORD_SZ]

print("enum 6 to 8", [record for record in enum_records(data,6,9)])

del data
f.close()
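
To connect the memory-mapped approach with the stop-and-restart requirement from the question, here is a minimal sketch in plain Python. It assumes 12-byte lines (11 characters plus a single "\n"; a file with Windows "\r\n" line endings would need 13). The names load_checkpoint, save_checkpoint, process_from_checkpoint and the handle callback are hypothetical, and the checkpoint is stored as plain text rather than with Sage's save/load. Processing in batches moves the "save every N lines" step out of the inner loop, so no if statement is needed there.

```python
import mmap
import os

LINE_SZ = 12               # assumption: 11-character record plus a 1-byte "\n"
RECORD_SZ = LINE_SZ - 1

def load_checkpoint(path):
    """Return the saved record index, or 0 if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0

def save_checkpoint(path, index):
    """Write the index to a temporary file, then rename it over the old one."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(index))
    os.replace(tmp, path)

def process_from_checkpoint(data_path, ckpt_path, handle, save_every=1000000):
    """Call handle(record) for every record from the checkpoint onward.

    Returns the number of records handled in this run.
    """
    start = load_checkpoint(ckpt_path)
    with open(data_path, "rb") as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            total = data.size() // LINE_SZ
            # Outer loop walks over batches; the checkpoint is written once
            # per batch, after every record in that batch has been handled.
            for batch_start in range(start, total, save_every):
                batch_end = min(batch_start + save_every, total)
                for index in range(batch_start, batch_end):
                    pos = index * LINE_SZ
                    handle(data[pos:pos + RECORD_SZ].decode("ascii"))
                save_checkpoint(ckpt_path, batch_end)
        finally:
            data.close()
    return total - start
```

In the question's setting, handle would be a small function that builds the graph from the graph6 string and calls run_some_code. If the run is interrupted, at most one batch of records is processed again on restart.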

Answer 2

If the line length is constant (in this case it's 12: 11 characters plus the end-of-line character), you might do

def get_line(k, line_len):
    # binary mode: byte offsets are exact and newlines are not translated
    with open('file', 'rb') as f:
        f.seek(k*line_len)
        return f.readline().decode('ascii')
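
Reopening the file for every line would be slow over a billion lines, so a natural extension of the idea above is to seek once and then keep reading from that position. The sketch below is only an illustration: iter_lines_from and iter_blocks are hypothetical names, and the default line_len of 12 assumes "\n" line endings (use 13 if the file has "\r\n"). The second function shows the "read a block of lines at once" variant from the question. Whether it is actually faster than line-by-line iteration would need to be measured, since Python's file objects already buffer reads.

```python
def iter_lines_from(path, start_line, line_len=12):
    """Yield decoded lines starting at start_line (0-based).

    Assumes every line occupies exactly line_len bytes, newline included.
    """
    with open(path, "rb") as f:
        # One jump to the byte offset instead of skipping start_line lines.
        f.seek(start_line * line_len)
        for raw in f:
            yield raw.rstrip(b"\r\n").decode("ascii")

def iter_blocks(path, start_line, lines_per_block, line_len=12):
    """Yield lists of up to lines_per_block decoded lines from start_line on."""
    with open(path, "rb") as f:
        f.seek(start_line * line_len)
        while True:
            block = f.read(lines_per_block * line_len)
            if not block:
                break
            yield block.decode("ascii").splitlines()
```

With this, for line in iter_lines_from(filename, start_line) could stand in for the islice(file, start_line, None) loop in the question, without reading the skipped lines.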


python

optimization

sage