How to set the current position of `mmap.mmap.seek(pos)` to the beginning of any Nth line of a text file?

Question

I am trying to use mmap to read some lines from a csv file that is too large to fit in memory.

This is what my csv file looks like [I have put each row on its own line for readability]:

'','InputText',101,102,103,104,105,106,107,108,109,110\n
0,'abcde efgh ijkl mnop',1,0,0,0,0,1,1,0,0,0\n
1,'qwerty uiop asdf',1,0,0,1,0,0,0,0,0,0\n
2,'zxcv',0,1,1,0,0,0,0,1,0,0\n
3,'qazxswedc vfrtgbnhy nhyummjikkig jhguopjfservcs fdtuugdsae dsawruoh',1,0,0,0,0,1,1,1,0,0\n
4,'plmnkoijb vhuygcxf tr r mhjease',1,0,0,0,0,0,0,0,0,1\n

This is what I have done so far:

# imports
import mmap
import os

# open the file buffer
fbuff = open("big_file.csv", mode="r", encoding="utf8")
# now read that file buffer to mmap
f1_mmap = mmap.mmap(fbuff.fileno(), length=os.path.getsize("big_file.csv"),
                      access=mmap.ACCESS_READ, offset=0)

After mapping the file with mmap.mmap(), this is how I tried to read a line, as described in the Python 3.7 docs:

# according to python docs: https://docs.python.org/3.7/library/mmap.html#mmap.mmap.seek
# this mmap.mmap.seek need to be set to the byte position in the file
# and when I set it to 0th position(beginning of file) like below, readline() would print entire line till '\n'
f1_mmap.seek(0)
f1_mmap.readline()

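If I want to read the 102,457th line of the file, I need to find the starting byte position of that line and pass it to mmap.mmap.seek(pos=<this-position>). How do I find the position of any given line in my text file?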

Answer 1

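Here is how to build an index consisting of a list of the offsets at which each line of the file starts, and then how to use it to read arbitrary lines, as well as arbitrary rows, of a memory-mapped CSV file: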

import csv
from io import StringIO
import mmap

my_csv_dialect = dict(delimiter=',', quotechar="'")
filepath = 'big_file.csv'

# Build list of offsets where each line of file starts.
fbuff = open(filepath, mode='r', encoding='utf8')
f1_mmap = mmap.mmap(fbuff.fileno(), 0, access=mmap.ACCESS_READ)

print('Index:')
offsets = [0]  # First line is always at offset 0.
for line_no, line in enumerate(iter(f1_mmap.readline, b'')):
    offsets.append(f1_mmap.tell())  # Append where *next* line would start.
    print(f'{line_no} ({offsets[line_no]:3d}) {line!r}')
print()

# Access arbitrary lines in the memory-mapped file.
print('Line access:')
for line_no in (3, 1, 5):
    f1_mmap.seek(offsets[line_no])
    line = f1_mmap.readline()
    print(f'{line_no}: {line!r}')
print()

# Access arbitrary rows of memory-mapped csv file.
print('CSV row access:')
for line_no in (3, 1, 5):
    f1_mmap.seek(offsets[line_no])
    line = f1_mmap.readline()
    b = StringIO(line.decode())
    r = csv.reader(b, **my_csv_dialect)
    values = next(r)
    print(f'{line_no}: {values}')

f1_mmap.close()
fbuff.close()

Printed results:

Index:
0 (  0) b"'','InputText',101,102,103,104,105,106,107,108,109,110\r\n"
1 ( 56) b"0,'abcde efgh ijkl mnop',1,0,0,0,0,1,1,0,0,0\r\n"
2 (102) b"1,'qwerty uiop asdf',1,0,0,1,0,0,0,0,0,0\r\n"
3 (144) b"2,'zxcv',0,1,1,0,0,0,0,1,0,0\r\n"
4 (174) b"3,'qazxswedc vfrtgbnhy nhyummjikkig jhguopjfservcs fdtuugdsae dsawruoh',1,0,0,0,0,1,1,1,0,0\r\n"
5 (267) b"4,'plmnkoijb vhuygcxf tr r mhjease',1,0,0,0,0,0,0,0,0,1\r\n"

Line access:
3: b"2,'zxcv',0,1,1,0,0,0,0,1,0,0\r\n"
1: b"0,'abcde efgh ijkl mnop',1,0,0,0,0,1,1,0,0,0\r\n"
5: b"4,'plmnkoijb vhuygcxf tr r mhjease',1,0,0,0,0,0,0,0,0,1\r\n"

CSV row access:
3: ['2', 'zxcv', '0', '1', '1', '0', '0', '0', '0', '1', '0', '0']
1: ['0', 'abcde efgh ijkl mnop', '1', '0', '0', '0', '0', '1', '1', '0', '0', '0']
5: ['4', 'plmnkoijb vhuygcxf tr r mhjease', '1', '0', '0', '0', '0', '0', '0', '0', '0', '1']
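
For a file with millions of lines, a plain Python list of offsets can itself use a noticeable amount of memory. One possible variation on the index above, offered only as a sketch, stores the offsets in a compact `array('Q')` and locates line breaks with `mmap.find` instead of calling `readline()` for every line. The helper names `build_line_offsets` and `read_line` are made up for illustration, the temporary file stands in for `big_file.csv`, and the code assumes that no quoted field contains an embedded newline.

```python
import mmap
import os
import tempfile
from array import array


def build_line_offsets(mm):
    """Return an array of byte offsets, one for the start of each line."""
    offsets = array('Q', [0])  # Unsigned 64-bit ints; first line starts at 0.
    pos = mm.find(b'\n')
    while pos != -1:
        offsets.append(pos + 1)  # Next line starts right after the newline.
        pos = mm.find(b'\n', pos + 1)
    if offsets[-1] >= len(mm):
        offsets.pop()  # File ends with a newline, so no line starts at EOF.
    return offsets


def read_line(mm, offsets, line_no):
    """Seek to the start of line `line_no` (0-based) and return it as bytes."""
    mm.seek(offsets[line_no])
    return mm.readline()


# Demo on a small temporary file (a stand-in for big_file.csv).
data = b"id,text\n0,'abc'\n1,'def ghi'\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)
path = tmp.name
try:
    with open(path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        offsets = build_line_offsets(mm)
        third = read_line(mm, offsets, 2)
finally:
    os.remove(path)

print(list(offsets))  # [0, 8, 16]
print(third)          # b"1,'def ghi'\n"
```

With this layout, reading line N costs one `seek` and one `readline`, and the index takes 8 bytes per line. The file is opened in binary mode here because mmap works on bytes either way; decode each line before handing it to `csv.reader`, as in the code above.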


python

csv

mmap

python-3.x