Python: processing lines of a large document on the fly

Question

I have a file that looks a bit like this:

key1 value_1_1 value_1_2 value_1_3 etc
key2 value_2_1 value_2_2 value_2_3 etc
key3 value_3_1 value_3_2 value_3_3 etc
etc

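Each key is a string and each value is a float, all separated by spaces. Each line has hundreds of values associated with it, and there are hundreds of thousands of lines. Each line needs to be processed in a particular way, but because my program only needs the information from a small fraction of the lines, processing every line up front seems like a huge waste of time. Currently, I just keep a list of the unprocessed lines and maintain a separate list containing each key. When I need to access a line, I use the key list to find the index of the line I need, then process the line at that index in the list of lines. My program may look up the same line multiple times, which means redundantly processing the same line over and over again, but that still seems better than processing every single line right from the start.

My question is: is there a more efficient way to do what I am doing?

(Please let me know if I need to clarify anything.)

Thanks!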

Answer 1

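First, I would store your lines in a dict. This should make key-based lookups faster: a dict lookup takes constant time on average, whereas finding a key in a list is a linear scan. Building this dict can be as simple as d = dict(line.split(' ', 1) for line in file_obj). If, for example, the keys have a fixed width, you could speed this up further by slicing the lines instead of splitting them.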
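
As a minimal sketch of the two ways of building that dict (the sample data and the fixed key width of 4 are assumptions made up for illustration, not taken from the question's real file):

```python
from io import StringIO

# Hypothetical sample data; the real file has hundreds of values per line.
sample = StringIO(
    "key1 1.0 2.0 3.0\n"
    "key2 4.0 5.0 6.0\n"
)

# Split each line once, at the first space: key -> unprocessed remainder.
raw_by_key = dict(line.split(' ', 1) for line in sample)

# If every key has the same width (assumed to be 4 here), slicing avoids the split.
sample.seek(0)
KEY_WIDTH = 4
raw_by_slice = {line[:KEY_WIDTH]: line[KEY_WIDTH + 1:] for line in sample}
```

Both dicts map each key to the raw text after it, so the expensive parsing can still be deferred until a key is actually requested.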

Next, if processing a line is computationally heavy, you can cache the results. I solved this by subclassing dict:

class BufferedDict(dict):
    def __init__(self, file_obj):
        self.file_dict = dict(line.split(' ', 1) for line in file_obj)

    def __getitem__(self, key):
        if key not in self:
            self[key] = process_line(self.file_dict[key])
        return super(BufferedDict, self).__getitem__(key)

def process_line(line):
    """Your computationally heavy line processing function"""

This way, if you call my_buffered_dict[key], the line is processed only if the processed version is not yet available.
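
The same caching idea can also be sketched with dict's __missing__ hook, which dict.__getitem__ calls only when a key is absent. This is an alternative formulation rather than the code above; LazyLineDict and parse_floats are hypothetical names, and parse_floats merely stands in for the real, expensive processing step.

```python
from io import StringIO


class LazyLineDict(dict):
    """Caches processed lines; raw lines are kept separately until needed."""

    def __init__(self, file_obj, process):
        super().__init__()
        # key -> raw, unprocessed remainder of the line
        self._raw = dict(line.split(' ', 1) for line in file_obj)
        self._process = process

    def __missing__(self, key):
        # Reached only when the key has no cached, processed value yet.
        value = self[key] = self._process(self._raw[key])
        return value


def parse_floats(text):
    """Stand-in for the expensive processing step (an assumption)."""
    return [float(v) for v in text.split()]


lazy = LazyLineDict(StringIO("a 1 2 3\nb 4 5 6\n"), parse_floats)
first = lazy['a']   # processed on first access
second = lazy['a']  # served from the cache
```

Because only processed values are stored in the dict itself, key in lazy tells you whether a line has already been processed, and an unknown key still raises KeyError from the raw lookup.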

Answer 2

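Here is a class that scans the file and simply caches the file offset of each line. A line is processed only when its key is accessed, and __getitem__ caches the processed line.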

class DataFileDict:
    def __init__(self, datafile):
        self._index = {}
        self._file = datafile

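        # build index of key-file offsets
        # readline() is used instead of "for line in file" because real
        # text files disable tell() while they are being iterated
        loc = self._file.tell()
        for line in iter(self._file.readline, ""):
            key = line.split(None, 1)[0]
            self._index[key] = loc
            loc = self._file.tell()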

    def __getitem__(self, key):
        retval = self._index[key]
        if isinstance(retval, int):
            self._file.seek(retval)
            line = self._file.readline()
            retval = self._index[key] = list(map(float, line.split()[1:]))
            print("read and return value for {} from file".format(key))
        else:
            print("returning cached value for {}".format(key))
        return retval

if __name__ == "__main__":
    from io import StringIO

    sample = StringIO("""\
A 1 2 3 4 5
B 6 7 8 9 10
C 5 6 7 8 1 2 3 4 5 6 7
""")

    reader = DataFileDict(sample)
    print(reader['A'])
    print(reader['B'])
    print(reader['A'])
    print(reader['C'])
    print(reader['D'])  # KeyError

Prints:

read and return value for A from file
[1.0, 2.0, 3.0, 4.0, 5.0]
read and return value for B from file
[6.0, 7.0, 8.0, 9.0, 10.0]
returning cached value for A
[1.0, 2.0, 3.0, 4.0, 5.0]
read and return value for C from file
[5.0, 6.0, 7.0, 8.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Traceback (most recent call last):
  File "C:/Users/ptmcg/.PyCharm2017.1/config/scratches/scratch.py", line 64, in <module>
    print(reader['D'])  # KeyError
  File "C:/Users/ptmcg/.PyCharm2017.1/config/scratches/scratch.py", line 28, in __getitem__
    retval = self._index[key]
KeyError: 'D'
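
The demo above runs against a StringIO. For a file on disk, Python 3 text-mode files refuse tell() while they are being iterated with a for loop, so one alternative is to open the file in binary mode and count bytes yourself. The following is a rough sketch under that assumption; build_offset_index and read_values are hypothetical helper names, and UTF-8 encoding is assumed.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each key to the byte offset where its line starts."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            parts = line.split(None, 1)
            if parts:  # skip blank lines
                index[parts[0].decode("utf-8")] = offset
            offset += len(line)
    return index


def read_values(path, index, key):
    """Seek to the stored offset and parse only that one line."""
    with open(path, "rb") as f:
        f.seek(index[key])
        line = f.readline().decode("utf-8")
    return [float(v) for v in line.split()[1:]]


# Write a small hypothetical data file and look up one key.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("A 1 2 3\nB 4 5 6\n")
index = build_offset_index(tmp.name)
values_b = read_values(tmp.name, index, "B")
os.remove(tmp.name)
```

Only the keys and integer offsets stay in memory; the values are read from disk when requested, and the result could be cached in the same way __getitem__ does above.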
