Python - working with uneven columns in rows


AB  12   43   54

DM  33   41   45   56   33   77  88

MO  88   55   66   32   34 

KL  10   90   87   47   23  48  56  12



I'm sure Python has a way to do this. Can anyone help me?


from __future__ import print_function
import numpy as np
from itertools import chain

data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''

max_row_len = max(len(line.split()) for line in data.splitlines())

def padded_lines():
    for uneven_line in data.splitlines():
        line = uneven_line.split()
        line += ['0']*(max_row_len - len(line))
        yield line

# I will get back to the line below shortly, it unnecessarily creates the array
# twice in memory:
array = np.array(list(chain.from_iterable(padded_lines())), np.dtype(object))

array.shape = (-1, max_row_len)
print(array)



[['AB' '12' '43' '54' '0' '0' '0' '0' '0']
 ['DM' '33' '41' '45' '56' '33' '77' '88' '0']
 ['MO' '88' '55' '66' '32' '34' '0' '0' '0']
 ['KL' '10' '90' '87' '47' '23' '48' '56' '12']]


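However, numpy arrays are supposed to be homogeneous, and you want to put strings (the first column) and integers (all the other columns) into the same 2D array. I still think you are on the wrong track and should rethink the problem: either choose another data structure or organize your data differently. I cannot help you with that, because I don't know how you want to use the data.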
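Purely as an illustration of "another data structure", here is a minimal sketch that keeps the labels and the numbers apart. It assumes the first field of each row is a unique label and all remaining fields are integers; `parse_rows` is a hypothetical helper name, not something from the question.

```python
# Sketch only: map each row label to its own list of integers, so rows may
# have different lengths and no padding value is needed.
data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''

def parse_rows(text):
    """Return {label: [int, ...]} for every non-blank line of text."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:          # skip blank lines
            continue
        # Assumption: labels are unique; a repeated label would overwrite.
        rows[fields[0]] = [int(x) for x in fields[1:]]
    return rows

rows = parse_rows(data)
print(rows['MO'])   # [88, 55, 66, 32, 34]
```

Whether this fits depends entirely on how the data is used afterwards; if rectangular numeric operations are needed, the padded array may still be the better choice.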



    def main():

        with open('/tmp/input.txt') as f:
            max_row_len = max(len(line.split()) for line in f)

        with open('/tmp/input.txt') as f:
            str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))

        def padded_lines():
            with open('/tmp/input.txt') as f:
                for uneven_line in f:
                    line = uneven_line.split()
                    line += ['0']*(max_row_len - len(line))
                    yield line

        fmt = '|S%d' % str_len_max
        array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
        array.shape = (-1, max_row_len)


Memory consumption, measured with memory_profiler on a randomly generated input file with 1000000 lines and line lengths uniformly distributed between 1 and 20:

Line #    Mem usage    Increment   Line Contents
     5   23.727 MiB    0.000 MiB   @profile
     6                             def main():
     8   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
     9   23.727 MiB    0.000 MiB           max_row_len = max(len(line.split()) for line in f)
    11   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
    12   23.727 MiB    0.000 MiB           str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))
    14   23.727 MiB    0.000 MiB       def padded_lines():
    15                                     with open('/tmp/input.txt') as f:
    16   62.000 MiB   38.273 MiB               for uneven_line in f:
    17                                             line = uneven_line.split()
    18                                             line += ['0']*(max_row_len - len(line))
    19                                             yield line
    21   23.727 MiB  -38.273 MiB       fmt = '|S%d' % str_len_max
    22                                 array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
    23   62.004 MiB   38.277 MiB       array.shape = (-1, max_row_len)
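For anyone who wants to repeat the measurement, an input file of the kind described above could be generated roughly as follows. This is a sketch under assumptions: "line length" is taken to mean the number of whitespace-separated fields, the field values (integers 0 to 99) are made up, and `write_random_input` is a hypothetical helper, not the script that was actually used.

```python
import os
import random
import tempfile

def write_random_input(path, n_lines, max_fields=20, seed=0):
    """Write n_lines lines, each with 1..max_fields random integer fields."""
    rng = random.Random(seed)   # fixed seed so that runs are repeatable
    with open(path, 'w') as f:
        for _ in range(n_lines):
            n_fields = rng.randint(1, max_fields)   # inclusive on both ends
            fields = (str(rng.randint(0, 99)) for _ in range(n_fields))
            f.write(' '.join(fields) + '\n')

# Small demo in a temporary directory; the measurement above used
# 1000000 lines written to /tmp/input.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'input.txt')
write_random_input(demo_path, 1000)
with open(demo_path) as f:
    lengths = [len(line.split()) for line in f]
print(len(lengths), min(lengths), max(lengths))
```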

With the code from eumiro's answer, using the same input file:

Line #    Mem usage    Increment   Line Contents
     5   23.719 MiB    0.000 MiB   @profile
     6                             def main():
     7   23.719 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
     8  638.207 MiB  614.488 MiB           arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T

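Comparing the memory consumption increments: my updated code uses about 16 times less memory than eumiro's (614.488/38.273 is approximately 16).

As for speed: my updated code took 3.321 seconds on this input and eumiro's code took 5.687 seconds, so my code is about 1.7 times faster on my machine. (Your mileage may vary.)

If efficiency is your primary concern (as suggested by your comment "Hi eumiro, I suppose this is more efficient." followed by changing the accepted answer), then I'm afraid you accepted the less efficient solution.

Don't get me wrong, eumiro's code is really concise and I certainly learned a lot from it. If efficiency were not my primary concern, I would go with eumiro's solution too.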

You can use itertools.izip_longest, which will find the longest line for you:

import itertools as it
import numpy as np

with open('filename.txt') as f:
    arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T

arr is now:

array([['a', '1', '2', '0'],
       ['b', '3', '4', '5'],
       ['c', '6', '0', '0']], 
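The code above is Python 2. On Python 3 the same function is called itertools.zip_longest. Here is a minimal sketch of the equivalent, reading from an in-memory list instead of a file; the three sample lines are an assumption, chosen so that the result matches the output shown above.

```python
from itertools import zip_longest

import numpy as np

# Hypothetical sample input standing in for the contents of filename.txt.
lines = ['a 1 2', 'b 3 4 5', 'c 6']

# zip_longest transposes the rows into columns and pads short rows with '0';
# .T transposes back, so each original line becomes one row of the array.
columns = list(zip_longest(*[line.split() for line in lines], fillvalue='0'))
arr = np.array(columns).T
print(arr)
```

Note that every element is still a string, including the padding value, so a numeric conversion of the non-label columns would be a separate step.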