Python

Question

I'm working with data that has several thousand rows, but the columns are uneven, as shown below:

AB  12   43   54

DM  33   41   45   56   33   77  88

MO  88   55   66   32   34 

KL  10   90   87   47   23  48  56  12

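First, I want to read the data into a list or array and find the length of the longest row.
Then I'll pad the shorter rows with zeros so that they match the longest row, which lets me iterate over them as a 2D array.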

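I've tried the approaches from several similar questions, but couldn't solve the problem.

I'm sure there's a way to do this in Python. Can anyone help?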

Answer 1

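I don't see any simpler way to find the maximum row length than to make one pass over the data and compute it. A second pass then builds the 2D array. Something like: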

from __future__ import print_function
import numpy as np
from itertools import chain

data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''

max_row_len = max(len(line.split()) for line in data.splitlines())

def padded_lines():
    for uneven_line in data.splitlines():
        line = uneven_line.split()
        line += ['0']*(max_row_len - len(line))
        yield line

# I will get back to the line below shortly, it unnecessarily creates the array
# twice in memory:
array = np.array(list(chain.from_iterable(padded_lines())), np.dtype(object))

array.shape = (-1, max_row_len)

print(array)

This prints:

[['AB' '12' '43' '54' '0' '0' '0' '0' '0']
 ['DM' '33' '41' '45' '56' '33' '77' '88' '0']
 ['MO' '88' '55' '66' '32' '34' '0' '0' '0']
 ['KL' '10' '90' '87' '47' '23' '48' '56' '12']]

The code above is inefficient in the sense that it creates the array twice in memory. I'll come back to this; I think I can fix it.

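However, numpy arrays are supposed to be homogeneous, and you want to put strings (the first column) and integers (all the other columns) in the same 2D array. I still think you're on the wrong track and should rethink the problem: either choose a different data structure or organize the data differently. I can't help you with that, because I don't know how you want to use the data.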
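To illustrate what "organizing the data differently" could look like, here is a minimal sketch that keeps the row labels in a plain list and only the numbers in a homogeneous integer array. It assumes the first token of every row is a label and all remaining tokens are integers; split_labels_and_values is a hypothetical helper name, not something from the question.

```python
import numpy as np

data = '''AB 12 43 54
DM 33 41 45 56 33 77 88
MO 88 55 66 32 34
KL 10 90 87 47 23 48 56 12'''

def split_labels_and_values(text):
    """Return (labels, values): a list of row labels and a zero-padded int array.

    Assumption: the first token of each row is a label, the rest are integers.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    labels = [row[0] for row in rows]
    # Width of the numeric part of the longest row.
    width = max(len(row) - 1 for row in rows)
    # Start from zeros so that short rows are padded automatically.
    values = np.zeros((len(rows), width), dtype=int)
    for i, row in enumerate(rows):
        numbers = [int(tok) for tok in row[1:]]
        values[i, :len(numbers)] = numbers
    return labels, values

labels, values = split_labels_and_values(data)
print(labels)
print(values)
```

Whether this layout is appropriate depends on how the data is used afterwards, which the question doesn't say.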

(I'll come back to the issue of the array being created twice later.)

As promised, here is a solution to the efficiency problem. Note that my concern was memory consumption.

    def main():

        with open('/tmp/input.txt') as f:
            max_row_len = max(len(line.split()) for line in f)

        with open('/tmp/input.txt') as f:
            str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))

        def padded_lines():
            with open('/tmp/input.txt') as f:
                for uneven_line in f:
                    line = uneven_line.split()
                    line += ['0']*(max_row_len - len(line))
                    yield line

        fmt = '|S%d' % str_len_max
        array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
        # Reshape the flat array into rows of max_row_len columns.
        array.shape = (-1, max_row_len)

This code could be made nicer, but I'll leave that to you.
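One possible tidy-up, offered only as a sketch and not necessarily what was meant above: the two scanning passes that compute max_row_len and str_len_max could be merged into a single pass. scan_dimensions is a hypothetical helper name; it accepts any iterable of lines, so an open file object would work as well as the in-memory sample used here.

```python
def scan_dimensions(lines):
    """Return (max_row_len, str_len_max) from a single pass over the lines."""
    max_row_len = 0
    str_len_max = 0
    for line in lines:
        tokens = line.split()
        # Longest row seen so far, counted in tokens.
        max_row_len = max(max_row_len, len(tokens))
        # Longest single token seen so far, counted in characters.
        for token in tokens:
            str_len_max = max(str_len_max, len(token))
    return max_row_len, str_len_max

sample = [
    'AB 12 43 54\n',
    'DM 33 41 45 56 33 77 88\n',
    'MO 88 55 66 32 34\n',
    'KL 10 90 87 47 23 48 56 12\n',
]
print(scan_dimensions(sample))  # (9, 2)
```

This still reads the input once before the padding pass, so the file is read twice in total instead of three times.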

Memory consumption, measured with memory_profiler on a randomly generated input file of 1,000,000 lines whose row lengths are uniformly distributed between 1 and 20:

Line #    Mem usage    Increment   Line Contents
================================================
     5   23.727 MiB    0.000 MiB   @profile
     6                             def main():
     7                                 
     8   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
     9   23.727 MiB    0.000 MiB           max_row_len = max(len(line.split()) for line in f)
    10                                     
    11   23.727 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
    12   23.727 MiB    0.000 MiB           str_len_max = len(max(chain.from_iterable(line.split() for line in f), key=len))
    13                                 
    14   23.727 MiB    0.000 MiB       def padded_lines():
    15                                     with open('/tmp/input.txt') as f:
    16   62.000 MiB   38.273 MiB               for uneven_line in f:
    17                                             line = uneven_line.split()
    18                                             line += ['0']*(max_row_len - len(line))
    19                                             yield line
    20                                 
    21   23.727 MiB  -38.273 MiB       fmt = '|S%d' % str_len_max
    22                                 array = np.fromiter(chain.from_iterable(padded_lines()), np.dtype(fmt))
    23   62.004 MiB   38.277 MiB       array.shape = (-1, max_row_len)

With the code from eumiro's answer, on the same input file:

Line #    Mem usage    Increment   Line Contents
================================================
     5   23.719 MiB    0.000 MiB   @profile
     6                             def main():
     7   23.719 MiB    0.000 MiB       with open('/tmp/input.txt') as f:
     8  638.207 MiB  614.488 MiB           arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T

Comparing the memory consumption increments: my updated code uses about 16 times less memory than eumiro's (614.488/38.273 is approximately 16).

As for speed: my updated code ran for 3.321 s on this input and eumiro's code ran for 5.687 s, so mine is about 1.7 times faster on my machine. (Your mileage may vary.)

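If efficiency is your main concern (as your comment "Hi eumiro, I suppose this is more efficient." and the subsequent change of the accepted answer suggest), then I'm afraid you accepted the less efficient solution.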

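Don't get me wrong: eumiro's code is really concise, and I certainly learned a lot from it. If efficiency weren't my main concern, I would go with eumiro's solution too.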

Answer 2

You can use itertools.izip_longest, which finds the longest row for you:

import itertools as it
import numpy as np

with open('filename.txt') as f:
    arr = np.array(list(it.izip_longest(*[line.split() for line in f], fillvalue='0'))).T

arr is now:

array([['a', '1', '2', '0'],
       ['b', '3', '4', '5'],
       ['c', '6', '0', '0']], 
      dtype='|S1')
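The snippet above is Python 2: izip_longest was renamed to zip_longest in Python 3. A minimal Python 3 sketch of the same idea follows; it uses an in-memory list of lines instead of a file so that it is self-contained, and the three sample lines are assumed from the output shown above.

```python
from itertools import zip_longest
import numpy as np

# Sample input assumed from the output shown above.
lines = ['a 1 2', 'b 3 4 5', 'c 6']

# zip_longest pairs up the i-th token of every row and pads short rows with
# '0', so it yields columns; transposing the result gives the rows back.
columns = zip_longest(*[line.split() for line in lines], fillvalue='0')
arr = np.array(list(columns)).T
print(arr)
```

In Python 3 the resulting array holds unicode strings, so its dtype is reported as '<U1' rather than '|S1'.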

Python - working with uneven columns in rows

numpy

genfromtxt