
Optimal generation of numpy associative array with subarrays and variable length

I have Python-generated data of the form

fa      fb      fc
fa1     fb1     [fc01, fc02,..., fc0m]
fa2     fb2     [fc11, fc12,..., fc1m]
...     ...     ...
fan     fbn     [fcn1, fcn2,..., fcnm]

I need to create a Python-compatible data structure to store it, one that keeps creation as simple as possible while minimizing memory usage and read/write time. I need to be able to identify columns by field name (i.e. retrieve fa1 with something like data['fa'][0]). The fa values are integers; fb and fc are floats. Both m and n are unknown before run time, but they are known before the data is inserted into the structure and do not change. m will not exceed 1000, and n will not exceed 10000. The data is generated one row at a time.

So far I have been using a numpy structured array asar with dtype=[('f0','i2'), ('f1','f8'), ('f2','f8',(m,))]. However, since I can't add a new row to a numpy array without deleting and recreating it, I have been keeping a separate counter variable ind_n, creating asar with asar = numpy.zeros(n, dtype=dtype), overwriting the zeros at asar[ind_n] with the data to be added, and incrementing ind_n until it reaches n. This works, but it seems like there must be a better solution (or at least one that lets me eliminate ind_n). Is there a standard way to create a skeleton of asar (perhaps with something like np.zeros()) and then insert each row of data into the first all-zero row? Or a way to convert a standard nested Python list into a structured array once the nested list has been fully built? (I know this conversion can certainly be done, but I run into problems converting the subarrays (e.g. ValueError: setting an array element with a sequence.).)
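For reference, here is a minimal, runnable version of what I have so far (the field names, n, m, and the sample values are placeholders):

```python
import numpy as np

# Sketch of the preallocate-and-fill approach described above.
n, m = 4, 3
dtype = [('fa', 'i2'), ('fb', 'f8'), ('fc', 'f8', (m,))]

asar = np.zeros(n, dtype=dtype)
ind_n = 0
for row in [(i, 0.5 * i, [0.1] * m) for i in range(n)]:
    asar[ind_n] = row   # each row must be a tuple, not a list
    ind_n += 1

print(asar['fa'][0])    # columns are addressable by field name
```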

Your solution is basically correct, provided you know n by the time you create the first record.

You can use np.empty instead of np.zeros to save a little (but not much) time.

If you find ind_n inelegant, you can create an array iterator instead.

>>> import numpy as np
>>> m = 5
>>> n = 7
>>> dt = [('col1', 'i2'), ('col2', float), ('col3', float, (m,))]
>>> data = [(np.random.randint(10), np.random.random(), np.random.random((m,))) for _ in range(n)]
>>> 
>>> rec = np.empty((n,), dt)
>>> irec = np.nditer(rec, op_flags=[['readwrite']], flags=['c_index'])
>>> 
>>> for src in data:
...     # roughly equivalent to list.append:
...     next(irec)[()] = src
...     print()
...     # getting the currently valid part:
...     print(irec.operands[0][:irec.index+1])
... 

[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])]

[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
 (6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])]

[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
 (6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
 (3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])]

[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
 (6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
 (3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
 (2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])]

[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
 (6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
 (3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
 (2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
 (1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])]

[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
 (6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
 (3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
 (2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
 (1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])
 (0, 0.45797521, [0.79193395, 0.69029592, 0.0541346 , 0.49603146, 0.36146384])]

[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
 (6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
 (3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
 (2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
 (1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])
 (0, 0.45797521, [0.79193395, 0.69029592, 0.0541346 , 0.49603146, 0.36146384])
 (6, 0.85225039, [0.62028917, 0.4895316 , 0.00922578, 0.66836154, 0.53082779])]
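If you prefer a plainer alternative to nditer, the same idea can be wrapped in a small helper class that hides the counter behind an append()-style interface. This RecordFiller is a hypothetical sketch, not a numpy API:

```python
import numpy as np

class RecordFiller:
    """Preallocates a structured array and exposes append(),
    hiding the fill counter."""
    def __init__(self, n, dtype):
        self._arr = np.empty(n, dtype=dtype)
        self._i = 0

    def append(self, row):
        # row must be a tuple matching the structured dtype
        self._arr[self._i] = row
        self._i += 1

    @property
    def data(self):
        # the currently valid (filled) part,
        # like irec.operands[0][:irec.index+1] above
        return self._arr[:self._i]

m = 3
filler = RecordFiller(5, [('col1', 'i2'), ('col2', float), ('col3', float, (m,))])
filler.append((1, 0.5, [0.1] * m))
filler.append((2, 0.25, [0.2] * m))
print(filler.data['col1'])
```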
In [39]: n, m = 5, 3
In [41]: dt=np.dtype([('f0','i2'), ('f1','f8'), ('f2', 'f8', (m))])

In [45]: asar = np.zeros(n, dt)
In [46]: asar
Out[46]: 
array([(0, 0., [0., 0., 0.]), (0, 0., [0., 0., 0.]),
       (0, 0., [0., 0., 0.]), (0, 0., [0., 0., 0.]),
       (0, 0., [0., 0., 0.])],
      dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])

Fill it field by field:

In [49]: asar['f0'] = np.arange(5)
In [50]: asar['f1'] = np.random.rand(5)
In [51]: asar['f2'] = np.random.rand(5,3)
In [52]: asar
Out[52]: 
array([(0, 0.45120412, [0.86481761, 0.08861093, 0.42212446]),
       (1, 0.63926708, [0.43788684, 0.89254029, 0.90637292]),
       (2, 0.33844457, [0.80352251, 0.25411018, 0.315124  ]),
       (3, 0.24271258, [0.27849709, 0.9905879 , 0.94155558]),
       (4, 0.89239324, [0.1580938 , 0.52844036, 0.59092695])],
      dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])

Or generate a list with matching nesting:

In [53]: alist = [(i,i,[10]*3) for i in range(5)]
In [54]: np.array(alist, dt)
Out[54]: 
array([(0, 0., [10., 10., 10.]), (1, 1., [10., 10., 10.]),
       (2, 2., [10., 10., 10.]), (3, 3., [10., 10., 10.]),
       (4, 4., [10., 10., 10.])],
      dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])

Obviously you could also do:

for i, row in enumerate(alist):
    asar[i] = row

enumerate is a nice idiomatic way of generating an index along with the values. But so is range(n).
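As for the ValueError mentioned in the question: when converting a nested list, the rows must be tuples; rows that are themselves lists make the input look ragged to numpy and the conversion fails. A sketch of both cases (sample values are placeholders):

```python
import numpy as np

m = 3
dt = np.dtype([('f0', 'i2'), ('f1', 'f8'), ('f2', 'f8', (m,))])

# Rows as tuples: converts cleanly.
good = [(i, float(i), [0.5] * m) for i in range(4)]
arr = np.array(good, dtype=dt)

# The same rows as lists: numpy cannot match them to the
# structured dtype and raises ValueError.
bad = [list(row) for row in good]
raised = False
try:
    np.array(bad, dtype=dt)
except ValueError:
    raised = True
print(arr['f2'].shape, raised)
```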