Is there a simple way to remove "padding" fields from numpy.dtype.descr?

Question

Context

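Since NumPy version 1.16, if you access multiple fields of a structured array, the dtype of the resulting array has the same item size as the original one, which leads to extra "padding":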

The new behavior as of Numpy 1.16 leads to extra “padding” bytes at the location of unindexed fields compared to 1.15. You will need to update any code which depends on the data having a “packed” layout.

This can lead to problems, e.g. if you want to add fields to the array in question later on:

import numpy as np
import numpy.lib.recfunctions


a = np.array(
    [
        (10.0, 13.5, 1248, -2),
        (20.0, 0.0, 0, 0),
        (30.0, 0.0, 0, 0),
        (40.0, 0.0, 0, 0),
        (50.0, 0.0, 0, 999)
    ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
)  # some array stolen from here: 
print(a.shape, a.dtype, a.dtype.names, a.dtype.descr)
# all good so far

b = a[['x', 'i']]  # for further processing I only need certain fields
print(b.shape, b.dtype, b.dtype.names, b.dtype.descr)
# you will only notice the extra padding in the descr

# b = np.lib.recfunctions.repack_fields(b)
# workaround

# now when I add fields, this becomes an issue
c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1

print(c.dtype.names)
print(c['f1'])
# the void fields are filled with raw data and were given proper names
# that can be accessed

A workaround is to use numpy.lib.recfunctions.repack_fields, which removes the padding, and I will use this in the future, but for my previous code I need a fix. (There can be issues with recfunctions, though, as the module might not be found; that is the case for me, hence the additional import numpy.lib.recfunctions statement.)
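For reference, a minimal sketch of that workaround, assuming NumPy 1.16 or later. The array is a zero-filled stand-in with the same dtype as a above, not the original data, and the rf alias is only a convenience.

```python
import numpy as np
import numpy.lib.recfunctions as rf  # explicit submodule import, as noted above

# Stand-in array with the same field layout as `a` in the question
a = np.zeros(5, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])

b = a[['x', 'i']]              # multi-field view, keeps the 32-byte item size
packed = rf.repack_fields(b)   # packed copy without the padding bytes

print(b.dtype.itemsize, b.dtype.descr)
print(packed.dtype.itemsize, packed.dtype.descr)
```

The packed copy no longer shares memory with a, which is fine here because the next step allocates a new array anyway.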

Question

This is the part of the code I used to add fields to an array (based on):

c = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
c[list(b.dtype.names)] = b
c['c'] = 1

Although (now that I know of it) numpy.lib.recfunctions.require_fields might be more appropriate for adding a field, I still need a way to remove the empty fields from b.dtype.descr:

[('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]

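This is just a list of tuples, so I guess I could construct a more or less awkward way (along the lines of descr.remove(('', '|V8'))) to deal with this. However, I was wondering if there is a better way, especially since the size of the voids depends on the number of left-out fields: two omitted fields in a row give one V16 entry rather than two V8 entries, and so on. So the code would become clunky quickly.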
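As a concrete illustration of the merged gap, here is a minimal sketch using a zero-filled stand-in array with the same dtype as a. It assumes NumPy 1.16 or later.

```python
import numpy as np

a = np.zeros(3, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])

# Skipping the adjacent fields 'y' and 'i' yields one merged padding entry
descr = a[['x', 'j']].dtype.descr
print(descr)

# A fixed-size tuple is therefore not present, so descr.remove() would raise
print(('', '|V8') in descr)
```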

Answer 1

In [237]: a = np.array(
     ...:     [
     ...:         (10.0, 13.5, 1248, -2),
     ...:         (20.0, 0.0, 0, 0),
     ...:         (30.0, 0.0, 0, 0),
     ...:         (40.0, 0.0, 0, 0),
     ...:         (50.0, 0.0, 0, 999)
     ...:     ], dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')]
     ...:     )
In [238]: a
Out[238]: 
array([(10., 13.5, 1248,  -2), (20.,  0. ,    0,   0),
       (30.,  0. ,    0,   0), (40.,  0. ,    0,   0),
       (50.,  0. ,    0, 999)],
      dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])

The b view:

In [240]: b = a[['x','i']]
In [241]: b
Out[241]: 
array([(10., 1248), (20.,    0), (30.,    0), (40.,    0), (50.,    0)],
      dtype={'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})

The repacked copy:

In [242]: import numpy.lib.recfunctions as rf
In [243]: c = rf.repack_fields(b)
In [244]: c
Out[244]: 
array([(10., 1248), (20.,    0), (30.,    0), (40.,    0), (50.,    0)],
      dtype=[('x', '<f8'), ('i', '<i8')])
In [245]: c.dtype
Out[245]: dtype([('x', '<f8'), ('i', '<i8')])

Your overly padded attempt at adding a field:

In [247]: d = np.empty(b.shape, dtype=b.dtype.descr + [('c', 'i4')])
     ...: d[list(b.dtype.names)] = b
     ...: d['c'] = 1
In [248]: d
Out[248]: 
array([(10., b'\x00\x00\x00\x00\x00\x00\x00\x00', 1248, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
       (20., b'\x00\x00\x00\x00\x00\x00\x00\x00',    0, b'\x00\x00\x00\x00\x00\x00\x00\x00', 1),
       ...],
      dtype=[('x', '<f8'), ('f1', 'V8'), ('i', '<i8'), ('f3', 'V8'), ('c', '<i4')])

My first attempt at making a dtype that doesn't include the void fields. I don't know whether simply testing for V is robust enough:

In [253]: [des for des in b.dtype.descr if 'V' not in des[1]]
Out[253]: [('x', '<f8'), ('i', '<i8')]
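One case where the V test would go wrong, as a sketch: a hypothetical array that has a legitimate void field (called raw here, which is not part of the question's data). Filtering on the format string drops it together with the padding, while filtering on the empty name keeps it.

```python
import numpy as np

# Hypothetical layout with a legitimate 4-byte void field named 'raw'
a = np.zeros(3, dtype=[('x', '<f8'), ('y', '<f8'), ('raw', 'V4'), ('i', '<i8')])
b = a[['x', 'raw', 'i']]
print(b.dtype.descr)

by_format = [des for des in b.dtype.descr if 'V' not in des[1]]
by_name = [des for des in b.dtype.descr if des[0] != '']
print(by_format)  # drops 'raw' along with the padding
print(by_name)    # keeps 'raw', drops only the padding
```

So for arrays without real void fields the V test is fine, but checking the name looks like the safer habit.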

And make a new dtype from that:

In [254]: d_dtype = _ + [('c','i4')]

All of this is normal Python list and tuple manipulation. I've seen the same in other recfunctions, and I suspect repack_fields does something similar.

Now we make a new array with the simpler dtype:

In [255]: d = np.empty(b.shape, dtype=d_dtype)
In [256]: d[list(b.dtype.names)] = b
     ...: d['c'] = 1
In [257]: d
Out[257]: 
array([(10., 1248, 1), (20.,    0, 1), (30.,    0, 1), (40.,    0, 1),
       (50.,    0, 1)], dtype=[('x', '<f8'), ('i', '<i8'), ('c', '<i4')])
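The question also mentions require_fields. Below is a sketch of the same step using it, assuming NumPy 1.16 or later and a stand-in for a that only fills x. As far as I can tell it copies fields by name and zero-fills any field missing from the source, so the padded descr of b never enters the picture.

```python
import numpy as np
import numpy.lib.recfunctions as rf

a = np.zeros(5, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
a['x'] = [10.0, 20.0, 30.0, 40.0, 50.0]
b = a[['x', 'i']]

# Target dtype is written out explicitly, so no descr filtering is needed
target = np.dtype([('x', '<f8'), ('i', '<i8'), ('c', 'i4')])
d = rf.require_fields(b, target)  # 'c' is new, so it starts out as zeros
d['c'] = 1
print(d)
```

The trade-off is that the target dtype has to be spelled out by hand instead of being derived from b.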

I extracted the code that builds the new, unpadded dtype from repack_fields:

In [262]: def foo(a):
     ...:     fieldinfo = []
     ...:     for name in a.names:
     ...:         tup = a.fields[name]
     ...:         fmt = tup[0]
     ...:         if len(tup) == 3:
     ...:             name = (tup[2], name)
     ...:         fieldinfo.append((name, fmt))
     ...:     print(fieldinfo)
     ...:     dt = np.dtype(fieldinfo)
     ...:     return dt
     ...: 
     ...: 
In [263]: foo(b.dtype)
[('x', dtype('float64')), ('i', dtype('int64'))]
Out[263]: dtype([('x', '<f8'), ('i', '<i8')])

This works from dtype.fields rather than dtype.descr. One is a dict, the other a list.

In [274]: b.dtype
Out[274]: dtype({'names':['x','i'], 'formats':['<f8','<i8'], 'offsets':[0,16], 'itemsize':32})
In [275]: b.dtype.descr
Out[275]: [('x', '<f8'), ('', '|V8'), ('i', '<i8'), ('', '|V8')]
In [276]: b.dtype.fields
Out[276]: mappingproxy({'x': (dtype('float64'), 0), 'i': (dtype('int64'), 16)})
In [277]: b.dtype.fields['x']
Out[277]: (dtype('float64'), 0)
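Since fields carries each field's dtype and offset, it can also be used to check whether a dtype is padded at all. This is a rough sketch that assumes the fields do not overlap; the array is a zero-filled stand-in for a.

```python
import numpy as np

a = np.zeros(5, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
dt = a[['x', 'i']].dtype

# Bytes actually occupied by the named fields
used = sum(dt.fields[name][0].itemsize for name in dt.names)

print(used, dt.itemsize)
print(used != dt.itemsize)  # True means there are padding bytes somewhere
```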

Another way of getting the valid descr tuples from b.dtype:

In [278]: [des for des in b.dtype.descr if des[0] in b.dtype.names]
Out[278]: [('x', '<f8'), ('i', '<i8')]
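Putting the pieces together, a possible sketch of two small helpers. The names descr_without_padding and add_field are made up for this example and are not part of NumPy; the array is a stand-in for a with only a few values filled in.

```python
import numpy as np

def descr_without_padding(dtype):
    """Return dtype.descr minus the unnamed padding entries (hypothetical helper)."""
    # Keeps only entries whose name is a real field name. Fields with titles
    # appear as (title, name) tuples in descr and are not handled here.
    return [des for des in dtype.descr if des[0] in dtype.names]

def add_field(arr, name, fmt, value):
    """Copy arr into a packed array with one extra field (hypothetical helper)."""
    new_dtype = descr_without_padding(arr.dtype) + [(name, fmt)]
    out = np.empty(arr.shape, dtype=new_dtype)
    out[list(arr.dtype.names)] = arr  # same assignment as in the session above
    out[name] = value
    return out

a = np.zeros(5, dtype=[('x', '<f8'), ('y', '<f8'), ('i', '<i8'), ('j', '<i8')])
a['x'] = [10.0, 20.0, 30.0, 40.0, 50.0]
a['i'][0] = 1248
b = a[['x', 'i']]

d = add_field(b, 'c', 'i4', 1)
print(d.dtype.descr)
print(d)
```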
