Optimal generation of numpy associative array with subarrays and variable length
I have Python-generated data of the form

fa  fb  fc
fa1 fb1 [fc01, fc02,..., fc0m]
fa2 fb2 [fc11, fc12,..., fc1m]
... ... ...
fan fbn [fcn1, fcn2,..., fcnm]

I need to create a Python-compatible data structure to store it that keeps creation as simple as possible while minimizing memory usage and read/write time. I need to be able to identify columns by field name (i.e. retrieve fa1 with something like data['fa'][0]). The fa values are integers; fb and fc are floats. Both m and n are unknown before runtime, but both are known before the data is inserted into the structure, and neither changes afterwards. m will not exceed 1000 and n will not exceed 10000. The data is generated one row at a time.
So far I have been using a numpy associative (structured) array asar with dtype=[('f0','i2'), ('f1','f8'), ('f2', 'f8', (m,))]. However, since I can't add a new row to a numpy array without deleting and recreating the array each time, I have been using a separate counter variable ind_n: I create asar with asar = numpy.zeros(n, dtype=dtype), overwrite the zeros of asar[ind_n] with the data to be added, and increment ind_n until it reaches n. This works, but it seems there must be a better solution (or at least one that lets me eliminate ind_n). Is there a standard way to create a skeleton of asar (perhaps with something like np.zeros()) and then insert each row of data into the first still-zero row? Or a way to convert a standard Python nested list into the associative array after the nested list has been fully generated? (I know that conversion can certainly be done, but I run into problems when the subarrays are converted (e.g. ValueError: setting an array element with a sequence.).)
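For reference, a minimal runnable sketch of the counter-based approach described above (m, n and the row data here are placeholders; in practice they come from the generating code, which knows both before insertion starts):

```python
import numpy as np

m, n = 3, 5
dtype = [('fa', 'i2'), ('fb', 'f8'), ('fc', 'f8', (m,))]

asar = np.zeros(n, dtype=dtype)
ind_n = 0
# Rows arrive one at a time; each is a tuple (int, float, length-m list).
for row in ((i, float(i), [float(i)] * m) for i in range(n)):
    asar[ind_n] = row      # overwrite the zero row with real data
    ind_n += 1

# Columns are addressable by field name, as required:
print(asar['fa'][0])
print(asar['fc'].shape)    # the subarray field stacks into shape (n, m)
```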
Your solution is basically correct, given that you know n by the time the first record is created.

You can use np.empty instead of np.zeros to save a little (but not much) time.

If you dislike the explicit ind_n, you can create an array iterator instead:
>>> import numpy as np
>>> m = 5
>>> n = 7
>>> dt = [('col1', 'i2'), ('col2', float), ('col3', float, (m,))]
>>> data = [(np.random.randint(10), np.random.random(), np.random.random((m,))) for _ in range(n)]
>>>
>>> rec = np.empty((n,), dt)
>>> irec = np.nditer(rec, op_flags=[['readwrite']], flags=['c_index'])
>>>
>>> for src in data:
... # roughly equivalent to list.append:
... next(irec)[()] = src
... print()
... # getting the currently valid part:
... print(irec.operands[0][:irec.index+1])
...
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
(2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
(2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
(1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
(2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
(1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])
(0, 0.45797521, [0.79193395, 0.69029592, 0.0541346 , 0.49603146, 0.36146384])]
[(9, 0.07368308, [0.44691665, 0.38875103, 0.83522137, 0.39281718, 0.62078615])
(6, 0.82350335, [0.57971597, 0.61270304, 0.05280996, 0.03702404, 0.99159465])
(3, 0.06565234, [0.88921842, 0.21097122, 0.83276431, 0.01824657, 0.49105466])
(2, 0.69806099, [0.87749632, 0.22119474, 0.25623813, 0.26587436, 0.04772489])
(1, 0.77573727, [0.44359522, 0.62471617, 0.65742177, 0.38889958, 0.13901824])
(0, 0.45797521, [0.79193395, 0.69029592, 0.0541346 , 0.49603146, 0.36146384])
(6, 0.85225039, [0.62028917, 0.4895316 , 0.00922578, 0.66836154, 0.53082779])]
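Once the loop finishes, rec is an ordinary structured array, so the fields are reachable by name exactly as the question requires. A condensed, self-contained version of the session above, with a quick check of the resulting shapes (the field names col1/col2/col3 are the ones chosen in this answer, not required by numpy):

```python
import numpy as np

m, n = 5, 7
dt = [('col1', 'i2'), ('col2', float), ('col3', float, (m,))]
data = [(np.random.randint(10), np.random.random(), np.random.random((m,)))
        for _ in range(n)]

rec = np.empty((n,), dt)
irec = np.nditer(rec, op_flags=[['readwrite']], flags=['c_index'])
for src in data:
    next(irec)[()] = src   # fill the next row, like list.append

# After the loop the array behaves like any structured array:
print(rec['col1'])         # all integer values, shape (n,)
print(rec['col3'].shape)   # (n, m): the subarray field stacks into 2-D
```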
In [39]: n, m = 5, 3
In [41]: dt=np.dtype([('f0','i2'), ('f1','f8'), ('f2', 'f8', (m))])
In [45]: asar = np.zeros(n, dt)
In [46]: asar
Out[46]:
array([(0, 0., [0., 0., 0.]), (0, 0., [0., 0., 0.]),
(0, 0., [0., 0., 0.]), (0, 0., [0., 0., 0.]),
(0, 0., [0., 0., 0.])],
dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])
Fill in by field:
In [49]: asar['f0'] = np.arange(5)
In [50]: asar['f1'] = np.random.rand(5)
In [51]: asar['f2'] = np.random.rand(5,3)
In [52]: asar
Out[52]:
array([(0, 0.45120412, [0.86481761, 0.08861093, 0.42212446]),
(1, 0.63926708, [0.43788684, 0.89254029, 0.90637292]),
(2, 0.33844457, [0.80352251, 0.25411018, 0.315124 ]),
(3, 0.24271258, [0.27849709, 0.9905879 , 0.94155558]),
(4, 0.89239324, [0.1580938 , 0.52844036, 0.59092695])],
dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])
Making a list with matching nesting:
In [53]: alist = [(i,i,[10]*3) for i in range(5)]
In [54]: np.array(alist, dt)
Out[54]:
array([(0, 0., [10., 10., 10.]), (1, 1., [10., 10., 10.]),
(2, 2., [10., 10., 10.]), (3, 3., [10., 10., 10.]),
(4, 4., [10., 10., 10.])],
dtype=[('f0', '<i2'), ('f1', '<f8'), ('f2', '<f8', (3,))])
Obviously you could also do:
for i, row in enumerate(alist):
asar[i] = row
enumerate is a nice idiomatic way of generating an index along with a value, but so is range(n).
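On the ValueError mentioned in the question: when converting a nested Python list with a structured dtype, each row must be a tuple (the subarray can stay a list inside it). Rows given as plain lists are not accepted as structured elements and raise the error quoted in the question. A small demonstration (field names and shapes follow the examples above):

```python
import numpy as np

m, n = 3, 4
dt = np.dtype([('f0', 'i2'), ('f1', 'f8'), ('f2', 'f8', (m,))])

# Rows as TUPLES: converts cleanly.
rows_as_tuples = [(i, float(i), [0.5] * m) for i in range(n)]
good = np.array(rows_as_tuples, dtype=dt)
print(good['f2'].shape)    # (n, m)

# Rows as LISTS: numpy cannot map them onto the structured dtype.
rows_as_lists = [[i, float(i), [0.5] * m] for i in range(n)]
try:
    np.array(rows_as_lists, dtype=dt)
except Exception as exc:
    print(type(exc).__name__)
```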