Numpy memmap in-place sort of a large matrix by column

Question

I'd like to sort a matrix of shape (N, 2) on its first column, where N >> system memory.

With in-memory numpy you can do:

import numpy as np

x = np.array([[2, 10],[1, 20]])
sortix = x[:,0].argsort()
x = x[sortix]

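But this seemingly requires the result of x[:,0].argsort() to fit in memory, which won't work for a memmap where N >> system memory (please correct me if this assumption is wrong).

Can I achieve this kind of sort in place with numpy memmap?

(Assume heapsort is used for sorting and that the data types are simple numeric types.)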
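To make the concern concrete, here is a rough sketch of the two allocations the in-memory approach needs: the index array returned by argsort and the copy produced by x[sortix]. The helper name in_memory_sort_footprint is hypothetical, and the estimate ignores any temporary buffers numpy may use internally.

```python
import numpy as np

def in_memory_sort_footprint(n_rows, n_cols=2, dtype=np.int64):
    """Estimate bytes needed by x[:, 0].argsort() and by x[sortix].

    Hypothetical helper: counts only the index array and the
    fancy-indexed copy, not numpy's internal scratch space.
    """
    index_bytes = n_rows * np.dtype(np.intp).itemsize      # one index per row
    copy_bytes = n_rows * n_cols * np.dtype(dtype).itemsize  # full copy of x
    return index_bytes, copy_bytes

# Check the estimate against the small example from the question.
x = np.array([[2, 10], [1, 20]])
sortix = x[:, 0].argsort()
index_bytes, copy_bytes = in_memory_sort_footprint(len(x), dtype=x.dtype)
```

With 8-byte indices, N = 10**10 rows would need about 80 GB for the index array alone, before the sorted copy is made.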

Answer 1

The solution may be simple: use the order argument of an in-place sort. Of course, order requires field names, so those have to be added first.

d = x.dtype
x = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])
array([[(2, 10)],
       [(1, 20)]], dtype=[('0', '<i8'), ('1', '<i8')])

The field names are strings corresponding to the column indices. Sorting can then be done in place with

x.sort(order='0', axis=0)

and then converted back to a regular array with the original datatype

x.view(d)
array([[ 1, 20],
       [ 2, 10]])

This should work, although you may need to change how the view is taken depending on how your data is stored on disk. See the docs:

For a.view(some_dtype), if some_dtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array), then the behavior of the view cannot be predicted just from the superficial appearance of a (shown by print(a)). It also depends on exactly how a is stored in memory. Therefore if a is C-ordered versus fortran-ordered, versus defined as a slice or transpose, etc., the view may give different results.
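Putting the steps above together for a memmap, a minimal sketch might look like the following. It assumes the file holds the matrix in row-major (C) order, and it refuses anything else rather than guessing at what the view would mean. The function name sort_rows_in_place and the temporary file are illustrative only.

```python
import os
import tempfile
import numpy as np

def sort_rows_in_place(x, column=0):
    """Sort the rows of a 2-D array or memmap in place by one column.

    Hypothetical helper: only C-contiguous input is accepted, because the
    structured view groups each row into one record only for that layout.
    """
    if x.ndim != 2 or not x.flags['C_CONTIGUOUS']:
        raise ValueError('expected a C-contiguous 2-D array')
    d = x.dtype
    # One field per column, named '0', '1', ... as in the answer above.
    rows = x.view(dtype=[(str(i), d) for i in range(x.shape[-1])])
    rows.sort(order=str(column), axis=0, kind='heapsort')
    return x

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'matrix.dat')
    mm = np.memmap(path, dtype='<i8', mode='w+', shape=(3, 2))
    mm[:] = [[3, 30], [1, 10], [2, 20]]
    sort_rows_in_place(mm)
    mm.flush()                  # write the sorted rows back to the file
    sorted_rows = np.array(mm)  # small in-memory copy, for inspection only
    del mm                      # release the mapping before cleanup
```

The view shares the memmap's buffer, so the sort rearranges the file-backed data itself and no second copy of the matrix is created by this code.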

Answer 2

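@user2699 answered the question beautifully. I'm adding this solution as a simplified example in case you don't mind keeping your data as a structured array, which does away with the view.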

import numpy as np

filename = '/tmp/test'
x = np.memmap(filename, dtype=[('index', '<f2'),('other1', '<f2'),('other2', '<f2')], mode='w+', shape=(2,))
x[0] = (2, 10, 30)
x[1] = (1, 20, 20)
print(x.shape)
print(x)
x.sort(order='index', axis=0, kind='heapsort')
print(x)

(2,)
[(2., 10., 30.) (1., 20., 20.)]
[(1., 20., 20.) (2., 10., 30.)]

The dtype formats are also documented here.
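If the data already exists on disk as a plain (N, 2) matrix, the same idea can be applied by opening the file directly with a structured dtype. This is only a sketch: it assumes the file stores little-endian 8-byte integers in row-major order, and the field names key and value as well as the temporary file are made up for the example.

```python
import os
import tempfile
import numpy as np

# Hypothetical field names; each record covers one row of the (N, 2) matrix.
row_dtype = np.dtype([('key', '<i8'), ('value', '<i8')])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pairs.dat')
    # Stand-in for an existing file of row-major int64 pairs.
    np.array([[3, 30], [1, 10], [2, 20]], dtype='<i8').tofile(path)

    # Derive the number of rows from the file size.
    n_rows = os.path.getsize(path) // row_dtype.itemsize
    records = np.memmap(path, dtype=row_dtype, mode='r+', shape=(n_rows,))
    records.sort(order='key', kind='heapsort')  # in-place sort on 'key'
    records.flush()
    del records  # release the mapping before reading the file back

    # Read the file back as a plain matrix to inspect the result.
    plain = np.fromfile(path, dtype='<i8').reshape(-1, 2)
```

Opening the file this way avoids the view step entirely, at the cost of fixing the record layout when the memmap is created.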

Tags: python, numpy, python-3.6, memmap