Indexing a large 3D HDF5 dataset for subsetting based on 2D condition

Question

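I have a large 3D HDF5 dataset that represents location (X,Y) and time for a certain variable. Next, I have a 2D numpy array containing a classification for the same (X,Y) locations. What I would like to achieve is to extract from the 3D HDF5 dataset all time series that fall under a certain class in the 2D array.

Here is my example: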

import numpy as np
import h5py

# Open the HDF5 dataset
NDVI_file = 'NDVI_values.hdf5'
f_NDVI = h5py.File(NDVI_file,'r')
NDVI_data = f_NDVI["NDVI"]

# See what's in the dataset
print(NDVI_data)
# <HDF5 dataset "NDVI": shape (1319, 2063, 53), type "<f4">

# Let's make a random 1319 x 2063 classification containing class numbers 0-4
classification = np.random.randint(5, size=(1319, 2063))

Now we have the 3D HDF5 dataset and a 2D classification. Let's look for the pixels that fall under class number '3':

# Look for the X,Y locations that have class number '3'
idx = np.where(classification == 3)

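This returns a tuple of size 2 that contains the X,Y pairs matching the condition; in my random example the number of pairs is 544433. How should I now use this idx variable to create a 2D array of size (544433,53) that contains the 544433 time series of the pixels that have classification class number '3'?

I did some testing with fancy indexing on a pure 3D numpy array, and this example works just fine: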

subset = numpy_array_3D[idx[0], idx[1], :]
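To make the expected result concrete, here is a minimal sketch of that in-memory case on a tiny array. The names `cube` and `labels` and the sizes are made up for illustration; they stand in for the full 3D array and the classification. It also shows that a boolean mask gives the same rows as the np.where tuple.

```python
import numpy as np

rng = np.random.default_rng(0)
cube = rng.random((6, 4, 3))              # hypothetical stand-in for an (X, Y, time) array
labels = rng.integers(0, 5, size=(6, 4))  # hypothetical stand-in classification, classes 0-4

idx = np.where(labels == 3)               # tuple of two index arrays (X positions, Y positions)
by_index = cube[idx[0], idx[1], :]        # one row per matching pixel, one column per time step
by_mask = cube[labels == 3]               # boolean mask over the first two axes, same rows

print(by_index.shape)
```

Both forms return an array of shape (number of matching pixels, number of time steps), with the pixels in row-major order.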

However, the HDF5 dataset is too large to convert to a numpy array, and when I try to use the same indexing method directly on the HDF5 dataset:

# Try to use fancy indexing directly on HDF5 dataset
NDVI_subset = np.array(NDVI_data[idx[0],idx[1],:])

it throws an error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper     (C:\aroot\work\h5py\_objects.c:2584)
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (C:\aroot\work\h5py\_objects.c:2543)
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\dataset.py", line 431, in __getitem__
selection = sel.select(self.shape, args, dsid=self.id)
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 95, in select
sel[args]
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 429, in __getitem__
raise TypeError("Indexing elements must be in increasing order")
TypeError: Indexing elements must be in increasing order

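Another thing I tried is to np.repeat the classification array along the 3rd dimension, to create a 3D array that matches the shape of the HDF5 dataset. The idx variable then becomes a tuple of size 3: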

classification_3D = np.repeat(np.reshape(classification,(1319,2063,1)),53,axis=2)
idx = np.where(classification_3D == 3)

But the following statement throws the exact same error:

NDVI_subset = np.array(NDVI_data[idx])

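Is this because the HDF5 dataset works differently from a pure numpy array? The documentation does say "Selection coordinates must be given in increasing order".

In that case, does anyone have a suggestion on how I could get this to work without having to read the full HDF5 dataset into memory (which does not work)? Thank you very much!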

Answer 1

Advanced/fancy indexing in h5py is not nearly as general as with np.ndarray.

Set up a small test case:

import numpy as np
import h5py
f=h5py.File('test.h5','w')
dset=f.create_dataset('data',(5,3,2),dtype='i')
dset[...]=np.arange(5*3*2).reshape(5,3,2)
x=np.arange(5*3*2).reshape(5,3,2)

ind=np.where(x%2)

With the numpy array I can select all the odd values, but the same index tuple fails on the h5py dataset:

In [202]: ind
Out[202]: 
(array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=int32),
 array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2], dtype=int32),
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32))

In [203]: x[ind]
Out[203]: array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29])
In [204]: dset[ind]
...
TypeError: Indexing elements must be in increasing order

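I can index on a single dimension with a list, e.g. dset[[1,2,3],...], but repeating index values or changing the order produces an error: dset[[1,1,2,2],...] or dset[[2,1,0],...]. dset[:,[0,1],:] is OK.

Several slices are fine, dset[0:3,1:3,:], or a slice and a list, dset[0:3,[1,2],:].

But 2 lists, dset[[0,1,2],[1,2],:], produce a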

TypeError: Only one indexing vector or array is currently allowed for advanced selection

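So the index tuple from np.where is wrong in several ways: it supplies more than one index array, and the first array contains repeated values.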
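For the single-list case, the ordering and no-repeats restriction can be sidestepped by reading only the unique, sorted indices and re-expanding in memory. The following is a rough sketch, not something h5py provides: `take_along_first_axis` is a hypothetical helper name, and the in-memory file (driver="core") is used only so the example leaves nothing on disk.

```python
import numpy as np
import h5py

def take_along_first_axis(dset, indices):
    """Hypothetical helper: index axis 0 of `dset` with an arbitrary 1D list.

    h5py wants the list in increasing order without repeats, so read the
    unique sorted rows once and rebuild the requested order with numpy.
    """
    indices = np.asarray(indices)
    unique_sorted, inverse = np.unique(indices, return_inverse=True)
    rows = dset[unique_sorted.tolist(), ...]   # valid h5py selection
    return rows[inverse]                       # numpy restores order and repeats

x = np.arange(5 * 3 * 2).reshape(5, 3, 2)
with h5py.File("demo_unique.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("data", data=x)
    wanted = [2, 1, 0, 1]                      # unsorted, with a repeat
    result = take_along_first_axis(dset, wanted)

print(result.shape)
```

This only covers one index list on one axis; it does not help with two index arrays at once, which is what np.where produces for a 2D condition.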

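I don't know how much of this is a constraint of the h5 storage, and how much is just incomplete development in the h5py module. Maybe a bit of both.

So you need to load simpler chunks from the file, and perform the fancier indexing on the resulting numpy arrays.

In my odd values case, I just need to do: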

In [225]: dset[:,:,1]
Out[225]: 
array([[ 1,  3,  5],
       [ 7,  9, 11],
       [13, 15, 17],
       [19, 21, 23],
       [25, 27, 29]])
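Applied to the question's layout, "load simpler chunks, then index in numpy" could look roughly like the sketch below. It is an illustration under assumptions, not a tuned solution: `extract_class_series` and `rows_per_block` are hypothetical names, the small random arrays stand in for the real NDVI file, and the in-memory file (driver="core") is only there to keep the example self-contained.

```python
import numpy as np
import h5py

def extract_class_series(dset, classification, class_id, rows_per_block=64):
    """Hypothetical helper: time series of all pixels labelled `class_id`.

    Reads `dset` (shape X, Y, T) in blocks of rows using plain slices,
    then applies a boolean mask to each block in memory.
    """
    pieces = []
    for start in range(0, dset.shape[0], rows_per_block):
        stop = min(start + rows_per_block, dset.shape[0])
        mask = classification[start:stop] == class_id  # 2D mask for this block
        if not mask.any():
            continue                                   # nothing to read here
        block = dset[start:stop, :, :]                 # plain slice, allowed by h5py
        pieces.append(block[mask])                     # numpy mask -> (n_hits, T)
    if not pieces:
        return np.empty((0, dset.shape[2]), dtype=dset.dtype)
    return np.concatenate(pieces, axis=0)

# Small stand-in data instead of the real (1319, 2063, 53) dataset
rng = np.random.default_rng(0)
data = rng.random((20, 30, 5)).astype("f4")
classification = rng.integers(0, 5, size=(20, 30))

with h5py.File("demo_blocks.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("NDVI", data=data)
    subset = extract_class_series(dset, classification, class_id=3, rows_per_block=7)

# Reference result from the fully in-memory array, for comparison
idx = np.where(classification == 3)
expected = data[idx[0], idx[1], :]
print(subset.shape, np.array_equal(subset, expected))
```

Because blocks are visited in row order and each mask is applied in row-major order, the rows of `subset` come out in the same order as with the np.where tuple. For the sizes in the question, a block of 64 rows of float32 is about 28 MB and the (544433, 53) result about 115 MB, so only the result and one block need to be in memory at a time.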
