How do I add values in parallel to an existing HDF5 file with 3 groups and 12 datasets in each group using h5py?

Question

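I installed the libraries using this link. I then created an HDF5 file named test.h5 by running mpiexec -n 1 python3 test.py. test.py is shown below. I am not sure whether mpi4py is actually needed here, so please let me know.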

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

f = h5py.File('test.h5', 'w', driver='mpio', comm=comm)

f.create_group('t1')
f.create_group('t2')
f.create_group('t3')

for i in range(12):
    f['t1'].create_dataset('test{0}'.format(i), (1,), dtype='f', compression='gzip')
    f['t2'].create_dataset('test{0}'.format(i), (1,), dtype='i', compression='gzip')
    f['t3'].create_dataset('test{0}'.format(i), (1,), dtype='i', compression='gzip')

f.close()

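Now I want to write a test1.py file that will:

Open test.h5 and get all the unique keys (they are the same in all three groups).
Split these keys into chunks, e.g. chunks = [['test0','test1','test2'],['test3','test4','test5'],['test6','test7','test8'],['test9','test10','test11']]. I don't care about the order or grouping of the chunks, but I want one chunk per process.
Assign each chunk to a process, which then stores a value for every key of that chunk in each group. In other words, I want to run this function in parallel: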

def write_h5(f, rank, chunks):
    for key in chunks[rank]:
        f['t1'][key][:] += 0.5
        f['t2'][key][:] += 1
        f['t3'][key][:] += 1

How can I do this? Could you explain it in detail? Thanks in advance!

Answer 1

test1.py should contain:

from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

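def chunk_seq(seq, num):
    # Split seq into roughly num consecutive slices of average length avg.
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out

def write_h5(f, chunk):
    # Read-modify-write every dataset named in this rank's chunk, in all three groups.
    for key in chunk:
        f['t1'][key][:] += 0.5
        f['t2'][key][:] += 1
        f['t3'][key][:] += 1

# Open the existing file in append mode with the MPI-IO driver; every rank opens it.
f = h5py.File('test.h5', 'a', driver='mpio', comm=comm)
# Every rank computes the same chunk list and then picks its own entry.
chunks = chunk_seq(list(f['t1'].keys()), size)

write_h5(f, chunks[rank])

f.close()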

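Run test1.py with: mpiexec -n 4 python3 test1.py. The catch is that this only works if you did not set compression='gzip' when creating the datasets. For reference, see the question "Does HDF5 support compression with parallel HDF5? If not, why?", but I'm not sure if this holds true for the latest version. Looking at this, it seems that you have to read each dataset serially and create the corresponding dataset in a new, compressed HDF5 file.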
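The serial re-compression step mentioned above might look roughly like the following. Treat it as a sketch under assumptions, not a tested recipe for your data: the function name recompress and the output file name in the usage comment are made up for illustration, and it assumes the layout from the question (top-level groups that contain only datasets). It uses plain h5py without the mpio driver, so it is meant to run as an ordinary single-process script after the parallel job has finished.

```python
import h5py


def recompress(src_path, dst_path, compression="gzip"):
    """Copy every dataset from src_path into a new file, compressing each one.

    Assumes the source file holds only top-level groups whose members
    are all datasets (the layout of test.h5 in the question).
    """
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        for group_name, group in src.items():
            out_group = dst.create_group(group_name)
            for name, dset in group.items():
                # dset[...] reads the whole dataset into memory as a NumPy array.
                out_group.create_dataset(name, data=dset[...], compression=compression)


# Hypothetical usage (the output file name is an assumption):
# recompress("test.h5", "test_compressed.h5")
```

Since each dataset is read fully into memory before it is written, this is fine for the tiny datasets in the question but would need to copy in slices for very large ones.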

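A side note on chunk_seq in test1.py: it relies on floating-point accumulation, so it is hard to be sure it always returns exactly size chunks when the number of keys is not a multiple of the number of processes, and chunks[rank] would raise an IndexError if it returned fewer. Since the question says the order and grouping of the chunks do not matter, one possible alternative is to let each rank take every size-th key. The helper name keys_for_rank below is my own, not part of h5py or mpi4py.

```python
def keys_for_rank(keys, rank, size):
    """Return the keys that one rank should handle (round-robin split)."""
    # Sort first so every rank works from the same ordering, then take
    # every size-th key starting at this rank's index.
    return sorted(keys)[rank::size]


# Example with the 12 dataset names from the question and 4 processes.
keys = ['test{0}'.format(i) for i in range(12)]
chunks = [keys_for_rank(keys, r, 4) for r in range(4)]
```

In test1.py this would replace the chunks = chunk_seq(...) line with something like write_h5(f, keys_for_rank(f['t1'].keys(), rank, size)). A rank simply gets an empty list if there are more processes than keys.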
python

mpi

hdf5

h5py

mpi4py