HDF5 Python - Correct way to handle reads from multiple processes?

Question

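I have an image generator that reads batches of 3D tensors from an HDF5 file (via h5py), and it uses the Python multiprocessing library (inherited from Keras Sequence).

I'd like to know whether I'm doing this correctly and whether it can be improved.

My __getitem__ method is called by multiple processes over N iterations. Each time it is called, I open the HDF5 file, read the data for a given set of indices, and immediately close the file (via a context manager).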

    import h5py as h5
    import numpy as np

    def get_dataset_items(self, dataset: str, indices: np.ndarray) -> np.ndarray:
        """Get items from an h5py dataset.

        Arguments
        ---------
        dataset : str
            The dataset key.
        indices : ndarray
            The indices of the current batch.

        Returns
        -------
        np.ndarray
            A batch of elements.
        """
        # Open the file read-only for this call only; the context
        # manager closes it as soon as the batch has been read.
        with h5.File(self.src, 'r') as file:
            return file[dataset][indices]

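This approach seems to work, but I'm really not sure. I have read that reading a file from multiple processes can lead to strange behavior and corrupted data.

I've seen that there is an MPI interface and a SWMR mode.

Could I benefit from these features?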

Answer 1

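This is not a definitive answer, but I ran into problems with compressed data today and found your question while looking for a fix: by giving h5py a Python file object instead of a filename, you can bypass some of the problems and read compressed data via multiprocessing.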

    # using the python file-open overcomes some complex problem
    with open(self.src, "rb") as raw, h5py.File(raw, "r") as hfile:
        # grab the data from hfile
        groups = list(hfile['/'])  # etc

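As far as I understand, HDF tries to "optimize" disk IO for compressed (chunked) data. If multiple processes try to read the same chunks, you probably don't want to decompress them once per process. This is what creates the mess. With a Python file object, we can hope that the library no longer knows the processes are looking at the same data and stops trying to help.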
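Here is a minimal sketch of how the file-object approach could be wrapped for per-batch reads from worker processes. It assumes an h5py version that accepts Python file-like objects. The names `read_batch`, `_read_job` and `read_batches_in_parallel` are made up for this illustration and are not part of h5py or Keras. It also sorts the indices before reading, because h5py expects index lists in increasing order, and then restores the requested order.

```python
from concurrent.futures import ProcessPoolExecutor

import h5py
import numpy as np


def read_batch(path, dataset, indices):
    """Open the file, read one batch, and close the file again (hypothetical helper)."""
    # h5py expects list-style indices in increasing order without repeats,
    # so read the sorted unique rows and then restore the requested order.
    unique_sorted, inverse = np.unique(np.asarray(indices), return_inverse=True)
    with open(path, "rb") as raw:            # plain Python file object
        with h5py.File(raw, "r") as hfile:   # h5py reads through that object
            rows = hfile[dataset][unique_sorted]
    return rows[inverse]


def _read_job(job):
    # Module-level wrapper so the job can be pickled for worker processes.
    return read_batch(*job)


def read_batches_in_parallel(path, dataset, batches, workers=2):
    """Read each batch of indices in a separate worker process."""
    jobs = [(path, dataset, batch) for batch in batches]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_read_job, jobs))
```

Every call opens and closes its own handle, so no h5py object is ever shared between processes. On platforms that start workers by spawning a fresh interpreter, `read_batches_in_parallel` should be called from under an `if __name__ == "__main__":` guard.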


python

hdf5

h5py