How to merge different matlab mat files holding metadata to use in python?

Question

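I have 1,000+ very long matlab vectors (variable length, ~10^8 samples) representing data from different patients and sources. I want to organize them compactly in a single file for convenient access later in python. I would like each sample to carry additional information (patient ID, sampling frequency, etc.) in some way.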

The hierarchy should be:

Hospital 1:
   Pat. 1:
      vector:sample 1
      vector:sample 2

   Pat. 2:
      vector:sample 1
      vector:sample 2


Hospital 2:
   Pat. 1:
      vector:sample 1
      vector:sample 2
    .
    .
    .

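I have considered converting the samples to hdf5 files with metadata added, and then merging the multiple hdf5 files into one, but I ran into difficulties.

Already tried:

matlab: High-level hdf5 matlab functions.
matlab: saving the variables as v7.3 mat (which is actually hdf5)
python: sidekit_io.h5merge

Suggestions are welcome!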

Answer 1

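Regarding the format you gave above, you might want to store the vectors in a matrix. For a sample from a patient with hospital: 2, pat_ID: 3455679, age: 34, high_blood_pressure: NO (0 binary), you can store the metadata under the columns "Hospital number", "patient ID", "age", "high_blood_pressure", ... as 2, 3455679, 34, 0, ...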

a = (1:10)';  % vector 1
b = (1:10)';  % vector 2
c = [a, b];   % matrix holding vectors 1 and 2 as columns (both must have the same length)
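
For later use from python, the numeric metadata row described above could be mirrored as a NumPy structured array. This is only a sketch: the field names follow the example in this answer, while the dtypes are assumptions chosen for illustration.

```python
import numpy as np

# Field names follow the example above; the dtypes are assumed, not prescribed.
meta_dtype = np.dtype([
    ("hospital", np.int32),
    ("pat_ID", np.int64),
    ("age", np.int16),
    ("high_blood_pressure", np.uint8),  # 0 = NO, 1 = YES
])

# One metadata row per sample: hospital 2, pat_ID 3455679, age 34, no high blood pressure.
meta = np.array([(2, 3455679, 34, 0)], dtype=meta_dtype)

# Python counterpart of the MATLAB snippet: two equal-length vectors as matrix columns.
a = np.arange(1, 11)
b = np.arange(1, 11)
c = np.column_stack((a, b))

print(meta["pat_ID"][0], c.shape)
```

Named fields avoid having to remember which column holds which value. Stacking vectors into one matrix only works when they all have the same length.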

Answer 2

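I see at least two HDF5 approaches. You can copy all of the data into one file. Gigabytes of data are not a problem for HDF5 (given sufficient resources). Alternatively, you can keep the patient data in separate files and use external links in a central HDF5 file to point to that data. Once the links are created, you can access the data "as if" it were in the central file. Both methods shown below use small, simple "samples" created with Numpy random. Each sample is a dataset, with attributes holding the hospital, patient, and sample ID.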

Method 1: All data in one file

import h5py
import numpy as np

num_h = 3  # number of hospitals
num_p = 5  # patients per hospital
num_s = 2  # samples per patient

with h5py.File('SO_59556149.h5', 'w') as h5f:

    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            for s_cnt in range(num_s):
                ds_name = 'H_' + str(h_cnt) + \
                          '_P_' + str(p_cnt) + \
                          '_S_' + str(s_cnt)
                # Create sample vector data and add to a dataset
                vec_arr = np.random.rand(1000,1)
                dset = h5f.create_dataset(ds_name, data=vec_arr )
                # add attributes of Hospital, Patient and Sample ID
                dset.attrs['Hospital ID']=h_cnt
                dset.attrs['Patient ID']=p_cnt
                dset.attrs['Sample ID']=s_cnt
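
As a variation on Method 1, the hospital/patient/sample hierarchy from the question could be mirrored with nested HDF5 groups rather than a flat list of dataset names. The following is a minimal sketch, not part of the original answer: the file name, group names, and the sampling frequency value are made up for illustration. It also shows reading back a slice and the attributes.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name, written to a temporary directory for illustration.
path = os.path.join(tempfile.mkdtemp(), "nested_example.h5")

with h5py.File(path, "w") as h5f:
    for h in range(2):
        for p in range(3):
            # Intermediate groups are created automatically from the path.
            grp = h5f.require_group(f"Hospital_{h}/Pat_{p}")
            grp.attrs["Patient ID"] = p
            for s in range(2):
                dset = grp.create_dataset(f"sample_{s}", data=np.random.rand(1000))
                dset.attrs["Sampling frequency"] = 250.0  # made-up value

found = []

def collect(name, obj):
    # Record the full path of every dataset in the file.
    if isinstance(obj, h5py.Dataset):
        found.append(name)

with h5py.File(path, "r") as h5f:
    h5f.visititems(collect)
    dset = h5f["Hospital_1/Pat_2/sample_0"]
    first_ten = dset[:10]                    # reads only this slice from disk
    fs = dset.attrs["Sampling frequency"]    # per-sample metadata
    pid = dset.parent.attrs["Patient ID"]    # per-patient metadata on the group

print(len(found), first_ten.shape, fs, pid)
```

Slicing a dataset reads only the requested part from disk, which matters for vectors of ~10^8 samples. Attributes on a group let patient-level metadata be stored once instead of being repeated on every sample.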

Method 2: External links to patient data in separate files

import h5py
import numpy as np

num_h = 3  # number of hospitals
num_p = 5  # patients per hospital
num_s = 2  # samples per patient

with h5py.File('SO_59556149_link.h5', 'w') as h5f:

    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            # One separate file per patient
            fname = 'SO_59556149_' + 'H_' + str(h_cnt) + '_P_' + str(p_cnt) + '.h5'
            with h5py.File(fname, 'w') as h5f2:
                for s_cnt in range(num_s):
                    ds_name = 'H_' + str(h_cnt) + \
                              '_P_' + str(p_cnt) + \
                              '_S_' + str(s_cnt)
                    # Create sample vector data and add to a dataset
                    vec_arr = np.random.rand(1000,1)
                    dset = h5f2.create_dataset(ds_name, data=vec_arr )
                    # add attributes of Hospital, Patient and Sample ID
                    dset.attrs['Hospital ID']=h_cnt
                    dset.attrs['Patient ID']=p_cnt
                    dset.attrs['Sample ID']=s_cnt
                    # Link from the central file to the dataset in the patient file
                    h5f[ds_name] = h5py.ExternalLink(fname, ds_name)
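
Reading through an external link looks the same as reading a local dataset. The sketch below builds a single patient file and a central file in a temporary directory and then reads the data back through the link. The file names are hypothetical, and it assumes both files sit in the same directory so that the relative file name stored in the link resolves.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
patient_file = os.path.join(workdir, "patient_H_0_P_0.h5")  # hypothetical name
central_file = os.path.join(workdir, "central_links.h5")    # hypothetical name

# Write one patient file holding a single sample with an attribute.
with h5py.File(patient_file, "w") as h5f2:
    dset = h5f2.create_dataset("H_0_P_0_S_0", data=np.arange(1000.0))
    dset.attrs["Patient ID"] = 0

# The central file stores only a link: a target file name plus a path inside it.
with h5py.File(central_file, "w") as h5f:
    h5f["H_0_P_0_S_0"] = h5py.ExternalLink(
        os.path.basename(patient_file), "H_0_P_0_S_0"
    )

with h5py.File(central_file, "r") as h5f:
    # getlink=True returns the link object itself instead of following it.
    link = h5f.get("H_0_P_0_S_0", getlink=True)
    linked = h5f["H_0_P_0_S_0"]       # follows the link into the patient file
    values = linked[:5]
    patient_id = linked.attrs["Patient ID"]

print(link.filename, values, patient_id)
```

Because the link stores a file name rather than the data, the patient files have to stay reachable from the central file. Moving or renaming them breaks the links.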

Tags: python, matlab, hdf5, bigdata, merging-data