Imported *.mat file ends up "flat" in Python

I am importing a *.mat file into Python via a script that I found on Stack Overflow:

import h5py

def read_matlab(filename):
    """
    Import *.mat-file.
    
    Source: 
    """
    print(f"Importing '{filename}' ...")
    
    def conv(path=''):
        p = path or '/'
        paths[p] = ret = {}
        for k, v in f[p].items():
            if type(v).__name__ == 'Group':
                ret[k] = conv(f'{path}/{k}')  # Nested struct
                continue
            v = v[()]  # It's a Numpy array now
            if v.dtype == 'object':
                # HDF5ObjectReferences are converted
                # into a list of actual pointers
                ret[k] = (
                    [r and paths.get(f[r].name, f[r].name) for r in v.flat]
                    )
            else:
                # Matrices and other numeric arrays
                ret[k] = v if v.ndim < 2 else v.swapaxes(-1, -2)
        return ret

    paths = {}
    with h5py.File(filename, 'r') as f:
        return conv()
    
file = read_matlab("test.mat")

I know that the matrix contained in test.mat has dimensions (1134, 30807). However, file is a dictionary that contains another dictionary with three keys:

file["Y_RMRIO"].keys()
Out[5]: dict_keys(['data', 'ir', 'jc'])

The arrays stored under those keys have the following shapes:

file["Y_RMRIO"]["data"].shape
Out[11]: (22037784,)

file["Y_RMRIO"]["ir"].shape
Out[12]: (22037784,)

file["Y_RMRIO"]["jc"].shape
Out[13]: (1135,)

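How can I import the *.mat file and keep the matrix shape (1134, 30807), or bring the imported data back into that shape (e.g. as an np.array or pd.DataFrame)?

If I'm not mistaken, at least one of the dictionary entries holds information about the "position" of the data points in the matrix. So I guess the data points could be inserted at the right positions in an array with zeros in between (or into an np.zeros array of the right dimensions), and the array could then be reshaped to the desired shape...?

Any help is welcome. Many thanks!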

The file looks a lot simpler than I expected:

In [1]: import h5py
In [2]: f = h5py.File("../Downloads/test.mat")
In [3]: f.keys()
Out[3]: <KeysViewHDF5 ['Y_RMRIO']>
In [4]: f["Y_RMRIO"]
Out[4]: <HDF5 group "/Y_RMRIO" (3 members)>
In [5]: f["Y_RMRIO"].keys()
Out[5]: <KeysViewHDF5 ['data', 'ir', 'jc']>

The dtypes are simple (not object):

In [7]: f["Y_RMRIO/data"]
Out[7]: <HDF5 dataset "data": shape (22037784,), type "<f8">
In [8]: f["Y_RMRIO/ir"]
Out[8]: <HDF5 dataset "ir": shape (22037784,), type "<u8">
In [9]: f["Y_RMRIO/jc"]
Out[9]: <HDF5 dataset "jc": shape (1135,), type "<u8">

A sample of the values:

In [10]: f["Y_RMRIO/data"][:10]
Out[10]: 
array([4.21597593e+01, 1.35612280e+02, 9.33348907e+02, 4.96704718e+01,
       8.64967748e-01, 1.23079072e+00, 6.43015281e+01, 1.49868605e+01,
       3.12984149e+02, 2.01720297e+01])
In [11]: f["Y_RMRIO/ir"][:10]
Out[11]: array([ 1,  2,  3,  4,  6,  7,  8,  9, 10, 11], dtype=uint64)
In [13]: f["Y_RMRIO/jc"][:10]
Out[13]: 
array([     0,  25021,  46743,  69537,  92648, 117807, 117807, 143254,
       165303, 189014], dtype=uint64)

I wondered whether ir and jc are the row and column indices of a sparse matrix:

In [15]: f["Y_RMRIO/ir"][:].max()
Out[15]: 30806
In [16]: f["Y_RMRIO/jc"][:].max()
Out[16]: 22037784
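
Those two maxima fit a compressed sparse column (CSC) layout, in which the nonzero values are stored column by column. As a rough illustration on made-up data (the 3x3 matrix below is a toy, not taken from test.mat), this sketch shows what the three arrays look like for a small matrix. The names ir and jc mirror the file; SciPy calls them indices and indptr.

```python
import numpy as np
from scipy import sparse

# Toy dense matrix (made-up values, not from test.mat).
dense = np.array([[1.0, 0.0, 4.0],
                  [0.0, 0.0, 5.0],
                  [2.0, 3.0, 0.0]])

csc = sparse.csc_matrix(dense)

data = csc.data     # stored values, column by column: [1, 2, 3, 4, 5]
ir = csc.indices    # row index of each stored value:  [0, 2, 2, 0, 1]
jc = csc.indptr     # column k lives in data[jc[k]:jc[k+1]]: [0, 2, 3, 5]

n_rows, n_cols = csc.shape
# Properties that can be checked on any CSC triple:
assert len(jc) == n_cols + 1      # one pointer per column, plus one
assert jc[-1] == len(data)        # last pointer equals the stored count
assert ir.max() < n_rows          # row indices stay inside the matrix
```

Applied to the file, the same reasoning would read: jc has 1135 entries, so 1134 columns; its maximum equals the length of data; and the largest row index is 30806, so at least 30807 rows. Two equal neighbours in jc (117807 appears twice in the sample above) would simply mean a column with no stored values.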

I think jc is the indptr attribute, and ir the indices attribute, of a csc-format sparse matrix:

In [17]: from scipy import sparse
In [18]: M = sparse.csc_matrix((f["Y_RMRIO/data"], f["Y_RMRIO/ir"], f["Y_RMRIO/jc"]))
In [19]: M
Out[19]: 
<30807x1134 sparse matrix of type '<class 'numpy.float64'>'
    with 22037784 stored elements in Compressed Sparse Column format>
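
Note that the result is 30807x1134, the transpose of the (1134, 30807) shape mentioned in the question. A minimal sketch of how the pieces could be wrapped up follows. The helper name sparse_from_group and the toy dict are hypothetical, and inferring the row count from ir assumes the last row holds at least one stored value; pass n_rows explicitly if that may not be true.

```python
import numpy as np
from scipy import sparse


def sparse_from_group(group, n_rows=None):
    """Build a SciPy CSC matrix from a mapping with 'data', 'ir' and 'jc'.

    `group` can be an open h5py group such as f["Y_RMRIO"], or a plain
    dict of NumPy arrays. (Hypothetical helper, not part of any library.)
    """
    data = np.asarray(group["data"][()])
    # The file stores the indices as uint64; cast to a signed integer type.
    ir = np.asarray(group["ir"][()]).astype(np.int64)   # row indices
    jc = np.asarray(group["jc"][()]).astype(np.int64)   # column pointers
    n_cols = len(jc) - 1
    if n_rows is None:
        # Assumption: the last row contains at least one stored value.
        n_rows = int(ir.max()) + 1 if ir.size else 0
    return sparse.csc_matrix((data, ir, jc), shape=(n_rows, n_cols))


# Toy stand-in for the HDF5 group (made-up values, not from test.mat).
toy = {
    "data": np.array([1.0, 2.0, 3.0]),
    "ir": np.array([0, 2, 1], dtype=np.uint64),
    "jc": np.array([0, 2, 3], dtype=np.uint64),
}

M = sparse_from_group(toy)   # 3 rows x 2 columns, as stored
Mt = M.T                     # 2 rows x 3 columns, still sparse
dense = Mt.toarray()         # ordinary np.ndarray with zeros filled in
```

With the real file, the call would presumably look like M = sparse_from_group(f["Y_RMRIO"]) inside a with h5py.File("test.mat", "r") as f: block, and M.T would then have the (1134, 30807) shape. Calling .toarray() on it yields about 34.9 million float64 values, roughly 280 MB, which is worth keeping in mind before wrapping the result in pd.DataFrame. Since about 63% of the entries are stored, the sparse format saves little here.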