How to split train and test data from a .mat file in sklearn?

Question

I have the MNIST dataset as a .mat file and want to split it into training and test data with sklearn. sklearn reads the .mat file as follows:

{'__header__': b'MATLAB 5.0 MAT-file, Platform: GLNXA64, Created on: Sat Oct  8 18:13:47 2016',
 '__version__': '1.0',
 '__globals__': [],
 'train_fea1': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'train_gnd1': array([[ 1],
        [ 1],
        [ 1],
        ...,
        [10],
        [10],
        [10]], dtype=uint8),
 'test_fea1': array([[ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ...,  0,  0,  0],
        ...,
        [ 0,  0,  0, ...,  0,  0,  0],
        [ 0,  0,  0, ..., 64,  0,  0],
        [ 0,  0,  0, ..., 25,  0,  0]], dtype=uint8),
 'test_gnd1': array([[ 1],
        [ 1],
        [ 1],
        ...,
        [10],
        [10],
        [10]], dtype=uint8)}

How can I do this?

Answer 1

I assume you mean that you are using scipy, not sklearn, to load the .mat data file into Python. Essentially, a .mat data file can be loaded like this:

import scipy.io
scipy.io.loadmat('your_dot_mat_file')

scipy reads it as a Python dictionary. So in your case, the data you read is already split into a training set, train_fea1, with training labels train_gnd1, and a test set, test_fea1, with test labels test_gnd1.
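To see what loadmat actually returned, it can help to list the non-metadata keys together with their array shapes. The sketch below is only an illustration: it first writes a tiny stand-in file (a hypothetical demo.mat with made-up arrays that reuse the key names from the question) so that it runs without the real dataset.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Build a tiny stand-in .mat file (made-up data, same key names as in the question).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.mat")
sio.savemat(path, {
    "train_fea1": np.zeros((6, 4), dtype=np.uint8),
    "train_gnd1": np.array([[1], [1], [2], [2], [3], [3]], dtype=np.uint8),
    "test_fea1": np.zeros((3, 4), dtype=np.uint8),
    "test_gnd1": np.array([[1], [2], [3]], dtype=np.uint8),
})

data = sio.loadmat(path)

# Keys wrapped in double underscores are file metadata, not stored variables.
shapes = {k: v.shape for k, v in data.items() if not k.startswith("__")}
for name, shape in shapes.items():
    print(name, shape)
```

With the real file, only the path passed to loadmat would change; the filtering on the double-underscore prefix stays the same.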

To access your data, you can do:

import scipy.io as sio
data = sio.loadmat('filename.mat')

train = data['train_fea1']
trainlabel = data['train_gnd1']

test = data['test_fea1']
testlabel = data['test_gnd1']
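As the output in the question shows, the label arrays come back as column vectors of shape (n, 1). Many scikit-learn estimators expect 1-D labels and warn when given a column vector, so flattening the labels is a common extra step. A minimal sketch, using a made-up stand-in for data['train_gnd1']:

```python
import numpy as np

# Stand-in for data['train_gnd1']: a column vector like the one loadmat returns.
trainlabel = np.array([[1], [1], [10], [10]], dtype=np.uint8)

# ravel() flattens (n, 1) to (n,) without changing the values.
trainlabel_1d = trainlabel.ravel()

print(trainlabel.shape, trainlabel_1d.shape)
```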

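However, if you would rather split the data yourself with sklearn's train_test_split, you can first combine the features and labels from both sets and then split them randomly, like this (after loading the data as shown above):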

import numpy as np
from sklearn.model_selection import train_test_split

X = np.vstack((train,test))
y = np.vstack((trainlabel, testlabel))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # fixed seed for a reproducible split
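For a classification dataset such as MNIST, it may also be worth passing stratify=y so that each class keeps roughly the same proportion in both parts. This is an optional variation rather than part of the approach above, and the sketch below uses small made-up arrays in place of the stacked data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the stacked features and labels: 10 classes, 10 samples each.
X = np.arange(100 * 4).reshape(100, 4)
y = np.repeat(np.arange(1, 11), 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
# Per-class counts in the test part (labels run from 1 to 10).
print(np.bincount(y_test)[1:])
```

Note that stratify expects one label per sample, so with the real data the stacked labels would typically be flattened first, for example with y.ravel().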
