How to read images and labels from the infimnist / mnist8m dataset?

I generated some data using the program at https://leon.bottou.org/projects/infimnist.

As far as I can tell, it is in some kind of binary format:

b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05 ...

I need to extract the labels and images from two datasets like this one, generated with:

https://leon.bottou.org/projects/infimnist

with open("test10k-labels", "rb") as binary_file:
    data = binary_file.read()
    print(data)

>>> b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05 ...

b"\x00\x00\x08\x01 ...".decode('ascii')

>>> "\x00\x00\x08\x01 ..."

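I also tried the binascii package, but that did not work.

Thanks for your help!

Creating the data

To create the dataset I am talking about, download the package from https://leon.bottou.org/projects/infimnist and build it: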

$ cd dir_of_folder
$ make

Then I took the path of the infimnist executable that the build produces and ran:

$ app_path lab 10000 69999 > mnist60k-labels-idx1-ubyte

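This should put the file I used in the folder.

The command after app_path can be replaced with any of the other commands the author lists on that page.

Final update

It works! A few numpy functions bring the images back to their normal orientation.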

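from array import array
import struct

import numpy as np

# for the labels
# path is the labels file, e.g. "mnist60k-labels-idx1-ubyte"
with open(path, "rb") as binary_file:
    # skip the 8-byte header (magic number, item count) before the labels
    magic, size = struct.unpack(">II", binary_file.read(8))
    if magic != 2049:
        raise ValueError("Magic number mismatch, expected 2049, got {}".format(magic))
    y_train = np.array(array("B", binary_file.read()))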

# for the images
with open("images path", "rb") as binary_file:
    images = []
    emnistRotate = True
    magic, size, rows, cols = struct.unpack(">IIII", binary_file.read(16))
    if magic != 2051:
        raise ValueError('Magic number mismatch, expected 2051, got {}'.format(magic))
    for i in range(size):
        images.append([0] * rows * cols)
    image_data = array("B", binary_file.read())
    for i in range(size):
        images[i][:] = image_data[i * rows * cols:(i + 1) * rows * cols]

        # for some reason EMNIST is mirrored and rotated
        if emnistRotate:
            x = image_data[i * rows * cols:(i + 1) * rows * cols]

            subs = []
            for r in range(rows):
                subs.append(x[(rows - r) * cols - cols:(rows - r)*cols])

            l = list(zip(*reversed(subs)))
            fixed = [item for sublist in l for item in sublist]
            images[i][:] = fixed
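# np.rot90(np.flip(a, 1), 1) is a transpose, so this undoes the
# transposition applied by the emnistRotate branch above
x = []
for image in images:
    x.append(np.rot90(np.flip(np.array(image).reshape((28,28)), 1), 1))
x_train = np.array(x)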

A crazy solution for such a simple thing :)
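For comparison, here is a shorter sketch of the same idea using np.frombuffer. It assumes the files are plain IDX files with the headers checked above (magic 2051 plus count, rows, cols for images; magic 2049 plus count for labels) and that the pixels are stored row by row, so no rotation is needed. The function names parse_idx_images and parse_idx_labels are made up for this example; the demo input is synthetic, not real infimnist output.

```python
import struct

import numpy as np


def parse_idx_images(raw):
    """Parse IDX3 image bytes into a (count, rows, cols) uint8 array."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("Magic number mismatch, expected 2051, got {}".format(magic))
    # Everything after the 16-byte header is pixel data, one byte per pixel.
    pixels = np.frombuffer(raw, dtype=np.uint8, offset=16)
    return pixels.reshape(count, rows, cols)


def parse_idx_labels(raw):
    """Parse IDX1 label bytes into a (count,) uint8 array."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError("Magic number mismatch, expected 2049, got {}".format(magic))
    return np.frombuffer(raw, dtype=np.uint8, offset=8, count=count)


# Synthetic demo data: two 2x3 "images" and two labels.
raw_images = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))
raw_labels = struct.pack(">II", 2049, 2) + bytes([7, 2])

images = parse_idx_images(raw_images)
labels = parse_idx_labels(raw_labels)
print(images.shape, labels)
```

With real data you would pass in the bytes read from the opened file, for example the contents of mnist8m-patterns-idx3-ubyte. The arrays returned by np.frombuffer on a bytes object are read-only, so call .copy() on them if you need to modify the pixels.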

OK, so looking at the python-mnist source, it seems the correct way to unpack the binary format is as follows:

from array import array
import struct
with open("test10k-labels", "rb") as binary_file:
    # the first 8 bytes are the header: magic number and label count
    magic, size = struct.unpack(">II", binary_file.read(8))
    if magic != 2049:
        raise ValueError("Magic number mismatch, expected 2049, got {}".format(magic))
    labels = array("B", binary_file.read())
    print(labels)
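
As a sanity check on the header layout, the bytes printed in the question can be unpacked by hand. This is a minimal sketch that assumes the usual IDX convention: the first two bytes are zero, the third byte is the element type (0x08 for unsigned byte), the fourth is the number of dimensions, and each dimension follows as a big-endian 32-bit integer. The helper name parse_idx_header is made up for illustration and is not part of python-mnist.

```python
import struct


def parse_idx_header(raw):
    """Return (dtype_code, dims, offset) for an IDX byte string.

    Hypothetical helper: offset is the index of the first data byte.
    """
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX header: {!r}".format(raw[:4]))
    # One big-endian uint32 per dimension follows the 4-byte magic number.
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return dtype_code, dims, 4 + 4 * ndim


# The bytes shown in the question: header plus the first nine labels.
sample = b"\x00\x00\x08\x01\x00\x00'\x10\x07\x02\x01\x00\x04\x01\x04\t\x05"
dtype_code, dims, offset = parse_idx_header(sample)
first_labels = list(sample[offset:])
print(dtype_code, dims, first_labels)
```

Read this way, the header says the file holds 10000 unsigned-byte labels, and the bytes after it are the labels themselves as raw integers. That is why .decode('ascii') produces nothing readable.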

Update

I have not tested this extensively, but the code below should work. It is taken from the python-mnist source mentioned above.

from array import array
import struct
with open("mnist8m-patterns-idx3-ubyte", "rb") as binary_file:
    images = []
    emnistRotate = True
    magic, size, rows, cols = struct.unpack(">IIII", binary_file.read(16))
    if magic != 2051:
        raise ValueError('Magic number mismatch, expected 2051, got {}'.format(magic))
    for i in range(size):
        images.append([0] * rows * cols)
    image_data = array("B", binary_file.read())
    for i in range(size):
        images[i][:] = image_data[i * rows * cols:(i + 1) * rows * cols]

        # for some reason EMNIST is mirrored and rotated
        if emnistRotate:
            x = image_data[i * rows * cols:(i + 1) * rows * cols]

            subs = []
            for r in range(rows):
                subs.append(x[(rows - r) * cols - cols:(rows - r)*cols])

            l = list(zip(*reversed(subs)))
            fixed = [item for sublist in l for item in sublist]
            images[i][:] = fixed
    print(images)
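
Since the emnistRotate branch rearranges the pixels of every image, it is worth checking the orientation of a sample before training on it. The sketch below prints one flat, row-major image as text. The name render_ascii and the threshold of 128 are assumptions made for this example, and the tiny 3x4 input is synthetic.

```python
def render_ascii(pixels, rows, cols, threshold=128):
    """Render a flat row-major pixel list as text, '#' for bright pixels."""
    lines = []
    for r in range(rows):
        row = pixels[r * cols:(r + 1) * cols]
        lines.append("".join("#" if p >= threshold else "." for p in row))
    return "\n".join(lines)


# Synthetic 3x4 image: a bright top row with a diagonal stroke below it.
tiny = [255, 255, 255, 255,
        0, 0, 255, 0,
        0, 255, 0, 0]
picture = render_ascii(tiny, 3, 4)
print(picture)
```

Calling render_ascii(images[0], rows, cols) on the list built above should show an upright digit. If the digit comes out mirrored and lying on its side, try setting emnistRotate to False.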

Previous answer:

You can use the python-mnist library:

from mnist import MNIST
mndata = MNIST('./data')
images, labels = mndata.load_training()