CIFAR-10: randomize the train and test sets
I want to randomize the 60000 observations of the CIFAR-10 dataset that ships with the keras.datasets library. I know this may not be necessary for building a neural network, but I'm new to Python and I'd like to learn how to handle data in this language.
So, to import the dataset, I run
from keras.datasets import cifar10
(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()
This automatically gives me the default subdivision into a train set and a test set; but I'd like to mix them. The steps I have in mind are:
- concatenate the train and test sets into a dataset X of shape (60000, 32, 32, 3) and a dataset Y of shape (60000, 1)
- generate some random indices to subset the X and Y datasets into, for example, a train set of 50000 observations and a test set of 10000 observations
- create new datasets (in ndarray format) X_train, X_test, Y_train, Y_test with the same shapes as the originals, so I can start training my convolutional neural network
But maybe there is a faster way.
I've been trying different approaches for hours without success. Can anyone help me? Many thanks.
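The three steps above can be sketched directly in NumPy (a minimal sketch, not from the question; the helper name `reshuffle_split` and the seed are my own):

```python
import numpy as np

def reshuffle_split(X, Y, n_train, seed=0):
    """Randomly re-split concatenated data into new train/test sets."""
    rng = np.random.default_rng(seed)      # seeded so the split is reproducible
    idx = rng.permutation(len(X))          # step 2: random indices over all rows
    train_idx, test_idx = idx[:n_train], idx[n_train:]
    # step 3: the same index arrays subset images and labels, so pairs stay aligned
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# With CIFAR-10 (step 1 concatenates the original splits first):
# X = np.concatenate((X_train, X_test))    # (60000, 32, 32, 3)
# Y = np.concatenate((Y_train, Y_test))    # (60000, 1)
# X_train, X_test, Y_train, Y_test = reshuffle_split(X, Y, n_train=50000)
```

Indexing `X` and `Y` with the same permuted index array is what keeps each image attached to its label.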
You can use sklearn.model_selection.train_test_split to split the data. If you want the same random selection of indices each time you run the code, you can set the random_state value and you will get the same test/train split every time.
from keras.datasets import cifar10
(X_train, Y_train), (X_test, Y_test) = cifar10.load_data()
# View first image
import matplotlib.pyplot as plt
plt.imshow(X_train[0])
plt.show()
import numpy as np
from sklearn.model_selection import train_test_split
# Concatenate train and test images
X = np.concatenate((X_train,X_test))
y = np.concatenate((Y_train,Y_test))
# Check shape
print(X.shape) # (60000, 32, 32, 3)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000, random_state=1234)
# Check shape
print(X_train.shape) # (50000, 32, 32, 3)
# View first image
plt.imshow(X_train[0])
plt.show()
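If you also want each new split to contain the ten CIFAR-10 classes in equal proportion, `train_test_split` accepts a `stratify` argument. A small sketch with dummy stand-in arrays (same layout as CIFAR-10, fewer rows, so no download is needed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins: 100 "images" and 10 classes with 10 samples each
X = np.arange(100 * 4).reshape(100, 2, 2, 1)
y = np.repeat(np.arange(10), 10).reshape(100, 1)

# stratify keeps the class ratios identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=20, random_state=1234, stratify=y.ravel())

print(np.bincount(y_te.ravel()))  # every class appears exactly twice in the test set
```

Without `stratify`, a purely random split can leave some classes slightly over- or under-represented.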
Here is the complete demonstration you asked for. First we download the data and randomize it once, then take the first 50K for training and the remaining 10K for validation.
In [21]: import tensorflow
In [22]: import tensorflow.keras.datasets as datasets
In [23]: cifar10 = datasets.cifar10.load_data()
In [24]: (X_train, Y_train), (X_test, Y_test) = datasets.cifar10.load_data()
In [25]: X_train.shape, Y_train.shape
Out[25]: ((50000, 32, 32, 3), (50000, 1))
In [26]: X_test.shape, Y_test.shape
Out[26]: ((10000, 32, 32, 3), (10000, 1))
In [27]: import numpy as np
In [28]: X, Y = np.vstack((X_train, X_test)), np.vstack((Y_train, Y_test))
In [29]: X.shape, Y.shape
Out[29]: ((60000, 32, 32, 3), (60000, 1))
In [30]: # Shuffle the images and labels together along axis 0
    ...: def shuffle_train_data(X_train, Y_train):
    ...:     """Apply the same random permutation to examples and labels."""
    ...:     perm = np.random.permutation(len(Y_train))
    ...:     Xtr_shuf = X_train[perm]
    ...:     Ytr_shuf = Y_train[perm]
    ...:
    ...:     return Xtr_shuf, Ytr_shuf
In [31]: X_shuffled, Y_shuffled = shuffle_train_data(X, Y)
In [32]: (X_train_new, Y_train_new) = X_shuffled[:50000, ...], Y_shuffled[:50000, ...]
In [33]: (X_test_new, Y_test_new) = X_shuffled[50000:, ...], Y_shuffled[50000:, ...]
In [34]: X_train_new.shape, Y_train_new.shape
Out[34]: ((50000, 32, 32, 3), (50000, 1))
In [35]: X_test_new.shape, Y_test_new.shape
Out[35]: ((10000, 32, 32, 3), (10000, 1))
We have a function shuffle_train_data that shuffles the data consistently, keeping each example and its label paired together.
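Note that `np.random.permutation` draws from NumPy's global random state, so every run produces a different split. If you want this approach to be repeatable the way `random_state` makes `train_test_split` repeatable, you can seed a generator first (a small sketch, not part of the original answer):

```python
import numpy as np

def shuffle_train_data(X, Y, seed=42):
    """Shuffle examples and labels with the same, seeded permutation."""
    rng = np.random.default_rng(seed)  # seeded Generator: same order every run
    perm = rng.permutation(len(Y))
    return X[perm], Y[perm]

X = np.arange(12).reshape(6, 2)
Y = np.arange(6).reshape(6, 1)
X1, Y1 = shuffle_train_data(X, Y)
X2, Y2 = shuffle_train_data(X, Y)
print((X1 == X2).all() and (Y1 == Y2).all())  # identical across calls
```
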