Using Keras with HDF5Matrix with labels only
I believe this is my first question on Stack Overflow, so I apologize in advance if I haven't followed all the guidelines.
I recently started using Keras for deep learning, and since I handle HDF5 files with h5py to manage large datasets, I searched for a way to train Keras models on very large HDF5 files. I found that the most common approach is to use the HDF5Matrix class from keras.utils.io_utils.
I modified one of the Keras examples (mnist_cnn) as follows:
'''Trains a simple convnet on the MNIST dataset.
Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
# My Imports
from os.path import exists
import h5py
from keras.utils.io_utils import HDF5Matrix

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# ----------------------------------- HDF5 files creation ---------------------------------------
sample_file_name = "x.hdf5"
solution_file_name = "y.hdf5"
train_name = "train"
test_name = "test"

# Create dataset
if (not exists(sample_file_name)) and (not exists(solution_file_name)):
    samples_file = h5py.File(sample_file_name, mode='a')
    solutions_file = h5py.File(solution_file_name, mode='a')
    samples_train = samples_file.create_dataset(train_name, data=x_train)
    samples_test = samples_file.create_dataset(test_name, data=x_test)
    solution_train = solutions_file.create_dataset(train_name, data=y_train)
    solution_test = solutions_file.create_dataset(test_name, data=y_test)
    samples_file.flush()
    samples_file.close()
    solutions_file.flush()
    solutions_file.close()

x_train = HDF5Matrix(sample_file_name, train_name)
x_test = HDF5Matrix(sample_file_name, test_name)
y_train = HDF5Matrix(solution_file_name, train_name)
y_test = HDF5Matrix(solution_file_name, test_name)
# ------------------------------------------------------------------------------------------------

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# If using HDF5Matrix one needs to disable shuffle
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          shuffle=False)

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
However, there is something that concerns me. In segmentation problems / multi-class problems where the number of classes is very large, saving the solutions in categorical format is very wasteful: a one-hot mask with, say, 100 classes takes 100 times the space of the equivalent integer-label mask. Moreover, doing that means that as soon as you add a new class, the whole dataset has to be changed accordingly.
That is why I thought of using the normalizer feature of HDF5Matrix, like this:
'''Trains a simple convnet on the MNIST dataset.
Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
# My Imports
from os.path import exists
import h5py
from keras.utils.io_utils import HDF5Matrix

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# ----------------------------------- HDF5 files creation ---------------------------------------
sample_file_name = "x.hdf5"
solution_file_name = "y.hdf5"
train_name = "train"
test_name = "test"

# Create dataset
if (not exists(sample_file_name)) and (not exists(solution_file_name)):
    samples_file = h5py.File(sample_file_name, mode='a')
    solutions_file = h5py.File(solution_file_name, mode='a')
    samples_train = samples_file.create_dataset(train_name, data=x_train)
    samples_test = samples_file.create_dataset(test_name, data=x_test)
    solution_train = solutions_file.create_dataset(train_name, data=y_train)
    solution_test = solutions_file.create_dataset(test_name, data=y_test)
    samples_file.flush()
    samples_file.close()
    solutions_file.flush()
    solutions_file.close()

x_train = HDF5Matrix(sample_file_name, train_name)
x_test = HDF5Matrix(sample_file_name, test_name)
y_train = HDF5Matrix(solution_file_name, train_name,
                     normalizer=lambda solution: keras.utils.to_categorical(solution, num_classes))
y_test = HDF5Matrix(solution_file_name, test_name,
                    normalizer=lambda solution: keras.utils.to_categorical(solution, num_classes))
# ------------------------------------------------------------------------------------------------

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# If using HDF5Matrix one needs to disable shuffle
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          shuffle=False)

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
However, this raises an error that implies the shapes of the solutions should match, and that a normalizer shouldn't be used this way:
ValueError: Error when checking target: expected dense_2 to have 2 dimensions, but got array with shape (60000, 1, 10)
So, is there a way to save the data in HDF5 (or, if that is impossible, in some other format) and use Keras in a way that keeps the labels as-is (instead of categorical vectors), without turning this into a regression problem?
You are getting this error because Keras checks input shapes before training. The problem is that HDF5Matrix returns the pre-normalized shape if you call .shape, so Keras believes you have a (60000,) array for y_train and a (10000,) one for y_test.
However, when accessing a slice of the matrix, the normalizer is applied, so that, for example, y_train[5:7].shape does have the final expected shape: (2, 10).
This is mainly because the normalizer isn't really expected to change the shape, but Keras could actually handle that case.
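You can see the mismatch directly; a minimal sketch, assuming the y.hdf5 file written by the script above already exists:

import keras
from keras.utils.io_utils import HDF5Matrix

num_classes = 10
y_train = HDF5Matrix("y.hdf5", "train",
                     normalizer=lambda s: keras.utils.to_categorical(s, num_classes))

# The shape Keras inspects before training ignores the normalizer...
print(y_train.shape)       # (60000,)
# ...but slicing applies it, so a batch has the shape you actually wanted
print(y_train[5:7].shape)  # (2, 10)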
You can fix it by using fit_generator instead of fit, so that training only ever sees normalized data:
def generator(features, labels, size):
    while True:
        start, end = 0, size
        while end < len(features):
            s = slice(start, end)
            # you can actually do the normalization here if you want
            yield features[s], labels[s]
            start, end = end, end + size

model.fit_generator(
    generator(x_train, y_train, batch_size),
    steps_per_epoch=len(x_train) // batch_size,
    epochs=1,
    verbose=1,
    validation_data=generator(x_test, y_test, batch_size),
    validation_steps=len(x_test) // batch_size,
    shuffle=False)
Note that you could do any kind of normalization inside the generator function, and that it will be transparent to Keras. You can also use different batch sizes for training and validation.
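For example, here is a sketch of that idea. It assumes y_train and y_test are plain HDF5Matrix datasets holding the integer labels (i.e. created without a normalizer), and does the one-hot encoding per batch inside the generator:

def categorical_generator(features, labels, size, num_classes):
    while True:
        start, end = 0, size
        while end < len(features):
            s = slice(start, end)
            # one-hot encode just this batch; nothing stored on disk changes
            yield features[s], keras.utils.to_categorical(labels[s], num_classes)
            start, end = end, end + size

model.fit_generator(
    categorical_generator(x_train, y_train, batch_size, num_classes),
    steps_per_epoch=len(x_train) // batch_size,
    epochs=epochs,
    verbose=1,
    validation_data=categorical_generator(x_test, y_test, batch_size, num_classes),
    validation_steps=len(x_test) // batch_size)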
You also have to change the evaluation in the same way:
score = model.evaluate_generator(
    generator(x_test, y_test, batch_size),
    steps=len(x_test) // batch_size)
By the way, I think your solution using a normalizer is a good idea.
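Alternatively, if your Keras version provides sparse_categorical_crossentropy, you can skip one-hot encoding altogether and let the loss consume integer labels directly. A sketch, assuming the plain (normalizer-free) HDF5Matrix label datasets and the generator defined above:

model.compile(loss=keras.losses.sparse_categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# Reuse the generator unchanged: it now yields integer labels per batch,
# and the sparse loss matches them against the softmax output.
model.fit_generator(
    generator(x_train, y_train, batch_size),
    steps_per_epoch=len(x_train) // batch_size,
    epochs=epochs,
    verbose=1,
    validation_data=generator(x_test, y_test, batch_size),
    validation_steps=len(x_test) // batch_size)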