Passing dataframe to keras sequential model
I am trying to build and train a simple MLP model using keras.Sequential().
However, I run into problems when, after each training epoch, I try to evaluate the current state of the model on the training and test data.
I have this problem on several different datasets; one of them is the "CAR DETAILS FROM CAR DEKHO" dataset, which you can find here.
This is what I have so far:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split


def main():
    ## read, preprocess and split data
    df_data = pd.read_csv('car_data_CAR_DEKHO.csv')
    df_data = pre_process(df_data)
    X_train, y_train, X_test, y_test = split_data(df_data)  ## -> these are PANDAS DATAFRAMES!
    train(X_train, X_test, y_train, y_test)


def train(X_train, X_test, y_train, y_test):
    ##--------------------
    ## building model
    ##--------------------
    batch = 5000
    epochs = 500
    lr = 0.001
    data_iter = load_array((X_train, y_train), batch)
    initializer = tf.initializers.RandomNormal(stddev=0.01)
    net = tf.keras.Sequential()
    net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))
    loss = tf.keras.losses.MeanSquaredError()
    trainer = tf.keras.optimizers.SGD(learning_rate=lr)
    ##--------------#
    ## training     #
    ##--------------#
    for epoch in range(1, epochs + 1):
        for X_batch, Y_batch in data_iter:
            with tf.GradientTape() as tape:
                l = loss(net(X_batch, training=True), Y_batch)
            grads = tape.gradient(l, net.trainable_variables)
            trainer.apply_gradients(zip(grads, net.trainable_variables))
        # test on train set after epoch
        y_pred_train = net(X_train)  ## ERROR HERE!!!
        l_train = loss(y_pred_train, y_train)
        y_pred_test = net(X_test)
        l_test = loss(y_pred_test, y_test)


def load_array(data_arrays, batch_size, is_train=True):
    """Construct a TensorFlow data iterator."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset


def split_data(df_data):
    X = df_data.copy()
    y = X.pop('selling_price')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, y_train, X_test, y_test


def pre_process(df_data):
    ## check NaNs and drop rows if any
    print(df_data.isnull().sum())
    df_data.dropna(inplace=True)
    ## drop weird outlier, turns out it has 1 km_driven
    df_data.drop([1312], inplace=True)
    ## features engineering
    df_data['name'] = df_data['name'].map(lambda x: x.split(' ')[0])
    df_data['owner'] = df_data['owner'].map(lambda x: x.split(' ')[0])
    df_data['selling_price'] = df_data['selling_price'] / 1000
    df_data_dummies = pd.get_dummies(df_data, drop_first=True)
    df_data_dummies = normalize(df_data_dummies)  ## simple min-max scaling; I do it manually, but you can use sklearn or similar
    return df_data_dummies


def normalize(df):
    print('Data normalization:')
    result = df.copy()
    for feature_name in df.columns:
        if feature_name == 'selling_price':
            pass  # do not scale the target
        else:
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
            if result[feature_name].isnull().values.any():
                result.drop([feature_name], axis=1, inplace=True)
                print(f'Something wrong in {feature_name}, dropped.')
                print(f'now shape is {len(result)}, {len(result.columns)}')
    print(f'\treturning {len(result)}, {len(result.columns)}')
    return result


if __name__ == '__main__':
    main()
I get the following error message:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 232, in assert_input_compatibility
ndim = x.shape.rank
AttributeError: 'tuple' object has no attribute 'rank'
I guess this error is due to me passing X_train (which is a dataframe) directly to the network.
I also tried using:
y_pred_train = net(tf.data.Dataset.from_tensor_slices(X_train))
as when creating the training batches, but it returns another error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 201, in assert_input_compatibility
raise TypeError('Inputs to a layer should be tensors. Got: %s' % (x,))
TypeError: Inputs to a layer should be tensors. Got: <TensorSliceDataset shapes: (19,), types: tf.float64>
Finally, I tried using:
y_pred_train = net.predict(X_train)
In this case, strangely enough, I get an OOM error referring to a tensor with shape [76571,76571]:
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[76571,76571] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:SquaredDifference]
But the X_train dataframe has shape (76571, 19), so I don't understand what is going on.
What is the correct way to do this?
Your code mostly looks fine; the problem must be in the data that you pass.
Check the contents and dtypes of the data you feed in.
Try converting your pandas slices into np.arrays, re-check their dimensions, and then feed the np.arrays to load_array().
Also try smaller batches, like 64 (not 5000), as in the sketch below.
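A minimal sketch of that suggestion (names taken from the question; the float32 cast and the batch size of 64 are assumptions):

    X_train_np = X_train.to_numpy(dtype='float32')   # pandas DataFrame -> np.array
    y_train_np = y_train.to_numpy(dtype='float32')
    print(X_train_np.shape, y_train_np.shape)        # re-check the dimensions
    data_iter = load_array((X_train_np, y_train_np), batch_size=64)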
UPDATE:
Apparently, when you pass X_batch to the model you are passing a tf.Tensor, but later, when you pass the whole X_train or X_test, you pass pd.DataFrames and the model gets confused.
You should change just 2 lines:
y_pred_train = net(tf.constant(X_train)) # pass TF.tensor - best
#alternative:
y_pred_train = net(X_train.values) # pass np.array - also good
y_pred_test = net(tf.constant(X_test)) # make similar change here
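In context, the end-of-epoch evaluation then reads as follows (a sketch; wrapping the targets in tf.constant as well is an addition for symmetry, not strictly required):

    # end-of-epoch evaluation, with explicit DataFrame -> tensor conversion
    y_pred_train = net(tf.constant(X_train))
    l_train = loss(y_pred_train, tf.constant(y_train))
    y_pred_test = net(tf.constant(X_test))
    l_test = loss(y_pred_test, tf.constant(y_test))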
The problem looks data-related (as Poe Dator says). What I believe is happening is that your network takes on an input shape based on the batches of data it receives. When you then call or predict on data of a different shape (which also recomputes shapes, since it calls the build() function), it tries to coerce that data into the shape it expects. Specifically, I think it expects a shape of (batch, 1, 19), but your data comes in as (76571, 19), and it cannot find the right shape.
A few simple steps to resolve this are (a sketch of steps 1, 2 and 4 follows the list):
- Call net.summary() to see what shapes it thinks it is getting, before and after training.
- Give the first layer an input shape: net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer, input_shape=(1, 19)))
- Slice your X data into the same shape as your training data.
- Add a dimension to your data so it is (76571, 1, 19), to shape it explicitly.
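A minimal sketch of steps 1, 2 and 4, assuming 19 features (the random array is a stand-in for the question's X_train):

    import numpy as np
    import tensorflow as tf

    X_train = np.random.rand(100, 19).astype('float32')  # stand-in data

    net = tf.keras.Sequential()
    net.add(tf.keras.layers.Dense(1, input_shape=(1, 19)))  # step 2: explicit input shape
    net.summary()                                           # step 1: inspect expected shapes

    X_expanded = np.expand_dims(X_train, axis=1)  # step 4: (n, 19) -> (n, 1, 19)
    y_pred = net(X_expanded)                      # output shape (n, 1, 1)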
Also, as noted above, smaller batch sizes are best. If you do not have much experience with tensorflow, I would also suggest using the model.fit() method rather than handling the gradients yourself. This saves you code and makes it easier to be sure that you are handling your model correctly during training.
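For reference, a minimal sketch of the same model trained via compile()/fit() (hyperparameters taken from the question; the batch size of 64 and the validation_data argument are assumptions):

    import tensorflow as tf

    # X_train, y_train, X_test, y_test as returned by split_data() in the question
    net = tf.keras.Sequential([
        tf.keras.layers.Dense(1,
                              kernel_initializer=tf.initializers.RandomNormal(stddev=0.01),
                              input_shape=(19,))
    ])
    net.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                loss=tf.keras.losses.MeanSquaredError())

    # fit() accepts numpy arrays directly and reports train/validation loss every epoch
    history = net.fit(X_train.values, y_train.values,
                      batch_size=64, epochs=500,
                      validation_data=(X_test.values, y_test.values))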