Why doesn't the combination of model subclassing and TFRecord work?

Question

Short version of the question

Why does training fail when I try to train a model implemented through subclassing (in Keras) on a dataset that was saved to and loaded from a TFRecord file?

Full version of the question

I have the following model (let's first define it with the functional API):

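import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Input, layers, models

def get_model():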
    input_layer = Input(shape=(6,), name="input")

    x = input_layer

    x = layers.Dense(128, activation='relu', name="dense_1")(x)
    x = layers.Dense(1024, activation='relu', name="dense_2")(x)
    x = layers.Dense(5120, activation='relu', name="dense_3")(x)

    a_out = layers.Dense(17, activation='softmax', name='a_out')(x)
    b_out = layers.Dense(27, activation='softmax', name='b_out')(x)
    c_out = layers.Dense(71, activation='softmax', name='c_out')(x)
    d_out = layers.Dense(29, activation='softmax', name='d_out')(x)

    model = models.Model(input_layer, [a_out, b_out, c_out, d_out])

    model.compile(optimizer='rmsprop',
                  loss=('sparse_categorical_crossentropy',
                        'sparse_categorical_crossentropy',
                        'sparse_categorical_crossentropy',
                        'sparse_categorical_crossentropy'))
    
    return model

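It takes a tensor of shape (6,) and produces four different outputs: a_out, b_out, c_out, and d_out. Each of them is an integer (a categorical output). Next, I define a dummy/random dataset to train this model on: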

sample_count = 1000
inputs = np.random.rand(sample_count, 6).astype(np.float32)
targets = (
    np.random.randint(low=0, high=16, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=26, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=70, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=28, size=(sample_count,), dtype=np.int64)
)
random_dataset = tf.data.Dataset.from_tensor_slices((inputs, targets))

for rec in random_dataset:
    print(rec)
    break

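If you call the functional API model's fit method and feed it this dataset, it trains just fine. Also, the print statement in the previous code block outputs something like this: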

(<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([0.326234  , 0.9935627 , 0.65569717, 0.05908937, 0.7490394 ,
       0.7929646 ], dtype=float32)>, (<tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([60])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([9])>))
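Beyond printing one record, the dataset's element_spec reports the static shapes that Keras sees when it traces the training step. A minimal sketch, assuming a much smaller sample_count than in the question (8 is an arbitrary value chosen for illustration):

```python
import numpy as np
import tensorflow as tf

sample_count = 8  # assumption: scaled down from the question's 1000
inputs = np.random.rand(sample_count, 6).astype(np.float32)
targets = (
    np.random.randint(low=0, high=16, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=26, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=70, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=28, size=(sample_count,), dtype=np.int64),
)
random_dataset = tf.data.Dataset.from_tensor_slices((inputs, targets))

# element_spec mirrors the (input, (a, b, c, d)) nesting of the dataset.
input_spec, target_specs = random_dataset.element_spec
print(input_spec)             # static shape and dtype of the input
print(input_spec.shape.rank)  # static rank of the input
```

For a dataset built from in-memory arrays, the input's static shape is fully known, which is worth comparing against any dataset that misbehaves during training.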

Now, let's save and load the same dataset using TFRecord:

# Saving the random dataset into a TFRecord file
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    # BytesList won't unpack a string from an EagerTensor, so convert it to bytes first.
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() 
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

file_path = 'random.tfrec'
with tf.io.TFRecordWriter(file_path) as writer:
    for rec in random_dataset:
        feature = {
            'input': _bytes_feature(tf.io.serialize_tensor(rec[0])),
            'a_out': _int64_feature(rec[1][0]),
            'b_out': _int64_feature(rec[1][1]),
            'c_out': _int64_feature(rec[1][2]),
            'd_out': _int64_feature(rec[1][3]),
        }

        example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example_proto.SerializeToString())

# Load the dataset off the file just created
def read_tfrecord(serialized_example):
    feature_description = {
        'input': tf.io.FixedLenFeature((), tf.string),
        'a_out': tf.io.FixedLenFeature((), tf.int64),
        'b_out': tf.io.FixedLenFeature((), tf.int64),
        'c_out': tf.io.FixedLenFeature((), tf.int64),
        'd_out': tf.io.FixedLenFeature((), tf.int64)
    }

    example = tf.io.parse_single_example(serialized_example, feature_description)

    return tf.io.parse_tensor(example['input'], out_type=tf.float32), (
        example["a_out"],
        example["b_out"],
        example["c_out"],
        example["d_out"])

tfrecord_dataset = tf.data.TFRecordDataset(file_path).map(read_tfrecord)

for rec in tfrecord_dataset:
    print(rec)
    break

The final print statement is just a sanity check to make sure the dataset was not distorted by serialization. It outputs something like:

(<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([0.326234  , 0.9935627 , 0.65569717, 0.05908937, 0.7490394 ,
       0.7929646 ], dtype=float32)>, (<tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([60])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([9])>))

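This is identical to the original dataset in every respect. If I feed this tfrecord_dataset to the functional API model, it still trains fine. Next, I define the same model using inheritance (a.k.a. subclassing):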

class SubclassModel(keras.Model):
    def __init__(self):
        super(SubclassModel, self).__init__()

        self.d1 = layers.Dense(128, activation='relu', name="dense_1")
        self.d2 = layers.Dense(1024, activation='relu', name="dense_2")
        self.d3 = layers.Dense(5120, activation='relu', name="dense_3")

        self.a_out = layers.Dense(17, activation='softmax', name='a_out')
        self.b_out = layers.Dense(27, activation='softmax', name='b_out')
        self.c_out = layers.Dense(71, activation='softmax', name='c_out')
        self.d_out = layers.Dense(29, activation='softmax', name='d_out')

        self.build((None, 6,))
        self.compile(optimizer='rmsprop',
                     loss=('sparse_categorical_crossentropy',
                           'sparse_categorical_crossentropy',
                           'sparse_categorical_crossentropy',
                           'sparse_categorical_crossentropy'))

    def call(self, inputs, training=True):
        x = inputs
        
        x = self.d1(x)
        x = self.d2(x)
        x = self.d3(x)
        
        a = self.a_out(x)
        b = self.b_out(x)
        c = self.c_out(x)
        d = self.d_out(x)

        return a, b, c, d

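Here's the punchline. I now have two different ways of creating the model (the functional API and inheritance) and two different datasets (random_dataset and tfrecord_dataset). That makes four different combinations: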

Training the functional API model with random_dataset: works fine
Training the functional API model with tfrecord_dataset: works fine
Training SubclassModel with random_dataset: works fine
Training SubclassModel with tfrecord_dataset: fails!

This is the error I get (truncated):

TypeError: in user code:

    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/engine/training.py", line 808, in train_step
        y_pred = self(x, training=True)
    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None

    TypeError: Exception encountered when calling layer "subclass_model_1" (type SubclassModel).
    
    in user code:
    
        File "/tmp/ipykernel_22298/1542980101.py", line 28, in call  *
            a = self.a_out(x)
        File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler  **
            raise e.with_traceback(filtered_tb) from None
        File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/activations.py", line 78, in softmax
            if x.shape.rank > 1:
    
        TypeError: Exception encountered when calling layer "a_out" (type Dense).
        
        '>' not supported between instances of 'NoneType' and 'int'
        
        Call arguments received:
          • inputs=tf.Tensor(shape=<unknown>, dtype=float32)
    
    
    Call arguments received:
      • inputs=tf.Tensor(shape=<unknown>, dtype=float32)
      • training=True

Does anyone know what I'm doing wrong?

Answer 1

For anyone else who might face the same problem, the solution is to reshape the tensors to match their expected shape when reading the TFRecords:

def read_tfrecord(serialized_example):
    feature_description = {
        'input': tf.io.FixedLenFeature((), tf.string),
        'a_out': tf.io.FixedLenFeature((), tf.int64),
        'b_out': tf.io.FixedLenFeature((), tf.int64),
        'c_out': tf.io.FixedLenFeature((), tf.int64),
        'd_out': tf.io.FixedLenFeature((), tf.int64)
    }

    example = tf.io.parse_single_example(serialized_example, feature_description)

    return tf.reshape(tf.io.parse_tensor(example['input'], out_type=tf.float32), (6,)), (
        example["a_out"],
        example["b_out"],
        example["c_out"],
        example["d_out"])
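The reshape helps because tf.io.parse_tensor cannot tell, while the input pipeline is being traced, what shape the decoded tensor will have. Printing a record eagerly shows its concrete shape, but the dataset's static shape is unknown, which matches the shape=<unknown> reported in the traceback. A minimal sketch of that difference, assuming a handful of in-memory serialized vectors in place of the TFRecord file:

```python
import numpy as np
import tensorflow as tf

# Assumption: four random vectors stand in for the records in random.tfrec.
vectors = np.random.rand(4, 6).astype(np.float32)
serialized = [tf.io.serialize_tensor(v).numpy() for v in vectors]
ds = tf.data.Dataset.from_tensor_slices(serialized)

# Without a reshape, the static shape of each element is unknown.
unknown = ds.map(lambda s: tf.io.parse_tensor(s, out_type=tf.float32))
print(unknown.element_spec)

# With the reshape from the answer, the static shape is pinned to (6,).
fixed = ds.map(
    lambda s: tf.reshape(tf.io.parse_tensor(s, out_type=tf.float32), (6,))
)
print(fixed.element_spec)

# The decoded values are the same in both cases.
first = next(iter(fixed)).numpy()
```

tf.ensure_shape(tensor, (6,)) is a possible alternative that asserts the shape instead of rearranging the data; that substitution is a suggestion here, not part of the original answer.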

Why the functional API did not complain about this while the subclassed model did, I cannot understand.
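The traceback does at least show where the subclassed model trips: the softmax activation evaluates x.shape.rank > 1, and a tensor with an unknown static shape has a rank of None. A likely explanation for the difference, which the source does not confirm, is that the functional model declares Input(shape=(6,)) and so has a static shape to fall back on, while the subclassed model's call only sees the shape the dataset reports. The failing comparison can be sketched in isolation (rank_check is a hypothetical helper written for this illustration, not a Keras function):

```python
import tensorflow as tf

# Static shape of a tensor produced by tf.io.parse_tensor inside Dataset.map.
unknown_shape = tf.TensorShape(None)
# Static shape of a batch once each element has been reshaped to (6,).
known_shape = tf.TensorShape([None, 6])

def rank_check(shape):
    """Mimics the `x.shape.rank > 1` test that appears in the traceback."""
    try:
        return shape.rank > 1
    except TypeError as err:
        return f"TypeError: {err}"

print(rank_check(known_shape))    # True
print(rank_check(unknown_shape))  # TypeError: '>' not supported between instances of 'NoneType' and 'int'
```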
