Using Tensorflow's Connectionist Temporal Classification (CTC) implementation
I'm trying to use Tensorflow's CTC implementation under the contrib package (tf.contrib.ctc.ctc_loss) without success.
- First of all, does anyone know where I can read a good step-by-step tutorial? Tensorflow's documentation is very poor on this topic.
- Do I have to provide to ctc_loss the labels with the blank label interleaved or not?
- I could not overfit my network even using a training dataset of length 1 over 200 epochs. :(
- How can I calculate the label error rate using tf.edit_distance?
Here is my code:
with graph.as_default():
    max_length = X_train.shape[1]
    frame_size = X_train.shape[2]
    max_target_length = y_train.shape[1]

    # Batch size x time steps x data width
    data = tf.placeholder(tf.float32, [None, max_length, frame_size])
    data_length = tf.placeholder(tf.int32, [None])

    # Batch size x max_target_length
    target_dense = tf.placeholder(tf.int32, [None, max_target_length])
    target_length = tf.placeholder(tf.int32, [None])

    # Generating sparse tensor representation of target
    target = ctc_label_dense_to_sparse(target_dense, target_length)

    # Applying LSTM, returning output for each timestep (y_rnn1,
    # [batch_size, max_time, cell.output_size]) and the final state of shape
    # [batch_size, cell.state_size]
    y_rnn1, h_rnn1 = tf.nn.dynamic_rnn(
        tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True, num_proj=num_classes),  # num_proj=num_classes
        data,
        dtype=tf.float32,
        sequence_length=data_length,
    )

    # For sequence labelling, we want a prediction for each timestep.
    # However, we share the weights for the softmax layer across all timesteps.
    # How do we do that? By flattening the first two dimensions of the output tensor.
    # This way time steps look the same as examples in the batch to the weight matrix.
    # Afterwards, we reshape back to the desired shape

    # Reshaping
    logits = tf.transpose(y_rnn1, perm=(1, 0, 2))

    # Get the loss by calculating ctc_loss. Also calculates the gradient.
    # This op performs the softmax operation for you, so inputs should be
    # e.g. linear projections of outputs by an LSTM.
    loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))

    # Define our optimizer with learning rate
    optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)

    # Decoding using beam search
    decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(logits, data_length, beam_width=10, top_paths=1)
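(ctc_label_dense_to_sparse is a small helper of mine that I haven't included above. For reference, a rough NumPy-only sketch of the idea, building the (indices, values, shape) triple of the sparse targets outside the graph, could look something like the following; sparse_tuple_from is just an illustrative name, not the actual helper.)

import numpy as np

def sparse_tuple_from(sequences, dtype=np.int32):
    # Builds the (indices, values, shape) triple of a sparse representation
    # from a list of label sequences (blanks are NOT included in the labels).
    indices, values = [], []
    for n, seq in enumerate(sequences):
        indices.extend(zip([n] * len(seq), range(len(seq))))
        values.extend(seq)
    indices = np.asarray(indices, dtype=np.int64)
    values = np.asarray(values, dtype=dtype)
    # shape = [batch_size, max_label_length]
    shape = np.asarray([len(sequences), indices.max(0)[1] + 1], dtype=np.int64)
    return indices, values, shape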
Thanks!
Update (06/29/2016)
Thank you, @jihyeon-seo! So, we have at the input of the RNN something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection in each timestep with weight sharing, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], sum a bias, undo the reshape, and transpose (so we will have [max_time_steps, num_batch, num_classes] for the ctc loss input), and this result will be the input of the ctc_loss function. Am I doing everything correctly?
Here is the code:
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
# Reshaping to share weights across timesteps
x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
# Reshaping
self._logits = tf.reshape(self._logits, [max_length, -1, num_classes])
# Calculating loss
loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
self.cost = tf.reduce_mean(loss)
Update (07/11/2016)
Thank you, @Xiv. Here is the code after the bug fixes:
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)
# Reshaping to share weights across timesteps
x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])
self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1
# Reshaping
self._logits = tf.reshape(self._logits, [-1, max_length, num_classes])
self._logits = tf.transpose(self._logits, (1,0,2))
# Calculating loss
loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)
self.cost = tf.reduce_mean(loss)
Update (07/25/16)
I published part of my code on GitHub, working with a single utterance. Feel free to use it! :)
I'm trying to do the same thing. Here is what I found that you may be interested in.
It was really hard to find a tutorial for CTC, but this example was helpful.
And for the blank label, the CTC layer assumes that the blank index is num_classes - 1, so you need to provide an additional class for the blank label.
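For example, just to illustrate the convention (the numbers below are made up):

num_labels = 27               # e.g. 26 letters + space, purely illustrative
num_classes = num_labels + 1  # one extra class reserved for the CTC blank
blank_index = num_classes - 1 # implicit convention of the contrib CTC ops

# The sparse targets you feed to ctc_loss contain only indices 0 .. num_labels - 1;
# you do NOT interleave the blank yourself, the loss handles it internally.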
Also, the CTC net performs the softmax layer for you. In your code, the RNN layer is connected directly to the CTC loss layer. The output of the RNN layer is internally activated, so you need to add one more hidden layer (it could be the output layer) without an activation function, and then add the CTC loss layer.
For an example with bidirectional LSTM, CTC, and edit distance implementations, training a phoneme recognition model on the TIMIT corpus, see here. If you train on that corpus's training set, you should be able to get the phoneme error rate down to 20-25% after 120 epochs or so.
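As for tf.edit_distance: after decoding, the label error rate is usually just the mean (normalized) edit distance between the best decoded path and the sparse ground truth. A rough sketch using the names from your graph (the decoder returns int64 values, hence the cast; this is only an outline, not tested against your code):

decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(logits, data_length, beam_width=10, top_paths=1)
# decoded[0] is a SparseTensor holding the best path for each example in the batch
label_error_rate = tf.reduce_mean(
    tf.edit_distance(tf.to_int32(decoded[0]), target))  # normalize=True by default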