Tensorflow: Can't understand ctc_beam_search_decoder() output sequence
I am using Tensorflow's tf.nn.ctc_beam_search_decoder() to decode the output of an RNN doing some many-to-many mapping (i.e., multiple softmax outputs for each network cell).
A simplified version of the network's output and the beam search decoder is:
import numpy as np
import tensorflow as tf

batch_size = 4
sequence_max_len = 5
num_classes = 3

y_pred = tf.placeholder(tf.float32, shape=(batch_size, sequence_max_len, num_classes))
y_pred_transposed = tf.transpose(y_pred,
                                 perm=[1, 0, 2])  # TF expects dimensions [max_time, batch_size, num_classes]
logits = tf.log(y_pred_transposed)
sequence_lengths = tf.to_int32(tf.fill([batch_size], sequence_max_len))
decoded, log_probabilities = tf.nn.ctc_beam_search_decoder(logits,
                                                           sequence_length=sequence_lengths,
                                                           beam_width=3,
                                                           merge_repeated=False, top_paths=1)
decoded = decoded[0]
decoded_paths = tf.sparse_tensor_to_dense(decoded)  # Shape: [batch_size, max_sequence_len]

with tf.Session() as session:
    tf.global_variables_initializer().run()

    softmax_outputs = np.array([[[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
                                [[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]])

    decoded_paths = session.run(decoded_paths, feed_dict={y_pred: softmax_outputs})
    print(decoded_paths)
The output in this case is:
[[0]
[1]
[1]
[1]]
My understanding is that the output tensor should be of dimensions [batch_size, max_sequence_len], with each row containing the indices of the relevant classes in the found path.
In this case I would expect the output to be similar to:
[[2, 0, 0, 0, 0],
[2, 2, 2, 2, 2],
[1, 2, 2, 2, 2],
[2, 2, 2, 2, 2]]
What am I not understanding about how ctc_beam_search_decoder works?
As indicated in the tf.nn.ctc_beam_search_decoder documentation, the shape of the output is not [batch_size, max_sequence_len]. Instead, it is
[batch_size, max_decoded_length[j]]
(with j=0 in your case).
Based on the beginning of Section 2 of this paper (which is cited in the github repository), max_decoded_length[0] is bounded from above by max_sequence_len, but they are not necessarily equal. The relevant quote is:
Let S be a set of training examples drawn from a fixed distribution
D_{XxZ}. The input space X = (R^m) is the set of all sequences of m
dimensional real valued vectors. The target space Z = L* is the set of
all sequences over the (finite) alphabet L of labels. In general, we
refer to elements of L* as label sequences or labellings. Each example
in S consists of a pair of sequences (x, z). The target sequence z =
(z1, z2, ..., zU) is at most as long as the input sequence x = (x1,
x2, ..., xT ), i.e. U<=T. Since the input and target sequences are
not generally the same length, there is no a priori way of aligning
them.
In fact, max_decoded_length[0] depends on the specific matrix softmax_outputs. In particular, two such matrices with exactly the same dimensions can result in different values of max_decoded_length[0].
For example, if you replace the lines
softmax_outputs = np.array([[[0.1, 0.1, 0.8], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.8, 0.1, 0.1]],
[[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
[[0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]],
[[0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7], [0.1, 0.2, 0.7]]])
with the lines
np.random.seed(7)
r=np.random.randint(0,100,size=(4,5,3))
softmax_outputs=r/np.sum(r,2).reshape(4,5,1)
you will get the output
[[1 0 1]
[1 0 1]
[1 0 0]
[1 0 0]]
(in the example above, softmax_outputs consists of valid softmax probabilities and has exactly the same dimensions as the matrix you provided).
On the other hand, changing the seed to np.random.seed(50) gives the output
[[1 0]
[1 0]
[1 0]
[0 1]]
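Also note that tf.sparse_tensor_to_dense fills missing entries with 0 by default, so when the decoded sequences in a batch have different lengths, the shorter rows are right-padded with zeros, and that padding is indistinguishable from label 0. A minimal numpy sketch of that padding behavior (pad_decoded is a hypothetical helper, not a TF API):

```python
import numpy as np

def pad_decoded(sequences, pad_value=0):
    # Mimic sparse->dense conversion: right-pad every decoded sequence
    # to the length of the longest sequence in the batch.
    max_len = max(len(s) for s in sequences)
    return np.array([s + [pad_value] * (max_len - len(s)) for s in sequences])

# Hypothetical ragged decodings of a batch of 4 inputs.
print(pad_decoded([[1, 0, 1], [1, 0, 1], [1, 0], [1]]))
# [[1 0 1]
#  [1 0 1]
#  [1 0 0]
#  [1 0 0]]
```

So a trailing 0 in the dense output may be either a genuine label 0 or padding; the sparse tensor's values and indices disambiguate the two.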
P.S.
Regarding the last part of your question:
In this case I would expect the output to be similar to:
[[2, 0, 0, 0, 0],
[2, 2, 2, 2, 2],
[1, 2, 2, 2, 2],
[2, 2, 2, 2, 2]]
Note that, according to the documentation, num_classes actually represents num_labels + 1. Specifically:
The inputs Tensor's innermost dimension size, num_classes, represents num_labels + 1 classes, where num_labels is the number of true labels, and the largest value (num_classes - 1) is reserved for the blank label.
For example, for a vocabulary containing 3 labels [a, b, c], num_classes = 4 and the labels indexing is {a: 0, b: 1, c: 2, blank: 3}.
So your true labels are 0 and 1, while 2 is reserved for the blank label. The blank label represents the case of observing no label (Section 3.1 here):
A CTC network has a softmax output layer (Bridle, 1990) with one more
unit than there are labels in L. The activations of the first |L|
units are interpreted as the probabilities of observing the
corresponding labels at particular times. The activation of the extra
unit is the probability of observing a ‘blank’, or no label. Together,
these outputs define the probabilities of all possible ways of
aligning all possible label sequences with the input sequence.
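This also explains why the second row of your original example decodes to [1] rather than the empty sequence, even though blank has the highest probability at every frame: the decoder sums the probabilities of all frame-level alignments that collapse to the same labelling. A brute-force sketch of that computation (the helper names are mine, not TF APIs):

```python
import itertools
import numpy as np

def ctc_collapse(path, blank):
    # CTC's many-to-one map B: merge adjacent repeats, then drop blanks.
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return tuple(out)

def best_labelling(frame_probs, blank):
    # Brute-force: sum the probability of every frame-level alignment
    # into the labelling it collapses to, then take the argmax.
    T, C = frame_probs.shape
    totals = {}
    for path in itertools.product(range(C), repeat=T):
        p = np.prod([frame_probs[t, path[t]] for t in range(T)])
        key = ctc_collapse(path, blank)
        totals[key] = totals.get(key, 0.0) + p
    return max(totals.items(), key=lambda kv: kv[1])

# Second batch item from the question: every frame is [0.1, 0.2, 0.7].
frames = np.array([[0.1, 0.2, 0.7]] * 5)
labelling, prob = best_labelling(frames, blank=2)
print(labelling)  # (1,)
```

Here the labelling (1,) accumulates probability from every alignment with a single run of 1s among blanks (about 0.309 in total), which beats the single all-blank alignment for the empty labelling (0.7^5 ≈ 0.168).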