How to use TF CTC loss with variable length features and labels
I want to implement a speech recognizer with a CTC loss in Tensorflow. The input features have variable length because each speech utterance can have a variable length. The labels also have variable length because each transcription is different. I manually pad the features to create the batches, and in my model I have a tf.keras.layers.Masking() layer to create the mask and propagate it through the network. I also create the label batches with padding.
Here is a dummy example. Suppose I have two utterances of length 3 and 5 frames, respectively. Each frame is represented by a single feature (normally this would be 13 MFCCs, but I reduce it to one to keep it simple). So, to create the batch, I pad the short utterance with 0 at the end:
features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]])
The labels are the transcriptions of these utterances. Say the lengths are 2 and 3, respectively. The label batch shape would be [2, 3, 26], where the batch size is 2, the maximum length is 3, and the number of English characters is 26 (one-hot encoded).
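As an illustration only, here is a minimal sketch of how such a padded one-hot label batch could be built; the transcriptions and the character-to-index mapping below are made up for this example:

import numpy as np

# hypothetical transcriptions as character indices (a=0 ... z=25), lengths 2 and 3
transcripts = [[7, 8],        # "hi"
               [2, 0, 19]]    # "cat"
batch_size, max_len, num_chars = 2, 3, 26
labels = np.zeros((batch_size, max_len, num_chars), dtype=np.float32)
for i, t in enumerate(transcripts):
    for j, c in enumerate(t):
        labels[i, j, c] = 1.0  # one-hot encode; padded positions stay all-zero
print(labels.shape)  # (2, 3, 26)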
The model is:
input_ = tf.keras.Input(shape=(None,1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(26, return_sequences=True)(x)
output_ = tf.keras.layers.Softmax(axis=-1)(x)
model = tf.keras.Model(input_,output_)
The loss function looks like this:
def ctc_loss(y_true, y_pred):
    # Do something here to get logit_length and label_length?
    # ...
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return loss
My question is how to obtain logit_length and label_length. I assumed that logit_length is encoded in the mask, but if I do y_pred._keras_mask, the result is None. For label_length, the information is in the tensor itself, but I am not sure of the most efficient way to get it.
Thanks.
Update:
Following the answer below, I use tf.math.count_nonzero to get label_length, and I set logit_length to the length of the logit layer.
So the shapes inside the loss function are (batch size = 10):
y_true.shape = (10, None)
y_pred.shape = (10, None, 27)
label_length.shape = (10,1)
logit_length.shape = (10,1)
Of course, the 'None' of y_true and y_pred is not the same, since one is the maximum string length of the batch and the other is the maximum number of time frames of the batch. However, when I call model.fit() with these arguments and the tf.keras.backend.ctc_batch_cost() loss, I get the error:
Traceback (most recent call last):
File "train.py", line 164, in <module>
model.fit(dataset, batch_size=batch_size, epochs=10)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
tmp_logs = train_function(iterator)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
result = self._call(*args, **kwds)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
return self._stateless_fn(*args, **kwds)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
return self._call_flat(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
outputs = execute.execute(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Incompatible shapes: [10,92] vs. [10,876]
[[node Equal (defined at train.py:164) ]]
(1) Invalid argument: Incompatible shapes: [10,92] vs. [10,876]
[[node Equal (defined at train.py:164) ]]
[[ctc_loss/Log/_62]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_3156]
Function call stack:
train_function -> train_function
It seems to be complaining that the length of y_true (92) is not the same as the length of y_pred (876), which I think it should not. What am I missing?
Answer:
At least for recent versions of Tensorflow (2.2 and above), the Softmax layer supports masking, and the output for the masked values is not zero; it just repeats the previous values.
features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]]).reshape(2, 5, 1)  # (batch, time, 1)
input_ = tf.keras.Input(shape=(None,1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(2, return_sequences=True)(x)
output_ = tf.keras.layers.Softmax(axis=-1)(x)
model = tf.keras.Model(input_,output_)
r = model(features)
print(r)
The output of the first sample has repeated values corresponding to the mask:
<tf.Tensor: shape=(2, 5, 2), dtype=float32, numpy=array([[[0.53308547, 0.46691453],
[0.5477166 , 0.45228338],
[0.55216545, 0.44783455],
[0.55216545, 0.44783455],
[0.55216545, 0.44783455]],
[[0.532052 , 0.46794805],
[0.54557794, 0.454422 ],
[0.55263203, 0.44736794],
[0.56076777, 0.4392322 ],
[0.5722393 , 0.42776066]]], dtype=float32)>
To get the non-masked values of the sequence (the label_length), I am using tf.version == 2.2 and this works for me:
get_mask = r._keras_mask
You can extract label_length from the values of the get_mask tensor:
<tf.Tensor: shape=(2, 5), dtype=bool, numpy=
array([[ True,  True,  True, False, False],
       [ True,  True,  True,  True,  True]])>
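For example, summing the True entries along the time axis turns that boolean mask into one length per sample (a small sketch; the variable name seq_length is just for illustration):

# r._keras_mask has shape (batch, time) and dtype bool
seq_length = tf.reduce_sum(tf.cast(r._keras_mask, tf.int32), axis=-1, keepdims=True)
print(seq_length)  # [[3], [5]] for the two utterances above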
Or you can get label_length by counting the values that are different from zero in the y_true tensor:
label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
For the value of logit_length, all the implementations I have seen just return the length of the time_step, so logit_length can be:
logit_length = tf.ones(shape=(your_batch_size, 1)) * time_step
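If you do not want to hard-code time_step, a possible sketch is to read both the batch size and the number of time steps from y_pred inside the loss function (the helper name below is hypothetical):

def make_logit_length(y_pred):
    # assumes y_pred has shape (batch, time_step, classes)
    batch_size = tf.shape(y_pred)[0]
    time_step = tf.shape(y_pred)[1]
    # every sample is assigned the full (padded) number of time steps
    return tf.ones(shape=(batch_size, 1), dtype=tf.int32) * time_step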
Or you can use the mask tensor to get the unmasked time_step:
logit_length = tf.reshape(tf.reduce_sum(
    tf.cast(y_pred._keras_mask, tf.float32), axis=1), (your_batch_size, -1))
Here is a full example:
import numpy as np
import tensorflow as tf

features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.5, 2.3, 4.6, 2.0, 1.0]]).reshape(2, 5, 1)
labels = np.array([[1., 2., 3., 0., 0.],
                   [1., 2., 3., 2., 1.]]).reshape(2, 5)
input_ = tf.keras.Input(shape=(5,1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(5, return_sequences=True)(x)  # 5 is the number of classes + blank (in your case 26 + 1)
output_ = tf.keras.layers.Softmax(axis = -1)(x)
model = tf.keras.Model(input_,output_)
def ctc_loss(y_true, y_pred):
    label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
    logit_length = tf.reshape(tf.reduce_sum(
        tf.cast(y_pred._keras_mask, tf.float32), axis=1), (2, -1))
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length,
                                           label_length)
    return tf.reduce_mean(loss)
model.compile(loss=ctc_loss, optimizer='adam')
model.fit(features, labels, epochs=10)
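If you prefer not to hard-code the batch size of 2 in the reshape inside ctc_loss, a possible variant, shown here only as a sketch, is to infer the batch dimension at run time:

def ctc_loss_dynamic(y_true, y_pred):
    # same idea as ctc_loss above, but the batch size is not hard-coded
    label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True, dtype=tf.int32)
    logit_length = tf.reshape(tf.reduce_sum(
        tf.cast(y_pred._keras_mask, tf.float32), axis=1), (-1, 1))
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return tf.reduce_mean(loss)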