Converting an MFCC spectrogram into input for a CNN (audio recognition)

Question

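I have a dataset of audio files, and I have converted each of them into an MFCC plot like the one shown below:

Now I want to feed this data to my neural network.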

import tensorflow as tf
import tensorflow.keras as tfk
import tensorflow.keras.layers as tfkl

cnn_model = tfk.Sequential(name='CNN_model')
cnn_model.add(tfkl.Conv1D(filters= 225, kernel_size= 11, padding='same', activation='relu', input_shape=(4500,9000, 3)))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.Bidirectional(tfkl.GRU(200, activation='relu', return_sequences=True, implementation=0)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())
cnn_model.compile(loss='mae', optimizer='Adam', metrics=['mae'])

cnn_model.summary()

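I used Conv1D because it is the layer used in this kind of neural network. But I don't know how to turn the data from an image into the input of the CNN. I have tried several transformations myself, but none of them worked.

As you can see in the picture below, I need to feed the first Conv1D layer, but I can't, because the shape of my image is (4500, 9000, 3). So basically, what I want to do is transform this image into an input for Conv1D, in the same way as in the picture below.

This image represents one audio clip passed to the NN.

Obviously, when I pass an image with this shape to the Conv1D layer, I get a ValueError: Input 0 of layer conv1d_4 is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: [None, 4500, 9000, 3]

I tried converting my image to grayscale, but that is not the way to go, because I lose valuable information.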

Answer 1

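I think you could convert the image to grayscale, but you would probably lose a lot of valuable data.

The best approach is to reshape the MFCC spectrogram: img.reshape(4500, 3 * 9000)

Example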

# Sample data
>>> a
array([[[1, 1, 1],
        [2, 2, 2]],

       [[3, 3, 3],
        [4, 4, 4]]])
>>> a.shape
(2, 2, 3)

# Reshaping data
>>> a.reshape(2, -1)
array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])

# Or
>>> a.reshape(2, 6)
array([[1, 1, 1, 2, 2, 2],
       [3, 3, 3, 4, 4, 4]])

Answer 2

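If you say X is time, then, considering the shape (examples, time_steps, frequency_bins, img_channels), there are a few things you can try.

Option 1

The most obvious one is the one mentioned in @skillsmuggler's answer. Everything that is not time is a feature, so: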

#if in the model:
x_train = original_x_train
cnn_model.add(tfkl.Reshape((4500, 27000), input_shape=(4500,9000,3))) #first layer


#if directly in the data:
x_train = original_x_train.reshape((-1, 4500, 27000))
cnn_model.add(tfkl.Conv1D(filters= 225, kernel_size= 11, padding='same', 
                          activation='relu', input_shape=(4500, 27000))) #original first layer, with the reshaped input

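Option 2

But there are more possibilities. I don't know exactly what MFCCs are, but I suspect the plot is made of:

x = time steps
y = frequency bins
color = intensity

If so, the first thing to do is to get the raw intensity values instead of 3-channel pixel values. It is much easier for a network to make sense of one continuous value than of 3 channels that vary in a more complicated way to represent the same thing (those colors are nice for our human eyes only, but they are mathematically far more complicated).

If you have access to the raw values instead of the colors, you can input data of shape (examples, time_steps, frequency_bins) and that's it, with no image color channels. Fewer inputs, and a better representation of the information. The value in this case is the "intensity".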

print(x_train.shape) #-> (examples, 4500, 9000)

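Then your model doesn't need to change, apart from the input_shape of the first layer, which becomes (4500, 9000).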
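To make the "fewer inputs" point concrete, here is a small sketch that counts the weights of the first Conv1D layer for both input layouts. The helper name conv1d_params is made up for this illustration; the formula (kernel size times input channels times filters, plus one bias per filter) is the usual one for a 1D convolution with bias.

```python
def conv1d_params(kernel_size, in_channels, filters, use_bias=True):
    """Number of trainable weights in a 1D convolution layer."""
    weights = kernel_size * in_channels * filters
    return weights + (filters if use_bias else 0)

# Option 1: colour image flattened to 9000 * 3 features per time step.
rgb_params = conv1d_params(kernel_size=11, in_channels=9000 * 3, filters=225)
# Option 2: one raw intensity value per frequency bin.
raw_params = conv1d_params(kernel_size=11, in_channels=9000, filters=225)

print(rgb_params)  # 66825225
print(raw_params)  # 22275225
```

The colour version needs three times as many kernel weights in the first layer while, if the colours only encode intensity, carrying no extra information.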

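Option 3

Now, if you feel that you lose information with the approaches above, there are many fancier things you can try. The first one I can think of is to convolve over the frequency dimension first, pool or collapse it in some way, and only then start working on the time dimension.

Something like this two-part model.

Part 1: convolve and collapse the frequencies.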

import tensorflow.keras.backend as K

#filters, size, around_100 and the "..." below are placeholders to fill in
input_channels = 1 #or 3; preferably 1, following option 2, 
                   #but it's possible to use the 3 channel images too (less optimal)

cnn_model = tfk.Sequential(name='CNN_model')
cnn_model.add(tfkl.TimeDistributed(
                  tfkl.Conv1D(filters, size, activation=...), 
                  input_shape=(4500,9000,input_channels))) 

#shapes will be all in the type (examples, 4500, decreasing_freq_size, increasing_channels)
#make a time distributed conv model in the VGG style until you collapse the last channel dimension
cnn_model.add(tfkl.TimeDistributed(tfkl.Conv1D(...)))
...
cnn_model.add(tfkl.TimeDistributed(tfkl.MaxPooling1D())) #pooling must be time distributed too on 4D data
cnn_model.add(tfkl.TimeDistributed(tfkl.Conv1D(...)))
...
cnn_model.add(tfkl.TimeDistributed(tfkl.MaxPooling1D()))
cnn_model.add(tfkl.TimeDistributed(tfkl.Conv1D(...)))
...

#when the 9000 has been reduced a lot
cnn_model.add(tfkl.Lambda(lambda x: K.mean(x, axis=2))) 
    #the line above is equivalent to the following, but seems more efficient
    #cnn_model.add(tfkl.TimeDistributed(tfkl.GlobalAveragePooling1D())) 

#new shape style: (examples, 4500, increased_channels) 
#no need for a huge number of channels, maybe around 100?
cnn_model.add(tfkl.Dense(units=around_100)) #Dense is equal to TimeDistributed(Dense)

Now that you have transformed the shape (examples, 4500, 9000, ch_1_or_3) into (examples, 4500, features_around_100), you can move on to the second part, which is your original model.
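The shape bookkeeping of the collapse step can be checked with plain numpy. This is only a sketch with small stand-in sizes (the sizes and variable names are assumptions for illustration); it shows that averaging over axis 2 removes the frequency axis and matches pooling each time step on its own.

```python
import numpy as np

# Small stand-in sizes (assumed; the real data would be 4500 x reduced_freq).
examples, time_steps, freq_bins, channels = 2, 5, 8, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(examples, time_steps, freq_bins, channels))

# Averaging over axis 2 removes the frequency axis, keeping time and channels.
collapsed = x.mean(axis=2)
print(collapsed.shape)  # (2, 5, 4)

# The same thing done one time step at a time, which is what a
# time-distributed global average pooling would compute.
per_step = np.array([[x[n, t].mean(axis=0) for t in range(time_steps)]
                     for n in range(examples)])
print(np.allclose(collapsed, per_step))  # True
```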

Part 2: continue with your original model.

cnn_model.add(tfkl.Conv1D(filters= 225, kernel_size= 11, padding='same', activation='relu'))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.Bidirectional(
                  tfkl.GRU(200, activation='relu', return_sequences=True, implementation=0)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.BatchNormalization())
cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())
cnn_model.compile(loss='mae', optimizer='Adam', metrics=['mae'])

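Option 4

This can be used together with option 3. Since the frequency dimension probably increases linearly along the vertical axis, and since a convolution does not see, as a whole, the dimension it is convolving over, you can add a channel holding normalized frequency values (not the intensity, but the actual frequency) and see whether it adds useful information.

So, as an example, take option 2 with shape (examples, 4500, 9000, channels_1_or_3). Choose one of the following:

In the input data: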

import numpy

examples = original_x_train.shape[0] #number of examples in the data
freq_channel = (numpy.arange(9000) / 9000) - 0.5 #shape (9000,)
freq_channel = numpy.stack([freq_channel] * 4500, axis=0) #shape (4500,9000)
freq_channel = numpy.stack([freq_channel] * examples, axis=0) #shape (examples, 4500, 9000)
freq_channel = freq_channel.reshape((-1, 4500, 9000, 1))
new_x_train = numpy.concatenate([original_x_train, freq_channel], axis=-1)

In the model:

import tensorflow.keras.backend as K

def add_freq_channel(x):
    #x has shape (examples, 4500, 9000, channels)

    #float dtype, so the division and the product below stay in float32
    freq_channel = (K.arange(9000, dtype='float32') / 9000) - 0.5 #shape (9000,)
    freq_channel = K.reshape(freq_channel, (1, 1, 9000, 1))
    #broadcast against ones of shape (examples, 4500, 9000, 1)
    freq_channel = freq_channel * K.ones_like(x[..., :1])

    return K.concatenate([x, freq_channel], axis=-1)

cnn_model.add(tfkl.Lambda(add_freq_channel, input_shape=(4500,9000,channels))) #channels: 1 or 3, as in your data

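You could perhaps (I am not sure it brings an improvement) extend this idea to the time dimension as well. Follow the same procedure as above to add one more channel, but along the X axis, with size 4500. In that case you can use it together with any of the other options.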
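A minimal numpy sketch of that time-position channel, applied to the data rather than inside the model. The function name add_time_channel and the tiny array sizes are assumptions for illustration; the normalisation to the range -0.5 to 0.5 simply mirrors the frequency channel above.

```python
import numpy as np

def add_time_channel(x):
    """Append a channel holding the normalised time position of each step.

    x is (examples, time_steps, ..., channels); time is assumed to be axis 1.
    """
    time_steps = x.shape[1]
    position = np.arange(time_steps) / time_steps - 0.5       # (time_steps,)
    # Reshape to (1, time_steps, 1, ..., 1) so it broadcasts over all other axes.
    position = position.reshape((1, time_steps) + (1,) * (x.ndim - 2))
    time_channel = np.broadcast_to(position, x.shape[:-1] + (1,))
    return np.concatenate([x, time_channel], axis=-1)

x = np.zeros((2, 4, 3, 1))     # (examples, time, freq, channels)
out = add_time_channel(x)
print(out.shape)               # (2, 4, 3, 2)
print(out[0, :, 0, -1])        # values -0.5, -0.25, 0.0, 0.25
```

Because the time axis is taken to be axis 1, the same function also works on (examples, time, features) data.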

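General advice about your model

I am not sure how GRU works internally, but since it is recurrent, it may be a better idea to stick with activation='tanh' in that layer. I read somewhere (I can't remember where) that 'tanh' activations work better, at least for LSTM layers, possibly because the recurrent computations can otherwise explode. (Of course, you can test this and draw better conclusions.)
tfkl.TimeDistributed(tfkl.Dense(20)) is equivalent to tfkl.Dense(20) in Keras. You can avoid the overhead of adding TimeDistributed here.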
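The equivalence between Dense and TimeDistributed(Dense) on 3D input can be illustrated with a numpy stand-in (this is not Keras itself; W and b are random placeholders playing the role of the Dense(20) kernel and bias). A dense layer only acts on the last axis, so applying it to the whole tensor gives the same numbers as applying it to each time step separately.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 5))     # (examples, time_steps, features)
W = rng.normal(size=(5, 20))       # stand-in for the Dense(20) kernel
b = rng.normal(size=(20,))         # stand-in for the Dense(20) bias

# Dense on a 3D tensor: the kernel acts on the last axis only.
dense_out = x @ W + b              # (2, 6, 20)

# Time-distributed version: the same weights applied to each time step in turn.
td_out = np.stack([x[:, t, :] @ W + b for t in range(x.shape[1])], axis=1)

print(dense_out.shape)                 # (2, 6, 20)
print(np.allclose(dense_out, td_out))  # True
```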

Answer 3

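I feel that you are not treating this as a typical speech recognition problem, because I found several strange choices in your approach.

Issues I noticed

Output shape of the MFCC operation

If you look at librosa.feature.mfcc, this is what it says: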

Returns: M:np.ndarray [shape=(n_mfcc, t)]

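As you can see, there are no channels here. There is an input dimension (n_mfcc) and a time dimension (t). So you should be able to use Conv1D directly, without any preprocessing.

Dropout before the Softmax

This is what the tail of your model looks like: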

cnn_model.add(tfkl.TimeDistributed(tfkl.Dense(20)))
cnn_model.add(tfkl.Dropout(0.2))
cnn_model.add(tfkl.Softmax())

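Personally, I have not seen anyone use dropout on the last layer, so I would get rid of it. Dropout randomly switches neurons off, but you want all the output nodes to be on at all times.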
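As a rough numpy illustration of why this matters (the logits and the dropout mask are hand-picked for the example, not taken from the model above): during training, dropout zeroes some of the values feeding the softmax and rescales the rest, so the class that would have won can simply be switched off.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rate = 0.2
logits = np.array([2.0, 1.0, 0.5, -1.0, 3.0])   # made-up outputs of the last Dense layer
mask = np.array([1.0, 1.0, 1.0, 1.0, 0.0])      # hand-picked mask: the last unit is dropped

clean = softmax(logits)
# Inverted dropout as applied at training time: zero dropped units, rescale the kept ones.
noisy = softmax(logits * mask / (1.0 - rate))

print(clean.argmax())  # 4
print(noisy.argmax())  # 0
```

Dropout is inactive at inference time, so this only disturbs the training signal, but that is exactly where the loss is computed.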

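Loss function

Typically, CTC is used to optimize speech recognition models. I (personally) haven't seen anyone use mae as the loss for a speech model, because the input data and the label data usually have unaligned time dimensions. That is, there is not always a label corresponding to each time step of the prediction. This is where the CTC loss shines, and it is probably what you want to use for this model (unless you are 100% sure that there is a label for every prediction and that they are perfectly aligned).

That said, the loss depends on the problem you are solving. But I will include an example of how to use this loss for this problem.

A working example

Dataset

To show a working example, I will be using this speech dataset. I chose it because the problem is simple, so I can get a good result quickly.

Input: audio
Output: a label 0-9

MFCC transformation

You can then compute the MFCCs of an audio file, which gives you a heatmap like the one below. As I said earlier, this is a 2D array of size (n_mfcc, timesteps). With the batch dimension added, it becomes (batch size, n_mfcc, timesteps).

Here is how you can visualize the above. Here, y is the audio loaded through the librosa.core.load() function.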

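import librosa
import librosa.display
import matplotlib.pyplot as plt

#audios and aid are the answerer's own data structures (not shown here)
y = audios[aid][1][0]
sr = audios[aid][1][1]
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
print(mfcc.shape)

plt.figure(figsize=(6, 4))
librosa.display.specshow(mfcc, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.tight_layout()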

Creating the training/testing data

Next you can create the training and testing data. This is what I created:

train_data - A (sample size, timesteps, n_mfcc) sized array
train_labels - A (sample size, timesteps, num_classes) sized array
train_inp_lengths - A (sample size,) sized array (for the CTC loss)
train_seq_lengths - A (sample size,) sized array (for the CTC loss)
test_data - A (sample size, timesteps, n_mfcc) sized array
test_labels - A (sample size, timesteps, num_classes+1) sized array
test_inp_lengths - A (sample size,) sized array (for the CTC loss)
test_seq_lengths - A (sample size,) sized array (for the CTC loss)

I am using the following mapping to convert characters to numbers:

alphabet = 'abcdefghijklmnopqrstuvwxyz '
a_map = {} # map letter to number
rev_a_map = {} # map number to letter
for i, a in enumerate(alphabet):
  a_map[a] = i
  rev_a_map[i] = a

label_map = {0:'zero', 1:'one', 2:'two', 3:'three', 4:'four', 5:'five', 6:'six', 7: 'seven', 8: 'eight', 9:'nine'}

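A few things to note.

Note that the mfcc operation returns (n_mfcc, time). You have to permute the axes to get the (time, n_mfcc) format, so that the convolution happens along the time dimension.
I also had to make sure that the labels have exactly the same number of time steps as the inputs (ctc_loss itself does not require this, but the Keras model definition enforces it). This is done by adding spaces at the end of each character sequence.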
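The two notes above can be sketched as follows. This is not the answerer's preprocessing code, which is not shown; encode_label is a hypothetical helper, and the MFCC matrix is a dummy array standing in for librosa output with an assumed n_mfcc of 20 and 10 time steps.

```python
import numpy as np

alphabet = 'abcdefghijklmnopqrstuvwxyz '
a_map = {ch: i for i, ch in enumerate(alphabet)}

def encode_label(word, timesteps):
    """Pad `word` with trailing spaces to `timesteps` characters and one-hot encode it."""
    if len(word) > timesteps:
        raise ValueError("label is longer than the number of time steps")
    padded = word.ljust(timesteps)
    ids = np.array([a_map[ch] for ch in padded])
    return np.eye(len(alphabet))[ids]            # (timesteps, num_classes)

# Dummy stand-in for one MFCC matrix of shape (n_mfcc, time).
mfcc = np.arange(20 * 10, dtype=float).reshape(20, 10)
features = mfcc.T                                # permute to (time, n_mfcc)

label = encode_label('eight', timesteps=features.shape[0])
print(features.shape, label.shape)               # (10, 20) (10, 27)
```

Stacking such features and labels along a new first axis gives arrays in the (sample size, timesteps, ...) layout listed earlier.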

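Defining the model

I have changed from the Sequential API to the functional API, because I need several input layers to make this work with ctc_loss. I also got rid of the last dropout layer.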

import tensorflow.keras.backend as K

def ctc_loss(inp_lengths, seq_lengths):
    def loss(y_true, y_pred):
        #labels are one-hot, so argmax recovers the character ids that ctc_batch_cost expects
        l = tf.reduce_mean(K.ctc_batch_cost(tf.argmax(y_true, axis=-1), y_pred, inp_lengths, seq_lengths))        
        return l            
    return loss

K.clear_session()
inp = tfk.Input(shape=(10,50))
inp_len = tfk.Input(shape=(1,))
seq_len = tfk.Input(shape=(1,))
out = tfkl.Conv1D(filters= 128, kernel_size= 5, padding='same', activation='relu')(inp)
out = tfkl.BatchNormalization()(out)
out = tfkl.Bidirectional(tfkl.GRU(128, return_sequences=True, implementation=0))(out)
out = tfkl.Dropout(0.2)(out)
out = tfkl.BatchNormalization()(out)
out = tfkl.TimeDistributed(tfkl.Dense(27, activation='softmax'))(out)
cnn_model = tfk.models.Model(inputs=[inp, inp_len, seq_len], outputs=out)
cnn_model.compile(loss=ctc_loss(inp_lengths=inp_len , seq_lengths=seq_len), optimizer='Adam', metrics=['mae'])

Training the model

Then you simply call,

cnn_model.fit([train_data, train_inp_lengths, train_seq_lengths], train_labels, batch_size=64, epochs=20)

This gives,

Train on 900 samples
Epoch 1/20
900/900 [==============================] - 3s 3ms/sample - loss: 11.4955 - mean_absolute_error: 0.0442
Epoch 2/20
900/900 [==============================] - 2s 2ms/sample - loss: 4.1317 - mean_absolute_error: 0.0340
...
Epoch 19/20
900/900 [==============================] - 2s 2ms/sample - loss: 0.1162 - mean_absolute_error: 0.0275
Epoch 20/20
900/900 [==============================] - 2s 2ms/sample - loss: 0.1012 - mean_absolute_error: 0.0277

Predicting with the model

import numpy as np

y = cnn_model.predict([test_data, test_inp_lengths, test_seq_lengths])

n_ids = 5

for pred, true in zip(y[:n_ids,:,:], test_labels[:n_ids,:,:]):
  pred_ids = np.argmax(pred,axis=-1)
  true_ids = np.argmax(true, axis=-1)
  print('pred > ',[rev_a_map[tid] for tid in pred_ids])
  print('true > ',[rev_a_map[tid] for tid in true_ids])

This gives,

pred >  ['e', ' ', 'i', 'i', 'i', 'g', 'h', ' ', ' ', 't']
true >  ['e', 'i', 'g', 'h', 't', ' ', ' ', ' ', ' ', ' ']

pred >  ['o', ' ', ' ', 'n', 'e', ' ', ' ', ' ', ' ', ' ']
true >  ['o', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

pred >  ['s', 'e', ' ', ' ', ' ', ' ', ' ', ' ', 'v', 'e']
true >  ['s', 'e', 'v', 'e', 'n', ' ', ' ', ' ', ' ', ' ']

pred >  ['z', 'e', ' ', ' ', ' ', ' ', ' ', 'r', 'o', ' ']
true >  ['z', 'e', 'r', 'o', ' ', ' ', ' ', ' ', ' ', ' ']

pred >  ['n', ' ', ' ', 'i', 'i', 'n', 'e', ' ', ' ', ' ']
true >  ['n', 'i', 'n', 'e', ' ', ' ', ' ', ' ', ' ', ' ']

To get rid of the repeated letters and the spaces in between, use the ctc_decode function as follows.

y = cnn_model.predict([test_data, test_inp_lengths, test_seq_lengths])

sess = K.get_session() #TF 1.x style session API
decoded = sess.run(tf.keras.backend.ctc_decode(y, test_inp_lengths[:,0]))

rev_a_map[-1] = '-' #ctc_decode returns -1 for blank positions

#separate names for the decoded batch and the loop variable, to avoid shadowing
for pred, true in zip(decoded[0][0][:n_ids,:], test_labels[:n_ids,:,:]):
  print(pred.shape)  
  true_ids = np.argmax(true, axis=-1)
  print('pred > ',[rev_a_map[tid] for tid in pred])
  print('true > ',[rev_a_map[tid] for tid in true_ids])
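What greedy CTC decoding does to the raw per-step predictions can be sketched in pure Python. This is an illustration, not the ctc_decode implementation. It assumes the space character plays the role of the CTC blank, which fits this setup because the Keras CTC cost treats the last class index as blank and the space is the last entry of the alphabet.

```python
from itertools import groupby

def ctc_collapse(chars, blank=' '):
    """Merge runs of repeated symbols, then drop the blank symbol."""
    merged = [ch for ch, _ in groupby(chars)]
    return ''.join(ch for ch in merged if ch != blank)

# Raw per-step predictions copied from the output shown earlier.
pred_eight = ['e', ' ', 'i', 'i', 'i', 'g', 'h', ' ', ' ', 't']
pred_nine = ['n', ' ', ' ', 'i', 'i', 'n', 'e', ' ', ' ', ' ']

print(ctc_collapse(pred_eight))  # eight
print(ctc_collapse(pred_nine))   # nine
```

The blank between two identical letters is what lets CTC keep genuine double letters apart from a single letter held over several time steps.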