How to use a context window to segment a whole log Mel-spectrogram (ensuring the same number of segments for all the audios)?

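I have several audio files with different durations, so I don't know how to ensure that every audio yields the same number N of segments. I'm trying to implement an existing paper, which says that a log Mel-spectrogram is first computed over the whole audio using 64 Mel filter banks from 20 to 8000 Hz, with a 25 ms Hamming window and a 10 ms overlap. To get that, I have the following lines of code: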

import librosa
import numpy as np

y, sr = librosa.load(audio_file, sr=None)
# sr = 22050
# len(y) = 237142
# duration = 5.377369614512472

n_mels = 64
n_fft = int(np.ceil(0.025*sr)) ## I'm not sure how to complete this parameter
win_length = int(np.ceil(0.025*sr)) # 0.025*22050
hop_length = int(np.ceil(0.010*sr)) #0.010 * 22050
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
# when S is passed, melspectrogram expects a power spectrogram, so square the STFT magnitude
M = np.log(librosa.feature.melspectrogram(S=np.abs(S)**2, sr=sr, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)

# M.shape = (64, 532)

(I'm also not sure how to set the n_fft parameter.) Then, the paper says:

Use a context window of 64 frames to divide the whole log Mel-spectrogram into audio segments with size 64x64. A shift size of 30 frames is used during the segmentation, i.e. two adjacent segments are overlapped with 30 frames. Each divided segment hence has a length of 64 frames and its time duration is 10 ms x (64-1) + 25 ms = 655 ms.

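So I'm stuck on this last part: I don't know how to split M into 64x64 segments. How can I get the same number of segments for all audios (which have different durations), given that in the end I need 64x64xN features as the input to my neural network or classifier? I would appreciate any help! I'm a beginner in audio signal processing.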

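Loop over the frames along the time axis, moving forward 30 frames at a time, and extract a window of the last 64 frames. At the start and end you need to either truncate or pad the data to get full 64-frame windows.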

import librosa
import numpy as np
import math

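# example_audio_file() comes from older librosa releases; with newer versions, point this at any audio file
audio_file = librosa.util.example_audio_file()
y, sr = librosa.load(audio_file, sr=None, duration=5.0)  # only load 5 seconds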

n_mels = 64
n_fft = int(np.ceil(0.025*sr))
win_length = int(np.ceil(0.025*sr))
hop_length = int(np.ceil(0.010*sr))
window = 'hamming'

fmin = 20
fmax = 8000

S = librosa.stft(y, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=window, center=False)
# when S is passed, melspectrogram expects a power spectrogram, so square the STFT magnitude
frames = np.log(librosa.feature.melspectrogram(S=np.abs(S)**2, sr=sr, n_mels=n_mels, fmin=fmin, fmax=fmax) + 1e-6)


window_size = 64
window_hop = 30

# truncate at start and end so that every window contains only real data
# (the alternative would be to zero-pad)
start_frame = window_size
end_frame = window_hop * math.floor(float(frames.shape[1]) / window_hop)

# frame_idx is the exclusive end index of each window
for frame_idx in range(start_frame, end_frame, window_hop):
    window = frames[:, frame_idx-window_size:frame_idx]
    assert window.shape == (n_mels, window_size)
    print('classify window', frame_idx, window.shape)

This will output:

classify window 64 (64, 64)
classify window 94 (64, 64)
classify window 124 (64, 64)
...
classify window 454 (64, 64)
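
Since the classifier needs the segments as one array, the same segmentation can also be written without an explicit loop. The following is a minimal sketch rather than part of the code above: the helper name `segment_frames` is made up for illustration, it assumes NumPy 1.20 or newer for `sliding_window_view`, and it uses random data in place of a real log Mel-spectrogram. Unlike the loop, it keeps every full window starting at frame 0, so the number of segments can differ by one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def segment_frames(frames, window_size=64, window_hop=30):
    """Split (n_mels, n_frames) into (n_segments, n_mels, window_size).

    Hypothetical helper: segment k covers frames [k*window_hop, k*window_hop + window_size).
    """
    n_frames = frames.shape[1]
    if n_frames < window_size:
        raise ValueError("need at least window_size frames")
    # all windows with hop 1: shape (n_mels, n_frames - window_size + 1, window_size)
    windows = sliding_window_view(frames, window_size, axis=1)
    # keep every window_hop-th window and move the segment axis to the front
    return windows[:, ::window_hop, :].transpose(1, 0, 2).copy()

# random stand-in for a log Mel-spectrogram with 64 bands and 498 frames
frames = np.random.default_rng(0).normal(size=(64, 498))
segments = segment_frames(frames)
print(segments.shape)
```

For T frames this gives 1 + floor((T - 64) / 30) segments, which still varies with the length of the audio.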

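However, the number of windows will depend on the length of the audio sample. So if having the same number of windows is important, you need to make sure that all audio samples have the same length.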
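
One way to do that at the feature level is to pad or truncate each spectrogram to a fixed number of frames before segmenting. This is only a sketch under assumptions that go beyond the code above: `fix_num_frames` and `stack_segments` are hypothetical helpers, the target of 16 segments is arbitrary, and short clips are padded with log(1e-6), which is the value the log step above gives for zero energy. With a 64-frame window and a 30-frame shift, N segments need 64 + 30 * (N - 1) frames.

```python
import numpy as np

def fix_num_frames(frames, n_segments, window_size=64, window_hop=30,
                   pad_value=np.log(1e-6)):
    """Pad or truncate (n_mels, n_frames) so it yields exactly n_segments windows."""
    target = window_size + window_hop * (n_segments - 1)
    n_frames = frames.shape[1]
    if n_frames >= target:
        return frames[:, :target]  # drop the extra frames at the end
    pad = target - n_frames
    # pad only the time axis, on the right, with the "silence" value
    return np.pad(frames, ((0, 0), (0, pad)), mode="constant", constant_values=pad_value)

def stack_segments(frames, window_size=64, window_hop=30):
    """Stack all full windows into an array of shape (n_segments, n_mels, window_size)."""
    starts = range(0, frames.shape[1] - window_size + 1, window_hop)
    return np.stack([frames[:, s:s + window_size] for s in starts])

# random stand-ins for log Mel-spectrograms of a short and a long clip
rng = np.random.default_rng(0)
short_clip = rng.normal(size=(64, 300))
long_clip = rng.normal(size=(64, 900))

for clip in (short_clip, long_clip):
    fixed = fix_num_frames(clip, n_segments=16)
    print(stack_segments(fixed).shape)
```

Truncation discards the tail of long recordings and padding adds silence to short ones, so the target length is a trade-off to choose for the dataset at hand.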