Why are window_length/hop_length multiplied by the sample rate in librosa.core.stft in this example?

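I'm new to speech recognition, and I'm working through this implementation to learn about speaker verification in detail. In data_preprocess.py, the author uses the librosa library. Here is a simplified version of the code: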

import os

import librosa
import numpy as np
from tqdm import tqdm


def preprocess_data(data_dir, res_dir, N, M, tdsv_frame, sample_rate, nfft, window_len, hop_len):
    os.makedirs(res_dir, exist_ok=True)
    batch_frames = N * M * tdsv_frame
    batch_number = 0
    batch = []
    batch_len = 0
    for i, path in enumerate(tqdm(os.listdir(data_dir))):
        data, sr = librosa.core.load(os.path.join(data_dir, path), sr=sample_rate)
        S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))
        batch.append(S)
        batch_len += S.shape[1]
        if batch_len < batch_frames: continue
        batch = np.concatenate(batch, axis=1)[:,:batch_frames]
        np.save(os.path.join(res_dir, "voice_%d.npy" % batch_number), batch)
        batch_number += 1
        batch = []
        batch_len = 0


N = 2               # number of speakers of batch
M = 400             # number of utterances per speaker
tdsv_frame = 80     # feature size
sample_rate = 8000  # sampling rate
nfft = 512          # fft kernel size
window_len = 0.025  # window length (ms)
hop_len = 0.01      # hop size (ms)
data_dir = "./data/clean_testset_wav/"
res_dir = "./data/clean_testset_wav_prep/"

Based on a figure in the paper, they want to create batches of features of size (N*M)*tdsv_frame.

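I think I understand the concepts of window_length and hop_length, but my question is about how the author sets these parameters. Why should we multiply these lengths by sample_rate, as is done here: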

S = librosa.core.stft(y=data, n_fft=nfft, win_length=int(window_len * sample_rate), hop_length=int(hop_len * sample_rate))

Thanks.

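librosa.core.stft expects win_length/hop_length as a number of samples. This is typical in digital signal processing, because these systems are fundamentally discrete and are defined by the number of samples per second (the sampling rate).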

However, for human understanding it makes more sense to think about these durations in seconds/milliseconds, as in your example:

window_len = 0.025  # window length (ms)
hop_len = 0.01      # hop size (ms)

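So, to go from a time in seconds to a time in samples, you have to multiply by the sampling rate. Note that although the comments say (ms), the values 0.025 and 0.01 are in seconds (25 ms and 10 ms).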
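As a minimal sketch of that conversion, using the values from the question: the helper name seconds_to_samples is hypothetical (it is not part of librosa), and it rounds where the original code truncates with int().

```python
def seconds_to_samples(duration_s: float, sample_rate: int) -> int:
    """Convert a duration in seconds to a whole number of samples."""
    # round() instead of plain int() so that floating-point error that lands
    # just below a whole number does not cost a sample
    return int(round(duration_s * sample_rate))


sample_rate = 8000  # samples per second
win_length = seconds_to_samples(0.025, sample_rate)  # 25 ms window
hop_length = seconds_to_samples(0.01, sample_rate)   # 10 ms hop
print(win_length, hop_length)  # 200 80
```

These are the values that end up being passed to librosa.core.stft as win_length and hop_length.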

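The comments label window_len and hop_len as (ms), but their values are actually in seconds (0.025 s = 25 ms, 0.01 s = 10 ms), whereas librosa expects a number of samples:

# of samples = sampling_rate * (time in seconds)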
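To see what those sample counts mean for the output, here is a rough sketch of the spectrogram shape you would expect. It assumes librosa's default center=True padding, under which an input of n_samples samples gives 1 + n_samples // hop_length frames and 1 + n_fft // 2 frequency bins. The function name is made up for illustration.

```python
def expected_stft_shape(n_samples: int, n_fft: int, hop_length: int) -> tuple:
    """Expected (frequency bins, frames) of an STFT with centered frames."""
    n_bins = 1 + n_fft // 2                 # one-sided spectrum of a real signal
    n_frames = 1 + n_samples // hop_length  # assumes center=True padding
    return n_bins, n_frames


# One second of audio at 8000 Hz with a 10 ms (80-sample) hop
print(expected_stft_shape(n_samples=8000, n_fft=512, hop_length=80))  # (257, 101)
```

So with a 10 ms hop, each second of audio contributes roughly 100 columns to S, which is what batch_len accumulates in the question's code.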