Is my output of librosa MFCC correct? I think I get the wrong number of frames when using librosa MFCC

Question

import librosa

result = librosa.feature.mfcc(y=signal, sr=16000, n_mfcc=13, n_fft=2048, hop_length=400)
result.shape  # (13, 41)

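The signal is 1 second long with a sampling rate of 16000. I compute 13 MFCCs with a hop length of 400. The output dimensions are (13, 41). Why do I get 41 frames? Shouldn't it be (time*sr/hop_length)=40?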

Answer 1

TL;DR answer

Yes, it is correct.

Long answer

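You are using a time series as input (signal), which means that librosa first computes a mel spectrogram using the melspectrogram function. It takes a bunch of arguments, of which you have already specified one (n_fft). It's important to note that melspectrogram also offers the two parameters center and pad_mode, with the default values True and "reflect" respectively.

From the docs: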

pad_mode: string: If center=True, the padding mode to use at the edges of the signal. By default, STFT uses reflection padding.

center: boolean: If True, the signal y is padded so that frame t is centered at y[t * hop_length]. If False, then frame t begins at y[t * hop_length]

In other words, by default librosa makes your signal longer (pads it) in order to support centering.

If you want to avoid this behavior, pass center=False to your mfcc call.

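All that said, when setting center to False, keep in mind that with an n_fft of 2048 and a hop length of 400 you do not necessarily get (time*sr/hop_length)=40 frames, because you also have to account for the window and not just the hop length (unless you pad somehow). The hop length only specifies by how many samples you move the window.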
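To make that concrete for the numbers in the question, here is a minimal sketch of the frame arithmetic. count_frames is a hypothetical helper written for illustration, not a librosa function, and returning 0 for a signal shorter than one window is an assumption of this sketch rather than documented librosa behavior.

```python
def count_frames(n_samples, n_fft, hop_length, center=True):
    """Number of full windows that fit into a signal (illustrative helper)."""
    if center:
        # Centering pads n_fft // 2 samples onto each end of the signal.
        n_samples = n_samples + 2 * (n_fft // 2)
    if n_samples < n_fft:
        return 0  # assumption of this sketch: no frame if no full window fits
    # One frame for the unshifted window, plus one per complete hop.
    return (n_samples - n_fft) // hop_length + 1


n_samples = 1 * 16000  # 1 second at 16 kHz
frames_centered = count_frames(n_samples, n_fft=2048, hop_length=400, center=True)
frames_uncentered = count_frames(n_samples, n_fft=2048, hop_length=400, center=False)
print(frames_centered, frames_uncentered)  # 41 35
```

So under these assumptions the default gives the 41 frames from the question, while center=False would give 35, not 40.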

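To give an extreme example, consider a very large window and a very short hop length: assume 10 samples (e.g. time=1s, sr=10Hz), a window length of n_fft=9 and hop_length=1 with center=False. Now imagine sliding the window over the 10 samples.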

   ◼︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎◻︎
   ◻︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎◼︎
t  0123456789

◻︎ sample not covered by window
◼︎ sample covered by window

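At first the window starts at t=0 and ends at t=8. How many times can we shift it by hop_length and still expect it not to run out of samples? Exactly once, until it starts at t=1 and ends at t=9. Add the first unshifted frame and you arrive at 2 frames. This is obviously different from the incorrect (time*sr/hop_length)=1*10/1=10.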

Correct would be: (time*sr-n_fft)//hop_length+1=(1*10-9)//1+1=2, with // denoting Python-style integer division.
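The sliding-window picture can also be checked by brute force. The following sketch lists every window position that fits entirely inside the signal and compares the count with the formula. window_spans is a hypothetical helper name, and the code only models the center=False case.

```python
def window_spans(n_samples, n_fft, hop_length):
    """Return (first, last) sample indices of every window that fits entirely."""
    spans = []
    start = 0
    while start + n_fft <= n_samples:
        spans.append((start, start + n_fft - 1))
        start += hop_length  # slide the window by one hop
    return spans


spans = window_spans(n_samples=10, n_fft=9, hop_length=1)
formula = (10 - 9) // 1 + 1
print(spans)                 # [(0, 8), (1, 9)]
print(len(spans) == formula)  # True
```

The same enumeration with n_samples=16000, n_fft=2048 and hop_length=400 yields 35 windows.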

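When using the default, i.e. center=True, the signal is padded with n_fft // 2 samples on both ends, so n_fft falls out of the equation. The count becomes (time*sr)//hop_length+1, which for your signal is 16000//400+1=41.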
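Why n_fft drops out can be seen by padding a signal by hand. This is a rough sketch that assumes "reflect" mirrors the samples around each edge without repeating the edge sample, as NumPy's pad mode of the same name does. reflect_pad is a hypothetical stand-in written for this example, not librosa's own code.

```python
def reflect_pad(samples, pad):
    """Mirror `pad` samples around each edge, without repeating the edge sample."""
    samples = list(samples)
    if pad >= len(samples):
        raise ValueError("pad must be smaller than the signal length")
    left = samples[1:pad + 1][::-1]
    right = samples[-pad - 1:-1][::-1]
    return left + samples + right


small = reflect_pad([0, 1, 2, 3, 4], 2)
print(small)  # [2, 1, 0, 1, 2, 3, 4, 3, 2]

n_fft, hop_length = 2048, 400
signal = list(range(16000))  # placeholder samples, 1 second at 16 kHz
padded = reflect_pad(signal, n_fft // 2)

# The padding adds n_fft samples in total, which cancels the -n_fft term.
n_frames = (len(padded) - n_fft) // hop_length + 1
print(len(padded), n_frames)  # 18048 41
```

For an even n_fft the padded signal is exactly n_fft samples longer, so the frame count reduces to len(signal) // hop_length + 1.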
