Getting 96 MFCC features using python_speech_features

Question

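I want to train my model on 96 MFCC features. I tried Librosa first, but the results were not promising. I then tried python_speech_features, but I cannot get more than 26 features out of it. Why? Here are the shapes I get for the same audio file: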

Using Librosa

x = librosa.feature.mfcc(audio, rate, n_mfcc=96)
x.shape  # (96, 204)

Using python_speech_features

mfcc_feature = pySpeech.mfcc(audio, rate, 0.025, 0.01, 96, nfft=1200, appendEnergy = True)
mfcc_feature.shape # output => (471, 26)

Any ideas?

Answer 1

The librosa and python_speech_features implementations differ in structure and even in the underlying theory. Based on the documentation:

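The output layouts differ: librosa returns an array of shape (n_mfcc, t), while python_speech_features returns (num_frames, numcep), so you need to transpose one of the two. You will also notice that in python_speech_features any numcep value above 26 has no effect on the number of coefficients returned. That is because you are limited by the number of filters used (nfilt, 26 by default), so you have to increase nfilt as well. You also need to make sure the framing uses equivalent values: librosa takes the window and hop sizes as sample counts, whereas python_speech_features takes them as durations in seconds, so you have to convert one to the other. Finally, python_speech_features accepts the int16 values returned by scipy's read function, but librosa requires floating-point input, so you have to convert the array or load the file with librosa.load() instead. Here is a small snippet that includes these changes: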

import librosa
import numpy as np
import python_speech_features
from scipy.io.wavfile import read


# init fname
fname = "sample.wav"

# read audio 
rate, audio = read(fname)

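# using librosa (needs floating-point samples; window and hop are in samples)
# keep n_fft >= win_length (increase n_fft for higher sample rates)
librosa_mfcc_feature = librosa.feature.mfcc(y=audio.astype(np.float32),
                                            sr=rate,
                                            n_mfcc=96,
                                            n_fft=1024,
                                            win_length=int(0.025*rate),
                                            hop_length=int(0.01*rate))
# transpose to (num_frames, n_mfcc) to match python_speech_features
print(librosa_mfcc_feature.T.shape)

# using python_speech_features (window and step are in seconds)
psf_mfcc_feature = python_speech_features.mfcc(signal=audio,
                                               samplerate=rate,
                                               winlen=0.025,
                                               winstep=0.01,
                                               numcep=96,
                                               nfilt=96,  # numcep is capped at nfilt
                                               nfft=1024,
                                               appendEnergy=False)
print(psf_mfcc_feature.shape)


# check if size is the same (compare against the transposed librosa output)
print(librosa_mfcc_feature.T.shape == psf_mfcc_feature.shape)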

I tested this, and the output was as follows:

(9003, 96)
(9001, 96)
False

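The output is not exactly the same, but the difference is only 2 frames. By the way, the values will not be the same either, because each library computes MFCCs differently: python_speech_features applies a discrete Fourier transform to frames it cuts itself, whereas librosa uses a short-time Fourier transform, which by default pads the signal so that frames are centered and can therefore yield a few more frames.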
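To make the two shape effects above concrete (the cap at 26 coefficients and the 2-frame gap), here is a minimal sketch that predicts the output shapes with plain arithmetic, without calling either library. The helper names psf_output_shape and librosa_output_shape are hypothetical. The sketch assumes the library defaults as I understand them (nfilt=26 in python_speech_features, center=True with an even n_fft in librosa), and the 16 kHz signal of 1,440,320 samples is an illustrative choice, not the actual length or rate of the file used above.

```python
import math


def psf_output_shape(n_samples, frame_len, frame_step, numcep=13, nfilt=26):
    """Predict the (num_frames, num_cepstra) shape of python_speech_features.mfcc.

    frame_len and frame_step are in samples (winlen * rate, winstep * rate).
    """
    # Framing starts at sample 0 with no padding at the start;
    # only the last partial frame is zero-padded.
    if n_samples <= frame_len:
        num_frames = 1
    else:
        num_frames = 1 + math.ceil((n_samples - frame_len) / frame_step)
    # The cepstra are taken from nfilt filterbank outputs,
    # so asking for more than nfilt coefficients has no effect.
    return num_frames, min(numcep, nfilt)


def librosa_output_shape(n_samples, hop_length, n_mfcc=20):
    """Predict the (n_mfcc, t) shape of librosa.feature.mfcc.

    Assumes center=True, an even n_fft, and n_mfcc <= n_mels.
    """
    # Centering pads n_fft // 2 samples on each side, giving 1 + n // hop frames.
    return n_mfcc, 1 + n_samples // hop_length


# Hypothetical signal: 16 kHz, 25 ms window (400 samples), 10 ms step (160 samples)
n = 1_440_320
print(psf_output_shape(n, 400, 160, numcep=96))            # (9001, 26)
print(psf_output_shape(n, 400, 160, numcep=96, nfilt=96))  # (9001, 96)
print(librosa_output_shape(n, 160, n_mfcc=96))             # (96, 9003)
```

If the frame counts need to match exactly, one option is to trim the extra librosa frames after transposing, or to pass center=False to librosa and check the resulting count against the other library.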
