Digits recognition with CMU Sphinx

Question

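Hello recognition experts,

I have a lot of mp3 files containing digits (0 - 9). The original audio stream sample rate is 11.025 kHz.

Different speakers (male/female) say, for example, "One", "Seven", "Three" and so on, with pauses of about 2 - 2.5 seconds in between.

I am going to use CMU Sphinx to recognize the speech (desktop application). So I have a few questions:

MP3 decoding: How do I decode my mp3 files, that is, what sample rate should I specify to ffmpeg (as far as I know, it is not recommended to upsample/downsample streams)? Should I filter noise and/or frequency bands while decoding?
Acoustic models: If I don't upsample/downsample the stream, how can I find an acoustic model supporting 11025 Hz? If I do, what is the best model for digits?
Recognition mode: I found there are two modes for transcribing: keyword spotting and recognition. Which mode would be better, taking into account that I have only digits (and some noise)?

Thanks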

UPD:

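Nikolay, thank you for your answer. I tried your suggestions and they work!

If you don't mind, I would like to ask a few more questions:

I found that one of the voxforge acoustic models is more accurate than en-us-8khz. Is that okay?
Only 45% of the files are recognized correctly. The other 55% have 20-90% errors. So my question is: is it possible to estimate the confidence of the obtained result? For example, could I skip the results that are "not surely" recognized?
If the answer to question 2 is "no", what would you suggest to improve accuracy? I know this question is rather abstract...

Thanks in advance!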

UPD2:

By the way, the best parameter settings (I simply iterated over various parameter values) are:

-remove_dc yes -remove_noise no -vad_threshold 3.4 -vad_prespeech 19 -vad_postspeech 37 -silprob 2.5

Answer 1

MP3 decoding: How do I decode my mp3 files, that is, what sample rate should I specify to ffmpeg (as far as I know, it is not recommended to upsample/downsample streams)? Should I filter noise and/or frequency bands while decoding?

 ffmpeg -i file.mp3 -ar 8000 file.wav
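
For many files, the same conversion could be wrapped in a small loop. This is only a sketch: the helper names `wav_name` and `convert_dir` are made up for illustration, and the `-ac 1` (mono) and `-y` (overwrite) flags are my additions on the assumption that the decoder expects single-channel audio and that overwriting existing output is acceptable.

```shell
#!/bin/sh
# Hypothetical helper: map "rec/one.mp3" to "rec/one.wav".
wav_name() {
    printf '%s\n' "${1%.mp3}.wav"
}

# Hypothetical helper: convert every .mp3 in directory $1 to 8 kHz WAV.
convert_dir() {
    for f in "$1"/*.mp3; do
        [ -e "$f" ] || continue   # skip when the glob matches nothing
        # -ar 8000 comes from the command above; -ac 1 and -y are assumptions.
        ffmpeg -y -i "$f" -ar 8000 -ac 1 "$(wav_name "$f")"
    done
}
```

Calling `convert_dir recordings` would then write one `.wav` next to each `.mp3` in a directory named `recordings`.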

Acoustic models: If I don't upsample/downsample the stream, how can I find an acoustic model supporting 11025 Hz? If I do, what is the best model for digits?

The en-us-8khz model is available for download. You need to create a digits grammar as described in the tutorial and then use it as follows:

 pocketsphinx_continuous -infile file.wav -jsgf digits.gram -hmm en-us-8khz -samprate 8000
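
A digits grammar for the command above might look roughly like the following. This is a minimal sketch, not the grammar from the tutorial: the rule name `<digits>` and the exact word list are assumptions, and every word has to exist in the dictionary that ships with the chosen model.

```shell
#!/bin/sh
# Write a small JSGF grammar that accepts one or more spoken digits.
# The quoted 'EOF' keeps the shell from expanding anything in the body.
cat > digits.gram <<'EOF'
#JSGF V1.0;
grammar digits;
public <digits> = ( zero | one | two | three | four | five | six | seven | eight | nine )+ ;
EOF
```

Restricting the search to these ten words is the reason for using a grammar here instead of a general language model.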

Recognition mode: I found there are two modes for transcribing: keyword spotting and recognition. Which mode would be better, taking into account that I have only digits (and some noise)?

Recognition mode.


speech-recognition

voice-recognition

cmusphinx

pocketsphinx