Quality issue with offline speech-to-text using Sphinx4

Question

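I want to run speech recognition on a large number of .wav files that are being generated continuously.

The growing number of online speech-to-text API services (e.g. Google Cloud Speech, Amazon Lex, Twilio Speech Recognition, Nexmo Voice) are well suited to connected applications, but they are not suitable for this use case because of cost and bandwidth.

A quick Google search suggests that CMUSphinx (CMU = Carnegie Mellon University) is a popular choice for speech recognition.

I tried the 'hello world' example: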

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Main {

    public static void main(String[] args) throws IOException {

        Configuration configuration = new Configuration();

        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

        // try-with-resources closes the audio stream once recognition is done.
        try (InputStream stream = new FileInputStream(new File("src/main/resources/test.wav"))) {
            recognizer.startRecognition(stream);
            SpeechResult result;
            // getResult() returns null when the end of the stream is reached.
            while ((result = recognizer.getResult()) != null) {
                System.out.format("Hypothesis: %s\n", result.getHypothesis());
            }
            recognizer.stopRecognition();
        }

    }
}

The results were somewhat disappointing. The file 'test.wav' contains the following audio:

This is the first interval of speaking. After the first moment of silent, this is the second interval of speaking. After the third moment of silence, this the third interval of speaking and the last one.

This was interpreted as:

this is the first interval speaking ... for the first moment of silence is the second of all speaking ... for the for the moment of silence this is the f***ing several speaking in the last

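Most of the words were captured, but the output is garbled to the point of losing its meaning. I then downloaded a news story with crystal-clear pronunciation, and the transcription was complete gibberish. It captured about as much as a drunk person would when listening to a foreign language.

I'm curious whether anyone has used Sphinx4 successfully and, if so, what tweaks were needed to get it working. Are there alternative acoustic/language models, dictionaries, etc. that perform better? Should I consider any other open-source options for offline speech-to-text?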

Answer 1

It turned out to be a trivial problem, and it is documented in the FAQ: "Q: What is sample rate and how does it affect accuracy"

[...] we can not detect sample rate yet. So before using decoder you need to make sure that both sample rate of the decoder matches the sample rate of the input audio and the bandwidth of the audio matches the bandwidth that was used to train the model. A mismatch results in very bad accuracy.
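
Since the decoder cannot detect the sample rate itself, it may help to check each file before handing it to the recognizer. Below is a minimal sketch using only the JDK's javax.sound.sampled package. The class name WavFormatCheck and the expected values (16 kHz, mono, 16-bit signed PCM) are my assumptions about what the default en-us model wants; they are not taken from the Sphinx4 API.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;
import java.io.File;
import java.io.IOException;

public class WavFormatCheck {

    // Assumed decoder expectations (hypothetical constants, adjust for your model).
    static final float EXPECTED_SAMPLE_RATE = 16000f;
    static final int EXPECTED_CHANNELS = 1;
    static final int EXPECTED_BITS = 16;

    /** Returns true if the format matches the assumed decoder expectations. */
    public static boolean matchesDecoder(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getSampleRate() == EXPECTED_SAMPLE_RATE
                && f.getChannels() == EXPECTED_CHANNELS
                && f.getSampleSizeInBits() == EXPECTED_BITS;
    }

    /** Reads the audio format from the file header. */
    public static AudioFormat readFormat(File wav)
            throws IOException, UnsupportedAudioFileException {
        return AudioSystem.getAudioFileFormat(wav).getFormat();
    }

    public static void main(String[] args) throws Exception {
        // Each argument is a path to a .wav file to inspect.
        for (String path : args) {
            AudioFormat f = readFormat(new File(path));
            String verdict = matchesDecoder(f) ? "OK" : "needs conversion";
            System.out.printf("%s: %s -> %s%n", path, f, verdict);
        }
    }
}
```

Run against the original BBC file, a check like this should report that conversion is needed, because the file is 44.1 kHz stereo.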

The news clip is BBC audio in stereo, recorded at 44.1 kHz.

$ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav

Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav'
Channels       : 2
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:29:23.79 = 77783087 samples = 132284 CDDA sectors
File Size      : 311M
Bit Rate       : 1.41M
Sample Encoding: 16-bit Signed Integer PCM

I converted it to mono:

$ sox GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav remix 1,2
$ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav

Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav'
Channels       : 1
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:29:23.79 = 77783087 samples = 132284 CDDA sectors
File Size      : 156M
Bit Rate       : 706k
Sample Encoding: 16-bit Signed Integer PCM

Then I downsampled it to 16 kHz:

$ sox GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav -r 16k GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav
$ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav

Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:29:23.79 = 28220621 samples ~ 132284 CDDA sectors
File Size      : 56.4M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

Now it works well. Here is an excerpt of the transcription of the news story:

emergency officials said they expect the hall from million people to seek assistance in texas bolton flashy thousand people already being cared for in temporary shelter is on the engine is a big on releasing water from two downs that protect houston city sense of ...
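
Because the question is about a steady stream of .wav files, the two sox steps could also be driven from Java. This is only a sketch: the class name SoxConvert is hypothetical, it assumes sox is on the PATH and that the input is stereo (remix 1,2 needs a second channel), and it assumes sox accepts the -r 16k output option together with the remix effect in a single invocation, which I have not verified against the two separate commands used above.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SoxConvert {

    /** Builds a sox command that mixes channels 1 and 2 to mono and resamples to 16 kHz. */
    public static List<String> buildCommand(Path input, Path output) {
        return Arrays.asList(
                "sox", input.toString(),
                "-r", "16k",          // output sample rate, same option as in the answer
                output.toString(),
                "remix", "1,2");      // same effect as in the answer; stereo input assumed
    }

    /** Runs sox and returns its exit code. Requires sox to be installed. */
    public static int convert(Path input, Path output)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, output))
                .inheritIO()          // show sox warnings and errors on the console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: SoxConvert <input.wav> <output.wav>");
            return;
        }
        int exit = convert(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(exit == 0 ? "converted" : "sox failed with exit code " + exit);
    }
}
```

Keeping the command construction separate from the process launch makes it easy to log or adjust the arguments per file before anything is executed.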

