Why is my Sphinx4 recognition poor?
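I am learning how to use Sphinx4 through the Maven plugin for Eclipse.
I took the transcriber demo I found on GitHub and modified it to process a file of my own. The audio file is 16-bit, mono, 16 kHz, and about 13 seconds long. I noticed that it sounds as if it is playing in slow motion.
The words spoken in the file are: "also make sure it's easy for you to access the recording files so you could upload it if asked".
I am trying to transcribe the file, and the results are terrible. I have looked for forum posts or links that thoroughly explain how to improve the results, or what I am doing wrong, but I have not found anything.
I would like to improve the transcription accuracy, but I want to avoid training a model myself, because the kinds of data my current project has to handle vary so much. Is that not possible? Is the code I am using off?
Code
(Note: the audio file is available at https://instaud.io/8qv)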
import java.io.File;
import java.io.FileInputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;
import edu.cmu.sphinx.decoder.adaptation.Stats;
import edu.cmu.sphinx.decoder.adaptation.Transform;
import edu.cmu.sphinx.result.WordResult;

public class App {
    public static void main(String[] args) throws Exception {
        System.out.println("Loading models...");
        Configuration configuration = new Configuration();
        // Load model from the jar
        configuration
                .setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        // You can also load model from folder
        // configuration.setAcousticModelPath("file:en-us");
        configuration
                .setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration
                .setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.dmp");
        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(
                configuration);
        FileInputStream stream = new FileInputStream(new File("/home/tmscanlan/workspace/example/vocaroo_test_revised.wav"));
        // stream.skip(44); I commented this out due to the short length of my file
        // Simple recognition with generic model
        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            // I added the following print statements to get more information
            System.out.println("\ngetWords() before loop: " + result.getWords());
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
            System.out.print("\nThe getResult(): " + result.getResult()
                    + "\nThe getLattice(): " + result.getLattice());
            System.out.println("List of recognized words and their times:");
            for (WordResult r : result.getWords()) {
                System.out.println(r);
            }
            System.out.println("Best 3 hypothesis:");
            for (String s : result.getNbest(3))
                System.out.println(s);
        }
        recognizer.stopRecognition();
        // Live adaptation to speaker with speaker profiles
        stream = new FileInputStream(new File("/home/tmscanlan/workspace/example/warren_test_smaller.wav"));
        // stream.skip(44); I commented this out due to the short length of my file
        // Stats class is used to collect speaker-specific data
        Stats stats = recognizer.createStats(1);
        recognizer.startRecognition(stream);
        while ((result = recognizer.getResult()) != null) {
            stats.collect(result);
        }
        recognizer.stopRecognition();
        // Transform represents the speech profile
        Transform transform = stats.createTransform();
        recognizer.setTransform(transform);
        // Decode again with updated transform
        stream = new FileInputStream(new File("/home/tmscanlan/workspace/example/warren_test_smaller.wav"));
        // stream.skip(44); I commented this out due to the short length of my file
        recognizer.startRecognition(stream);
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();
        System.out.println("...Printing is done..");
    }
}
Here is the output (an album of screenshots I took): http://imgur.com/a/Ou9oH
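As Nikolay said, the audio sounds odd, probably because it was not resampled correctly.
To downsample the audio from the original 22050 Hz to the required 16 kHz, you can run the following command:
sox Vocaroo.wav -r 16000 Vocaroo16.wav
Vocaroo16.wav should sound much better, and it will (probably) give you better ASR results.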
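Before handing a file to the recognizer, it can help to confirm what the WAV header actually declares. The following is a minimal sketch using the standard javax.sound.sampled API; the class name WavFormatCheck and the helper methods matchesExpected and slowdownFactor are made up for this illustration and are not part of Sphinx4. Note that it only reads the header: if a 22050 Hz recording was relabelled as 16 kHz without resampling (which would be consistent with the slow-motion playback described in the question), the header check would still pass, and the audio would play roughly 22050 / 16000 ≈ 1.38 times slower than recorded.

```java
import java.io.File;
import java.io.IOException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

public class WavFormatCheck {
    // Format stated in the question: 16 kHz, 16-bit, mono.
    static final float EXPECTED_RATE = 16000f;
    static final int EXPECTED_BITS = 16;
    static final int EXPECTED_CHANNELS = 1;

    /** True when the declared format is 16 kHz, 16-bit, mono, signed PCM. */
    public static boolean matchesExpected(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getSampleRate() == EXPECTED_RATE
                && f.getSampleSizeInBits() == EXPECTED_BITS
                && f.getChannels() == EXPECTED_CHANNELS;
    }

    /**
     * How many times slower audio plays when samples recorded at
     * recordedRate are played back at playbackRate without resampling.
     */
    public static double slowdownFactor(double recordedRate, double playbackRate) {
        return recordedRate / playbackRate;
    }

    public static void main(String[] args)
            throws IOException, UnsupportedAudioFileException {
        if (args.length != 1) {
            System.err.println("usage: java WavFormatCheck <file.wav>");
            return;
        }
        // Reads only the file header; it does not inspect the samples.
        AudioFormat f = AudioSystem.getAudioFileFormat(new File(args[0])).getFormat();
        System.out.println("Header declares: " + f);
        if (matchesExpected(f)) {
            System.out.println("Header matches 16 kHz, 16-bit, mono PCM.");
        } else {
            System.out.println("Header does not match; resample first, e.g. with sox.");
        }
    }
}
```

Running it on both Vocaroo.wav and Vocaroo16.wav should show whether the sox step changed the declared sample rate as intended; listening to the result is still the only way this sketch leaves to catch a relabelled file.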