How is the data used for speech recognition collected and prepared?

As far as I know, most speech recognition implementations depend on binary files that contain the acoustic models for the language they try to 'recognize'.

So how do people compile these models?

One could transcribe lots of speeches manually, but that takes a lot of time. Even then, when given an audio file containing some speech and a full transcription of it in a text file, the individual word pronunciations still need to somehow be separated. To match which parts of the audio correspond to the text, one still needs speech recognition.

How is this gathered? If one is handed over thousands of hours' worth of audio files and their full transcriptions (disregarding the problem of having to transcribe manually), how can the audio be split up at the right intervals where one word ends and another begins? Wouldn't the software producing these acoustic models already have to be capable of speech recognition?

So how do people compile these models?

You can learn about the process from the CMUSphinx acoustic model training tutorial.
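The core of that preparation is a database directory with index files pairing each audio file with its transcript. As a minimal sketch (the corpus name "mydb", the utterance ids, and the texts below are all made up), the two key index files could be generated like this:

```python
import os

# Minimal sketch of the two index files the CMUSphinx trainer reads:
# <db>.fileids lists one utterance id per line, and <db>.transcription
# holds the matching text wrapped in <s>...</s> and tagged with the id.
# Corpus name and utterances here are invented for illustration.

utterances = [
    ("speaker_1/utt_0001", "hello world"),
    ("speaker_1/utt_0002", "prepare data for training"),
]

os.makedirs("etc", exist_ok=True)
with open("etc/mydb_train.fileids", "w") as fileids, \
     open("etc/mydb_train.transcription", "w") as transcription:
    for utt_id, text in utterances:
        fileids.write(utt_id + "\n")   # trainer expects wav/<utt_id>.wav on disk
        transcription.write(f"<s> {text} </s> ({utt_id.split('/')[-1]})\n")
```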

One could transcribe lots of speeches manually, but that takes a lot of time.

That's correct, model preparation takes a lot of time. Speech is transcribed manually. You can also use speech that has already been transcribed for training, for example movies with subtitles, transcribed lectures, or audiobooks.
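For subtitled material, the subtitle timestamps already give a rough utterance segmentation for free. A minimal sketch, assuming a standard .srt file (the file path is illustrative, and a real pipeline would also normalize the text and verify the timings against the audio):

```python
import re

# Turn an .srt subtitle file into (start_sec, end_sec, text) utterances
# that could seed acoustic-model training.

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(path):
    segments = []
    with open(path, encoding="utf-8") as f:
        blocks = f.read().split("\n\n")
    for block in blocks:
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        if len(lines) < 3:
            continue                      # need index, timing, and text lines
        match = TIME.match(lines[1])      # second line carries the timing
        if not match:
            continue
        g = match.groups()
        start, end = to_seconds(*g[:4]), to_seconds(*g[4:])
        segments.append((start, end, " ".join(lines[2:])))
    return segments
```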

Even then, when given an audio file containing some speech and a full transcription of it in a text file, the individual word pronunciations still need to somehow be separated. To match which parts of the audio correspond to the text, one still needs speech recognition.

You need to separate sentences 5-20 seconds long, not individual words. Speech recognition training can learn the model from sentences, called utterances, and it segments the words automatically. This segmentation is done in an unsupervised way, essentially a kind of clustering, so it does not require the system to recognize speech; it simply detects structurally similar chunks within the sentences and assigns them to phonemes. This makes training on whole utterances much easier than training on separate words.
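In other words, the manual preparation only has to produce utterance-sized pieces; the trainer aligns words and phonemes inside them. As a rough sketch of that preparation step, assuming the pydub library and made-up silence thresholds that would need tuning per corpus, one could cut a long recording at pauses:

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Cut a long recording at pauses so each piece is roughly utterance-sized.
audio = AudioSegment.from_file("lecture.wav")
chunks = split_on_silence(
    audio,
    min_silence_len=500,              # a pause of at least 500 ms splits utterances
    silence_thresh=audio.dBFS - 16,   # "silence" relative to the average level
    keep_silence=200,                 # keep a little context around each cut
)

for i, chunk in enumerate(chunks):
    if 5_000 <= len(chunk) <= 20_000:   # keep only 5-20 s pieces (lengths are in ms)
        chunk.export(f"utt_{i:04d}.wav", format="wav")
```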

How is this gathered? If one is handed over thousands of hours' worth of audio files and their full transcriptions (disregarding the problem of having to transcribe manually), how can the audio be split up at the right intervals where one word ends and another begins? Wouldn't the software producing these acoustic models already have to be capable of speech recognition?

You need to bootstrap the system from some manually transcribed database of recordings, 50-100 hours in size. You can read about an example here. For many popular languages like English, French, German, and Russian, such databases already exist. For some, they are in progress in the dedicated resource.

Once you have the initial database, you can take a large set of videos and segment them with the existing models. That helps to create databases thousands of hours in size. For example, such a database was built from TED talks; you can read about it here.
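As a sketch of that segmentation loop (the recognize() function below is a placeholder for whatever decoder and initial model you already have, and the threshold is invented), one common trick is to keep only the chunks whose decoded hypothesis agrees closely with the reference transcript:

```python
import difflib

# Lightly supervised selection: decode each chunk with the current model,
# compare the hypothesis against the reference transcript, and keep only
# chunks that match well enough to trust as new training data.

def recognize(wav_path):
    raise NotImplementedError  # stand-in for the existing recognizer

def keep_reliable(chunks, reference_texts, threshold=0.9):
    kept = []
    for wav_path, reference in zip(chunks, reference_texts):
        hypothesis = recognize(wav_path)
        score = difflib.SequenceMatcher(
            None, hypothesis.split(), reference.split()
        ).ratio()
        if score >= threshold:   # trust chunks the model already gets right
            kept.append((wav_path, reference))
    return kept
```

Each pass through this loop yields more aligned data, which trains a better model, which in turn can segment more material, so the database grows without the chicken-and-egg problem the question worries about.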