Understanding the relation between neural networks and hidden Markov models

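I have read several papers about speech recognition based on neural networks, Gaussian mixture models and hidden Markov models. In my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented ideas, but I still have trouble with some details. I would be grateful if someone could enlighten me.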

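As I understand it, the procedure consists of three elements:

Input: The audio stream is split into frames of 10 ms and processed by MFCC, which outputs a feature vector.
DNN: The neural network takes the feature vector as input and processes the features so that each frame (phone) is distinguishable, or rather gives a representation of the phone in context.
HMM: The HMM is a state model in which each state represents a tri-phone. Each state has a certain probability of changing to every other state. Now the output layer of the DNN produces a feature vector that tells the current state which state it has to change to next.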

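What I don't get: how are the features of the output layer (DNN) mapped to the probabilities of the states? And how is the HMM created in the first place? Where do I get all the information about the probabilities?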

I don't need to understand every detail; the basic concept is sufficient. I just need to be sure that my basic idea of the process is right.

In my research, I came across the paper "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition" by George E. Dahl, Dong Yu, et al. I think I understand most of the presented ideas, but I still have trouble with some details.

It is better to read a textbook than a research paper.

so that each frame (phone) is distinguishable, or rather gives a representation of the phone in context.

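This sentence has no clear meaning, which suggests you are not quite sure about it yourself. The DNN takes the features of a frame and produces probabilities for the states.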
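To make "takes frame features and produces state probabilities" concrete, here is a minimal sketch, assuming a single linear layer followed by a softmax. A real acoustic model has several hidden layers and thousands of output states; the weights, biases and feature values below are made-up numbers, and `state_posteriors` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def state_posteriors(frame_features, weights, biases):
    """One linear layer followed by softmax: p(state | frame)."""
    scores = [
        sum(w * x for w, x in zip(row, frame_features)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical numbers: 3 features per frame, 2 HMM states.
frame = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 1.0],   # weights for state 0
           [0.0, 1.0, 0.0]]   # weights for state 1
biases = [0.0, 0.0]

posteriors = state_posteriors(frame, weights, biases)
```

The output has one entry per HMM state, not one per input feature, so it is a probability distribution over states rather than another feature vector.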

HMM: The HMM is a state model in which each state represents a tri-phone.

Not necessarily a triphone. Usually the triphones are tied, which means that several triphones correspond to the same state.

Now the output layer of the DNN produces a feature vector

No, the DNN produces state probabilities for the current frame; it does not produce a feature vector.

that tells the current state to which state it has to change next.

No, the next state is selected by the HMM Viterbi algorithm from the current state and the DNN probabilities. The DNN alone does not decide the next state.
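A minimal sketch of Viterbi decoding may help here. It assumes a tiny two-state HMM with made-up transition probabilities, and for simplicity it uses the DNN outputs directly as per-frame state scores. The function name `viterbi` and all numbers are illustrative, not taken from the paper.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_emit[t][s]  : log score of state s for frame t (from the DNN)
    """
    n_states = len(log_init)
    # best[s] = best log score of any path that ends in state s at the current frame
    best = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emit)):
        new_best, ptr = [], []
        for s in range(n_states):
            # Pick the predecessor state that gives the best path into s.
            prev = max(range(n_states), key=lambda r: best[r] + log_trans[r][s])
            ptr.append(prev)
            new_best.append(best[prev] + log_trans[prev][s] + log_emit[t][s])
        best = new_best
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]

def logs(values):
    return [math.log(v) for v in values]

# Hypothetical two-state model and three frames of DNN outputs.
init = logs([0.6, 0.4])
trans = [logs([0.9, 0.1]), logs([0.1, 0.9])]
dnn = [logs([0.9, 0.1]), logs([0.4, 0.6]), logs([0.9, 0.1])]

path = viterbi(init, trans, dnn)
```

In this toy example the middle frame on its own favours state 1 (0.6 against 0.4), yet the decoded path stays in state 0 for all three frames, because switching states twice costs more than the DNN score gains. That is the sense in which the DNN does not decide the next state by itself.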

What I don't get: how are the features of the output layer (DNN) mapped to the probabilities of the states?

The output layer produces probabilities. It says that phone A has probability 0.9 at this frame and phone B has probability 0.1 at this frame.
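One detail worth sketching: the HMM works with likelihoods p(frame | state), while the output layer gives posteriors p(state | frame). In hybrid DNN/HMM systems of the kind the paper describes, the posterior is usually divided by the state prior to get a scaled likelihood. The snippet below is a minimal illustration; the prior values and the helper name `scaled_likelihoods` are assumptions made up for the example.

```python
def scaled_likelihoods(posteriors, priors):
    """Divide each posterior p(state | frame) by the state prior p(state).

    By Bayes' rule the result is proportional to p(frame | state); the
    missing factor p(frame) is the same for every state, so it does not
    change which path the decoder prefers.
    """
    return [post / prior for post, prior in zip(posteriors, priors)]

posteriors = [0.9, 0.1]   # phone A and phone B at this frame, as in the answer
priors = [0.6, 0.4]       # hypothetical: how often each state occurs in the training data

likelihoods = scaled_likelihoods(posteriors, priors)
```

With these made-up priors, phone A is still preferred, but the gap shrinks a little because phone A is also the more common state overall.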

And how is the HMM created in the first place?

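Unlike in end-to-end systems, which do not use an HMM, the HMM is usually trained as an HMM/GMM system with the Baum-Welch algorithm before the DNN is initialized. So you first train the GMM/HMM with Baum-Welch, and then you train the DNN to improve on the GMM.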
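A rough sketch of how the two stages connect, assuming the trained GMM/HMM system is used to align every training frame to an HMM state and those labels then serve as DNN targets. The alignment below is hard-coded, hypothetical data; in a real system it would come from the GMM/HMM. The helper names are made up for the example.

```python
from collections import Counter

def dnn_training_pairs(frames, alignment):
    """Pair every frame's feature vector with the HMM state it was aligned to."""
    if len(frames) != len(alignment):
        raise ValueError("need exactly one state label per frame")
    return list(zip(frames, alignment))

def state_priors(alignment, n_states):
    """Relative frequency of each state in the alignment."""
    counts = Counter(alignment)
    return [counts[s] / len(alignment) for s in range(n_states)]

# Hypothetical data: 5 frames with 2 features each, aligned to 2 states.
frames = [[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7], [0.8, 0.9]]
alignment = [0, 0, 1, 1, 1]   # would come from the trained GMM/HMM system

pairs = dnn_training_pairs(frames, alignment)
priors = state_priors(alignment, n_states=2)
```

The DNN is then trained as an ordinary classifier on `pairs`, and at decoding time it takes over the job the GMMs had of scoring frames against states, while the HMM structure and transitions stay in place.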

Where do I get all the information about the probabilities?

Your last question is hard to understand.

speech-recognition

neural-network

hidden-markov-models