How are the TokenEmbeddings in BERT created?

In the paper describing BERT, there is a passage about the WordPiece embeddings:

We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as E, the final hidden vector of the special [CLS] token as C ∈ ℝ^H, and the final hidden vector for the i-th input token as T_i ∈ ℝ^H. For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.
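For concreteness, the construction in the last sentence of the quote (summing token, segment, and position embeddings) can be sketched in PyTorch roughly as below. The function name input_representation and the token ids are made up for illustration; this is not BERT's actual code:

```python
import torch
import torch.nn as nn

# Sizes as described in the paper (30,000-token vocab); exact values don't matter for the sketch.
vocab_size, max_len, num_segments, hidden = 30_000, 512, 2, 768

token_emb    = nn.Embedding(vocab_size, hidden)    # one row per WordPiece token
segment_emb  = nn.Embedding(num_segments, hidden)  # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)       # learned absolute positions

def input_representation(token_ids, segment_ids):
    # token_ids, segment_ids: LongTensors of shape (batch, seq_len)
    positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # (1, seq_len)
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)

ids  = torch.tensor([[1, 205, 7592, 2]])   # made-up token ids for illustration
segs = torch.zeros_like(ids)               # every position belongs to "sentence A"
E = input_representation(ids, segs)        # shape (1, 4, hidden)
```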

As far as I understand, WordPiece splits words into word pieces like #I #like #swim #ing, but it does not generate embeddings. However, I could not find anything in the paper or in other sources about how these token embeddings are generated. Are they pre-trained before the actual pre-training? How? Or are they randomly initialized?

The WordPiece vocabulary is trained separately, in such a way that the most frequent words are kept together as single tokens, while less frequent words end up split into pieces, ultimately down to characters.
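At tokenization time this shows up as greedy longest-match-first lookup against the learned vocabulary. A toy sketch (the vocabulary here is made up; a real WordPiece vocabulary is learned from corpus frequencies, so actual splits will differ):

```python
# Toy sketch of greedy longest-match-first WordPiece tokenization.
VOCAB = {"i", "like", "swim", "##ming", "##s", "s", "w", "##i", "##m", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:                    # try the longest substring first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are marked with ##
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                     # no piece matches: unknown word
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("like"))      # ['like']           -- frequent word kept whole
print(wordpiece("swimming"))  # ['swim', '##ming'] -- rarer word is split into pieces
```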

The embeddings are trained jointly with the rest of BERT. Backpropagation runs through all layers down to the embeddings, which get updated just like any other parameters in the network.
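A minimal sketch of that mechanic, using a tiny stand-in model rather than BERT: the embedding table is an ordinary trainable parameter, gradients reach it through the layers above, and the optimizer updates it like any other weight.

```python
import torch
import torch.nn as nn

emb  = nn.Embedding(num_embeddings=10, embedding_dim=4)   # stand-in embedding table
head = nn.Linear(4, 2)                                    # stand-in for the layers above
opt  = torch.optim.SGD(list(emb.parameters()) + list(head.parameters()), lr=0.1)

ids    = torch.tensor([[1, 3, 3, 7]])        # a toy batch of token ids
logits = head(emb(ids)).mean(dim=1)          # forward pass through embeddings + head
loss   = logits.sum()                        # dummy loss just to drive backprop
loss.backward()                              # gradients flow back into emb.weight
opt.step()                                   # emb.weight is updated like any other parameter
```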

Note that only the embeddings of tokens actually present in the training batch get updated; the rest stay unchanged. This is also why you want a relatively small WordPiece vocabulary, so that all embeddings are updated frequently enough during training.
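That sparsity is easy to see directly: after a single backward pass, only the rows of the embedding table whose ids appeared in the batch carry a non-zero gradient (again a stand-in nn.Embedding, not BERT itself):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)
ids = torch.tensor([[2, 5, 5]])              # only ids 2 and 5 occur in this batch
emb(ids).sum().backward()                    # dummy loss: sum of the looked-up rows

# Rows with a non-zero gradient are exactly the ids seen in the batch.
touched = (emb.weight.grad.abs().sum(dim=1) > 0).nonzero().flatten()
print(touched)                               # tensor([2, 5])
```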