The inputs into BERT are token IDs. How do I get the corresponding input token vectors into BERT?

Question

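I am new to this and am learning about transformers.

In a lot of BERT tutorials, I see that the input is just the token IDs of the words. But surely we need to convert each token ID into a vector representation (it could be a one-hot encoding, or any initial vector representation for each token ID) so that the model can use it.

My question is: where can I find this initial vector representation for each token?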

Answer 1

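With the TensorFlow Hub BERT models, you can start from the string itself. The preprocessing model converts it into tokens (token IDs), and the encoder then creates the vectors from them. Let's see an example: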

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 (registers the ops the preprocessing model needs)

prep_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
enc_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4'
bert_preprocess = hub.KerasLayer(prep_url)
bert_encoder = hub.KerasLayer(enc_url)

text = tf.constant(["Hello I'm new to stack overflow"])

# First, you need to preprocess (tokenize) the raw text.
preprocessed_text = bert_preprocess(text)
# This gives you a dict with a few keys such as input_word_ids, i.e. the token IDs produced by the tokenizer.

encoded = bert_encoder(preprocessed_text)
# This gives you a dict; encoded['pooled_output'] is the (1, 768) vector with the context value of the text above.

# You can play with both dicts by printing their keys().

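I recommend visiting the two links above and doing some research. To recap, BERT takes strings as input and then tokenizes them (with its own tokenizer!). If you want to tokenize with the same values yourself, you need the same vocab file, but for a fresh start like yours, the approach above should be enough.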
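As for where the initial vector for each token ID comes from: it is a row of an embedding matrix that is learned together with the rest of the model, and looking up a row is equivalent to multiplying a one-hot vector by that matrix. The sketch below illustrates only this idea with a toy table. The vocabulary, the hidden size, and the random values are all made up for illustration; they are not BERT's real weights. In a real BERT model, position and segment embeddings are also added to the token embedding before the first encoder layer. If you use Hugging Face transformers, the learned table should be reachable through the model's `get_input_embeddings()` method.

```python
import random

# Toy vocabulary and hidden size (hypothetical values for illustration;
# bert-base-uncased has about 30k tokens and a hidden size of 768).
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "world"]
HIDDEN_SIZE = 4

# Stand-in for the learned embedding matrix: one row per token ID.
# Real models load trained weights here instead of random numbers.
rng = random.Random(0)
EMBEDDING_TABLE = [
    [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in VOCAB
]


def embed_by_lookup(token_ids):
    """Return the table row for each token ID (what an embedding layer does)."""
    return [list(EMBEDDING_TABLE[i]) for i in token_ids]


def embed_by_one_hot(token_ids):
    """Compute the same vectors as one-hot(token_id) times the table."""
    vectors = []
    for token_id in token_ids:
        one_hot = [1.0 if j == token_id else 0.0 for j in range(len(VOCAB))]
        vectors.append([
            sum(one_hot[j] * EMBEDDING_TABLE[j][k] for j in range(len(VOCAB)))
            for k in range(HIDDEN_SIZE)
        ])
    return vectors


# Token IDs are just row indices into the table.
token_ids = [VOCAB.index(t) for t in ["[CLS]", "hello", "world", "[SEP]"]]
vectors = embed_by_lookup(token_ids)

print(token_ids)                      # [2, 4, 5, 3]
print(len(vectors), len(vectors[0]))  # 4 4
```

This is why tutorials only show token IDs as input: the ID-to-vector step happens inside the model as its first layer, so you never have to supply the vectors yourself.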

nlp

word-embedding

bert-language-model

huggingface-transformers