When pre-training a transformer model, how can I add words to the vocabulary?

Question

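Given a DistilBERT language model trained for a given language, taken from the Hugging Face Hub, I want to pre-train the model on a specific domain, and I want to add new words that are:

definitely not present in the original training set
and impossible to handle via word piece tokenization - basically you can think of these words as "codes" that are a normalized form of a named entity

Consider that:

I would like to avoid learning a new tokenizer: I am fine with adding the new words and then letting the model learn their embeddings via pre-training
the number of "words" is far larger than the number of "unused" tokens in the "stock" vocabulary

The only advice that I have found is the one reported here: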

Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialized we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.

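Do you think this is the only way of achieving my goal?

If yes, I have no idea how to write this "script": does anyone have hints on how to proceed (sample code, documentation, etc.)?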

Answer 1

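As per my comments, I will assume that you go with a pre-trained checkpoint, if only to "avoid [learning] a new tokenizer." Also, the solution works with PyTorch, which might be more suitable for such changes. I haven't checked Tensorflow (which is mentioned in one of your quotes), so there is no guarantee that this works across platforms.
To solve your problem, let us divide it into two sub-problems:

Adding the new tokens to the tokenizer, and
Re-sizing the token embedding matrix of the model accordingly.

The first can be achieved quite simply by using .add_tokens(). I'm referencing the slow tokenizer's implementation of it (because it's in Python), but from what I can see, it also exists for the faster Rust-based tokenizers.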

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Will return an integer corresponding to the number of added tokens
# The input could also be a list of strings instead of a single string
num_new_tokens = tokenizer.add_tokens("dennlinger")
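To make the effect of adding a token more concrete, here is a toy sketch in plain Python. It is not the library's implementation: the tiny vocabulary and the helper names `wordpiece` and `tokenize` are made up for illustration. It assumes, roughly as the real tokenizers do, that an added token is matched as a whole before any word-piece splitting, and that it receives the next free id after the existing vocabulary.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab, added_tokens):
    """Keep added tokens whole; split every other word into word pieces."""
    tokens = []
    for word in text.lower().split():
        if word in added_tokens:
            tokens.append(word)
        else:
            tokens.extend(wordpiece(word, vocab))
    return tokens


# Hypothetical miniature vocabulary (the real one has tens of thousands of entries)
base_vocab = ["[UNK]", "this", "is", "den", "##n", "##ling", "##er"]
added_tokens = []

before = tokenize("This is dennlinger", set(base_vocab), set(added_tokens))

added_tokens.append("dennlinger")
after = tokenize("This is dennlinger", set(base_vocab), set(added_tokens))

# Added tokens are appended, so the first one gets id == len(base_vocab)
token_to_id = {tok: i for i, tok in enumerate(base_vocab + added_tokens)}
new_id = token_to_id["dennlinger"]

print(before)  # ['this', 'is', 'den', '##n', '##ling', '##er']
print(after)   # ['this', 'is', 'dennlinger']
print(new_id)  # 7
```

In this toy setup the new id equals the size of the original vocabulary, which is why any embedding table sized for the original vocabulary has no row for it yet.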

You can quickly verify that this worked by looking at the encoded input ids:

print(tokenizer("This is dennlinger."))
# 'input_ids': [101, 2023, 2003, 30522, 1012, 102]

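The index 30522 now corresponds to the new token with my username, so we can check off the first part. However, if we look at the function docstring of .add_tokens(), it also says: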

Note, when adding new tokens to the vocabulary, you should make sure to also resize the token embedding matrix of the model so that its embedding matrix matches the tokenizer.
In order to do that, please use the PreTrainedModel.resize_token_embeddings method.

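Looking at the documentation for this particular function (resize_token_embeddings), the description is a bit confusing, but we can get a correct resize by simply passing the previous model size plus the number of new tokens: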

from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-uncased")
model.resize_token_embeddings(model.config.vocab_size + num_new_tokens)

# Test that everything worked correctly
model(**tokenizer("This is dennlinger", return_tensors="pt"))

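EDIT: Notably, .resize_token_embeddings() also takes care of any associated weights; this means that, if you are pre-training, it will also adjust the size of the language modeling head (which should have the same number of tokens), or fix tied weights that would be affected by an increased number of tokens.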
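For intuition about what such a resize does to the input embedding matrix, and how it relates to the advice quoted in the question (append rows, initialize them randomly with a truncated normal, stddev 0.02), here is a minimal PyTorch sketch. The helper `grow_embedding` is hypothetical and not part of transformers; it only covers a standalone `nn.Embedding`, not the language modeling head or tied weights mentioned above, and the library's own initialization of new rows may differ.

```python
import torch
from torch import nn


def grow_embedding(old: nn.Embedding, num_new_tokens: int, std: float = 0.02) -> nn.Embedding:
    """Return a larger embedding whose first rows are copied from `old`.

    Hypothetical helper for illustration; not part of transformers.
    """
    old_size, dim = old.weight.shape
    new = nn.Embedding(old_size + num_new_tokens, dim)
    with torch.no_grad():
        # Randomly initialize every row, truncated at two standard deviations
        nn.init.trunc_normal_(new.weight, mean=0.0, std=std, a=-2 * std, b=2 * std)
        # Overwrite the first rows with the existing (pre-trained) vectors
        new.weight[:old_size] = old.weight
    return new


torch.manual_seed(0)
old = nn.Embedding(10, 4)    # stand-in for a pre-trained embedding matrix
new_id = torch.tensor([10])  # first id past the old vocabulary

# Without resizing, looking up the new id fails: there is no row for it
try:
    old(new_id)
    lookup_failed = False
except (IndexError, RuntimeError):
    lookup_failed = True

new = grow_embedding(old, num_new_tokens=3)
print(lookup_failed)      # True
print(new.weight.shape)   # torch.Size([13, 4])
print(new(new_id).shape)  # torch.Size([1, 4])
```

The old rows are preserved exactly, while the new rows start out as small random vectors, so they only become meaningful after further pre-training on the domain data. With the library itself, a commonly used alternative to computing the size by hand is model.resize_token_embeddings(len(tokenizer)), since len(tokenizer) includes the added tokens.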

pytorch

bert-language-model

huggingface-transformers