Embedding a corpus with BERT (as well as saving the vocab) without using pretrained BERT

embedding
corpus
pre-trained-model

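As with word2vec / GloVe, I would like to embed my domain-specific corpus (about 10 million words) with BERT from scratch. I could then use those embeddings for sentence similarity (I am already using SBERT). But I don't want to use any pretrained models/data (fine-tuning models for classification/next sentence prediction).

So far, I could not find any solutions/approaches for embedding (my own) corpus with BERT, other than the ones used here: https://github.com/google-research/bert/blob/master/run_classifier.py

Is there a way to do this? Thanks.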

I think your problem has already been addressed in the following issue: https://github.com/google-research/bert/issues/615

Also, to generate a domain-specific vocabulary, see this repo: https://github.com/kwonmha/bert-vocab-builder
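
In case it helps to see the file layout: BERT's vocab.txt is plain text with one token per line, and the zero-based line number is the token ID. The snippet below is only a rough sketch of that layout using the standard library. It builds a simple word-level vocabulary from a corpus and saves it; it does not learn WordPiece subwords (the `##` continuation pieces) that a real BERT vocabulary contains. The function names (`build_vocab`, `save_vocab`, `load_vocab`), the regex tokenization, and the size limits are assumptions made for illustration.

```python
import re
from collections import Counter

# Special tokens used by BERT; placing them first is a choice made for this sketch.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(lines, max_size=30000, min_count=1):
    """Return a word-level vocab list (simplified: no WordPiece subwords)."""
    counts = Counter()
    for line in lines:
        # Lowercase, then split into runs of word characters and single punctuation marks.
        counts.update(re.findall(r"\w+|[^\w\s]", line.lower()))
    # Most frequent first; ties broken alphabetically so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    tokens = [tok for tok, n in ranked if n >= min_count]
    room = max(0, max_size - len(SPECIAL_TOKENS))
    return SPECIAL_TOKENS + tokens[:room]


def save_vocab(vocab, path):
    """Write one token per line; the line index becomes the token ID."""
    with open(path, "w", encoding="utf-8") as f:
        for tok in vocab:
            f.write(tok + "\n")


def load_vocab(path):
    """Read a vocab file back into a token -> ID mapping."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): i for i, line in enumerate(f)}


if __name__ == "__main__":
    corpus = ["The cell wall protects the cell.", "The wall is thick."]
    vocab = build_vocab(corpus)
    save_vocab(vocab, "vocab.txt")
    print(load_vocab("vocab.txt"))
```

For a real pretraining run you would want a proper WordPiece vocabulary, for example from the repo linked above, rather than this word-level list.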

Hope this helps!
