BERT sentence embeddings from transformers
I am trying to get sentence vectors from hidden states in a BERT model. Looking at the huggingface BertModel instructions here, which say:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
So note first of all that, as written on the site, this does /not/ run. You get:
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'BertTokenizer' object is not callable
But it looks like a minor change fixes it, in that you don't call the tokenizer directly, but ask it to encode the input:
encoded_input = tokenizer.encode(text, return_tensors="pt")
output = model(encoded_input)
OK, that aside, the tensors I get have a shape different from what I expected:
>>> output[0].shape
torch.Size([1,11,768])
That's a lot of layers. Which is the correct layer to use for sentence embeddings? [0]? [-1]? Averaging several? My goal is to be able to do cosine similarity with these, so I need a proper 1xN vector rather than an NxK tensor.
I see that the popular bert-as-a-service project appears to use [0].
Is this correct? Is there documentation for what each of the layers is?
I don't think there is a single authoritative piece of documentation saying what to use and when. You need to experiment and measure what works best for your task. Recent observations about BERT are nicely summarized in this paper: https://arxiv.org/pdf/2002.12327.pdf.
I think the rule of thumb is:
Use the last layer if you are going to fine-tune the model for your specific task. And fine-tune whenever you can: several hundred or even dozens of training examples are enough.
If you cannot fine-tune the model, use some of the middle layers (7th or 8th). The intuition behind this is that the layers first develop increasingly abstract and general representations of the input; at some point, the representations start to become more targeted to the pre-training task.
Bert-as-services uses the last layer by default (but it is configurable). Here, that would be [:, -1]. However, it always returns a list of vectors for all input tokens. The vector corresponding to the first special token (the so-called [CLS] token) is considered to be the sentence embedding. This is where the [0] in the snippet you refer to comes from. A concrete sketch of both options follows below.
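To make this concrete, here is a minimal sketch, assuming a recent transformers version where the tokenizer is callable and the model returns an output object with last_hidden_state and hidden_states, showing both the [CLS] vector of the last layer and a mean-pooled middle layer:
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased', output_hidden_states=True)
encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)
# output.hidden_states is a tuple: the embedding layer output plus one entry per
# transformer layer, each of shape [batch_size, num_tokens, hidden_size]
cls_vector = output.last_hidden_state[:, 0, :]   # [CLS] vector from the last layer -> [1, 768]
middle = output.hidden_states[8]                 # an intermediate layer (8th)
mean_pooled = middle.mean(dim=1)                 # average over tokens -> [1, 768]
print(cls_vector.shape, mean_pooled.shape)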
Although Jindrich's existing answer is generally correct, it does not address the question entirely. The OP asked which layer he should use to calculate the cosine similarity between sentence embeddings, and the short answer to this question is none. A metric like cosine similarity requires that the dimensions of the vector contribute equally and meaningfully, but this is not the case for the BERT weights released by the original authors. Jacob Devlin (one of the authors of the BERT paper) wrote:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).
However, that does not mean you cannot use BERT for such a task. It just means that you cannot use the pre-trained weights out-of-the-box. You can either train a classifier on top of BERT which learns which sentences are similar (using the [CLS] token; a rough sketch of this option follows below), or you can use sentence-transformers, which can be used in an unsupervised scenario because they were trained to produce meaningful sentence representations.
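As a hedged sketch of the classifier option, the standard sequence-classification head from transformers can be used; the model name and the two labels below are placeholders, and the head must still be fine-tuned on labeled sentence pairs before its outputs mean anything:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # similar / not similar
# The two sentences are encoded jointly as [CLS] A [SEP] B [SEP], so the classifier
# built on the [CLS] position sees both sentences at once.
inputs = tokenizer("What is your age?", "How old are you?", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # meaningless until the head is fine-tuned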
As mentioned in the other answers, BERT was not meant to generate sentence-level embeddings. Now, let's look at how to leverage the power of BERT to compute context-sensitive sentence-level embeddings.
BERT does carry context at the word level; here is an example:
This is a wooden stick.
Stick to your work.
Both of the above sentences contain the word 'stick', and BERT does a good job of computing the embedding of 'stick' according to the sentence (or, say, the context); the sketch below illustrates this.
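A quick way to see this effect, sketched below under the assumption of a recent transformers version (the uncased base model is chosen just for illustration), is to compare the contextual vectors of 'stick' in the two sentences:
import torch
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
def stick_vector(sentence):
    enc = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(enc['input_ids'][0])
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # [num_tokens, 768]
    return hidden[tokens.index('stick')]             # contextual vector of 'stick'
v1 = stick_vector("This is a wooden stick.")
v2 = stick_vector("Stick to your work.")
print(torch.cosine_similarity(v1, v2, dim=0))        # well below 1.0: different contexts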
Now, let's move to another example:
What is your age?
How old are you?
The two sentences above are contextually very similar, so we need a model that can accept a sentence or a chunk of text or a paragraph and produce the right embedding for it as a whole. Here is how that can be achieved.
Approach 1:
Use pre-trained sentence_transformers; here is the link to the huggingface hub.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")
embd_a = model.encode("What is your age?")
embd_b = model.encode("How old are you?")
sim_score = cos_sim(embd_a, embd_b)
print(sim_score)
output: tensor([[0.8648]])
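The same model also works on whole batches; cos_sim then returns a full matrix of pairwise scores. A small usage sketch (the third sentence is a deliberately unrelated example added here for illustration):
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
sentences = ["What is your age?",
             "How old are you?",
             "The weather is nice today."]
embeddings = model.encode(sentences)     # one embedding per sentence
print(cos_sim(embeddings, embeddings))   # 3 x 3 matrix of pairwise cosine scores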
Now, a question may arise: how do we train a sentence_transformer for a specific domain? Here we go,
- Supervised approach:
A common challenge for data scientists or ML engineers is getting correctly annotated data; it is mostly hard to get in good volume. But say you have it, here is how we can train our sentence_transformer (don't worry, there is also an unsupervised approach).
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)
# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
More details here.
Tip: If you have a set of sentences that are similar to each other, say, a CSV where columns A and B contain sentences similar to each other (i.e. each row holds a pair of similar sentences), just load the CSV, assign random values between 0.85 and 0.95 as the similarity score, and proceed.
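A minimal sketch of that tip; pandas, the file name pairs.csv, and the column names A/B are assumptions for illustration, not part of the original answer:
import random
import pandas as pd
from sentence_transformers import InputExample
df = pd.read_csv("pairs.csv")   # hypothetical CSV: columns A and B hold similar sentences
train_examples = [InputExample(texts=[row["A"], row["B"]],
                               label=random.uniform(0.85, 0.95))  # pseudo-label, as the tip suggests
                  for _, row in df.iterrows()]
These examples then feed the same DataLoader / CosineSimilarityLoss training loop shown above.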
- Unsupervised approach
Suppose you don't have a large volume of annotated data, but you want to train a domain-specific sentence_transformer; here is how we do it. Even unsupervised training needs data, i.e. a list of sentences/paragraphs, but no annotation is required. Say you have no data at all; there is a workaround as well (please visit the last part of the answer).
Multiple approaches are available for unsupervised training; the two most prominent ones are listed here. To see a list of all available approaches, please visit here.
TSDAE link to the research paper.
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader
# Define your sentence transformer model using CLS pooling
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
"Model will automatically add the noise",
"And re-construct it",
"You should provide at least 1k sentences"]
# Create the special denoising dataset that adds noise on-the-fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
# DataLoader to batch your data
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)
# Call the fit method
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
weight_decay=0,
scheduler='constantlr',
optimizer_params={'lr': 3e-5},
show_progress_bar=True
)
model.save('output/tsdae-model')
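Once saved, the adapted model loads like any other SentenceTransformer; a short usage sketch (the two sentences are placeholders):
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
tsdae_model = SentenceTransformer('output/tsdae-model')
emb = tsdae_model.encode(["Your domain-specific sentence",
                          "Another sentence from the same domain"])
print(cos_sim(emb[0], emb[1]))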
SimCSE link to the research paper.
from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers import models, losses
from torch.utils.data import DataLoader
# Define your sentence transformer model using CLS pooling
model_name = 'distilroberta-base'
word_embedding_model = models.Transformer(model_name, max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
"Model will automatically add the noise",
"And re-construct it",
"You should provide at least 1k sentences"]
# Convert train sentences to sentence pairs
train_data = [InputExample(texts=[s, s]) for s in train_sentences]
# DataLoader to batch your data
train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)
# Use MultipleNegativesRankingLoss (contrastive loss with in-batch negatives, as used by SimCSE)
train_loss = losses.MultipleNegativesRankingLoss(model)
# Call the fit method
model.fit(
train_objectives=[(train_dataloader, train_loss)],
epochs=1,
show_progress_bar=True
)
model.save('output/simcse-model')
Tip: If you observe carefully, the major difference is in the loss function used. To see a list of all loss functions applicable to such training scenarios, visit here. Also, across all the experiments I did, I found TSDAE more useful when you want decent precision and good recall, whereas SimCSE can be used when you want very high precision and low recall.
Now, if you don't have sufficient data to fine-tune the model, but you have found a BERT model trained on your domain, you can leverage it directly by adding pooling and dense layers. Please do read up on what 'pooling' is to get a better understanding of what you are doing.
from sentence_transformers import SentenceTransformer, models
from torch import nn
word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
Tip: With the above approach, if you start getting extremely high cosine scores, it is an alarm to do negative testing. Sometimes simply adding pooling layers may not help; take a few examples and check the similarity scores for inputs that are not similar (it is possible that even dissimilar sentences show high similarity, and that is when you should stop, collect some data, and do unsupervised training). A small negative-test sketch follows below.
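A minimal sketch of such a negative test, continuing with the pooling + dense model defined above (the sentence pairs are made-up placeholders):
from sentence_transformers.util import cos_sim
# Pairs that should NOT be similar; if these also get very high scores, the
# untrained pooling/dense stack is not discriminating and you need training data.
dissimilar_pairs = [("The invoice is overdue.", "Cats are independent animals."),
                    ("Restart the database server.", "She enjoys hiking on weekends.")]
for a, b in dissimilar_pairs:
    emb = model.encode([a, b])
    print(a, "|", b, "->", cos_sim(emb[0], emb[1]).item())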
For those interested in digging deeper, here is a list of topics that may help you.
- Pooling
- Siamese networks
- Contrastive loss
:) :)