BERT get sentence embedding
I am following the code from this page. I have downloaded the BERT model to my local system and am generating sentence embeddings with it.
I have around 500,000 sentences that need sentence embeddings, and it is taking a lot of time.
- Is there a way to speed up the process?
- Would sending batches of sentences rather than one sentence at a time help?
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpa = ["i am a boy", "i live in a city"]
storage = []  # list to store all embeddings
for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"
    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)
    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1] * len(tokenized_text)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)
        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]

    # `hidden_states` has shape [13 x 1 x 22 x 768]
    # `token_vecs` is a tensor with shape [22 x 768]
    token_vecs = hidden_states[-2][0]
    # Calculate the average of all 22 token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    storage.append((text, sentence_embedding))
Update 1
I modified my code based on the answer provided. It still does not do full batching.
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)

storage = []  # list to store all embeddings
# Note: `text` here is the list of token ids for each sentence, not the raw string.
for i, text in enumerate(encoded_inputs['input_ids']):
    tokens_tensor = torch.tensor([encoded_inputs['input_ids'][i]])
    segments_tensors = torch.tensor([encoded_inputs['attention_mask'][i]])
    print(tokens_tensor)
    print(segments_tensors)

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)
        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]

    # `hidden_states` has shape [13 x 1 x 22 x 768]
    # `token_vecs` is a tensor with shape [22 x 768]
    token_vecs = hidden_states[-2][0]
    # Calculate the average of all 22 token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print(sentence_embedding[:10])
    storage.append((text, sentence_embedding))
I can update the first two lines inside the for loop to the ones below, but they only work when all sentences have the same length after tokenization:

tokens_tensor = torch.tensor([encoded_inputs['input_ids']])
segments_tensors = torch.tensor([encoded_inputs['attention_mask']])

Moreover, in that case outputs = model(tokens_tensor, segments_tensors) fails.
How can I do full batching in a case like this?
One of the easiest ways to speed up this workflow is batching the data. In the current implementation you feed only one sentence per iteration, but batched data can be used!

Now, if you are willing to implement this part yourself, I highly recommend using the tokenizer this way to prepare your data:
batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}
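Note that the three lists above have different lengths, which is exactly why wrapping them in a plain torch.tensor(...) call fails. A minimal sketch of how to get rectangular, batch-ready tensors straight from the tokenizer (assuming a transformers version recent enough to support the padding and return_tensors arguments):

encoded_inputs = tokenizer(batch_sentences, padding=True, return_tensors='pt')
# encoded_inputs['input_ids'] and encoded_inputs['attention_mask'] are now
# LongTensors of shape [3 x 9]: shorter sentences are padded out to the
# longest one, and attention_mask is 0 at the padded positions.
with torch.no_grad():
    outputs = model(**encoded_inputs)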
There is an even easier approach, though, using the FeatureExtractionPipeline together with its comprehensive documentation! It looks like this:
from transformers import pipeline

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction(["Hello I'm a single sentence",
                               "And another sentence",
                               "And the very very last one"])
Update 1
Indeed, you changed your code slightly, but you are still passing one sample at a time rather than a batch. If we want to stick with your implementation, batching would look like this:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
model.eval()

sentences = [
    "Hello I'm a single sentence",
    "And another sentence",
    "And the very very last one",
    "Hello I'm a single sentence",
    "And another sentence",
    "And the very very last one",
    "Hello I'm a single sentence",
    "And another sentence",
    "And the very very last one",
]

batch_size = 4
for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx + batch_size)]

    # encoded = tokenizer(batch)
    encoded = tokenizer.batch_encode_plus(batch, max_length=50, padding='max_length', truncation=True)
    encoded = {key: torch.LongTensor(value) for key, value in encoded.items()}

    with torch.no_grad():
        outputs = model(**encoded)

    print(outputs.last_hidden_state.size())
Output:
torch.Size([4, 50, 768]) # batch_size * max_length * hidden dim
torch.Size([4, 50, 768])
torch.Size([1, 50, 768])
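(With 9 sentences and batch_size = 4, the batches contain 4, 4, and 1 sentences, which is why the last printed size has a leading dimension of 1.)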
Update 2
Regarding padding the batched data out to a maximum length, there are two questions. First, can the padding distract the transformer model with irrelevant information? No: during training the model was already presented with variable-length input sentences in batched form, and the designers introduced a dedicated input, the attention mask, to tell the model WHERE it should attend! Second, how do we get rid of this garbage padded data? Using the attention mask, the mean can be computed over the relevant tokens only!

So the code would change to something like this:
for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx + batch_size)]

    # encoded = tokenizer(batch)
    encoded = tokenizer.batch_encode_plus(batch, max_length=50, padding='max_length', truncation=True)
    encoded = {key: torch.LongTensor(value) for key, value in encoded.items()}

    with torch.no_grad():
        outputs = model(**encoded)

    lhs = outputs.last_hidden_state
    # Broadcast the attention mask to [batch x seq_len x hidden] and zero out
    # the hidden states at the padded positions.
    attention = encoded['attention_mask'].reshape((lhs.size()[0], lhs.size()[1], -1)).expand(-1, -1, 768)
    embeddings = torch.mul(lhs, attention)
    # Average over the non-padded tokens only.
    denominator = torch.count_nonzero(embeddings, dim=1)
    summation = torch.sum(embeddings, dim=1)
    mean_embeddings = torch.div(summation, denominator)
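One caveat with torch.count_nonzero here: it counts nonzero elements of the masked embeddings, so a genuine hidden-state value that happens to be exactly 0.0 would be miscounted. A common alternative (my own sketch, not part of the original answer) derives the denominator from the attention mask itself. Inside the same loop, the pooling lines could instead read:

# Mean pooling driven by the attention mask rather than nonzero counts.
mask = encoded['attention_mask'].unsqueeze(-1).float()   # [batch x seq_len x 1]
summed = (outputs.last_hidden_state * mask).sum(dim=1)   # [batch x 768]
counts = mask.sum(dim=1).clamp(min=1e-9)                 # [batch x 1], avoids division by zero
mean_embeddings = summed / counts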
Regarding your original question: there is not much you can do about it. BERT is a computationally demanding algorithm. Your best shot is to use BertTokenizerFast instead of the regular BertTokenizer. The "fast" version is much more efficient, and you will see the difference with large amounts of text.
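The swap is a one-liner, since the fast tokenizer is a drop-in replacement for the slow one:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# The rest of the pipeline is unchanged; calling the tokenizer or
# batch_encode_plus behaves the same, backed by the Rust tokenizers library.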
Having said that, I have to warn you that averaging BERT word embeddings does not create good embeddings for sentences. See this post. From your questions I assume you want to do some kind of semantic similarity search. Try using one of those open-sourced models.
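For instance, a minimal sketch using the open-source sentence-transformers package (the model name 'all-MiniLM-L6-v2' is one common choice for semantic search, not something prescribed by the post above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # small, fast sentence-embedding model
# encode() batches internally, so 500k sentences are processed efficiently.
embeddings = model.encode(sentences, batch_size=64, show_progress_bar=True)
print(embeddings.shape)  # (len(sentences), 384) for this model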