How to generate a list of tokens that are most likely to occupy the place of a missing token in a given sentence?
I found this , but it only generates a single possible word, not a list of words that fit the sentence. I tried printing out every variable to see whether it had already generated all the possible words, but without success.
For example,
>>> sentence = 'Cristiano Ronaldo dos Santos Aveiro GOIH ComM is a Portuguese professional [].' # [] is missing word
>>> generate(sentence)
['soccer', 'basketball', 'tennis', 'rugby']
I just tried your example with the BERT-base-uncased model on the model hub of HuggingFace, and it did produce a list of possible tokens:
I could write a Colab notebook to show how to code this. Every neural network outputs a probability distribution over tokens, so you can simply return the tokens with the highest probability.
You can basically do the same as in , but instead of adding only the single best fitting token, take, for example, the five best fitting tokens:
def fill_the_gaps(text):
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            # instead of argmax, use argsort to rank all tokens by how well they fit
            predicted_index = torch.argsort(predictions[0, i], descending=True)
            tokens = []
            # take the 5 best fitting tokens and add them to the list
            for k in range(5):
                predicted_token = tokenizer.convert_ids_to_tokens([predicted_index[k].item()])[0]
                tokens.append(predicted_token)
            results.append(tokens)
    return results
For your sentence, the result is: [['footballer', 'golfer', 'football', 'cyclist', 'boxer']]
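For completeness, here is a minimal setup and usage sketch for the snippet above. It assumes the older pytorch-pretrained-bert API, where BertForMaskedLM returns the prediction logits tensor directly (which is what the predictions[0, i] indexing relies on); with the newer transformers package the model call returns an output object, so that line would need a small adjustment. Also note that the missing word must be marked with [MASK] in the input, not [].

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# load the bert-base-uncased tokenizer and masked-language-model head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# replace the missing word with [MASK]; keeping it space-separated ensures the
# tokenizer preserves it as a single token
sentence = ('Cristiano Ronaldo dos Santos Aveiro GOIH ComM is a '
            'Portuguese professional [MASK] .')
print(fill_the_gaps(sentence))
# prints one list of five candidate tokens per [MASK], e.g.:
# [['footballer', 'golfer', 'football', 'cyclist', 'boxer']]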