我正在使用 bert 预训练模型进行问答。它返回正确的结果,但文本之间有很多空格

I'm using bert pre-trained model for question and answering. It's returning correct result but with lot of spaces between the text

我正在使用 bert 预训练模型进行问答。它返回正确的结果,但文本之间有很多空格

代码如下:

def get_answer_using_bert(question, reference_text):
  
  bert_model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

  bert_tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

  input_ids = bert_tokenizer.encode(question, reference_text)
  input_tokens = bert_tokenizer.convert_ids_to_tokens(input_ids)

  sep_location = input_ids.index(bert_tokenizer.sep_token_id)
  first_seg_len, second_seg_len = sep_location + 1, len(input_ids) - (sep_location + 1)
  seg_embedding = [0] * first_seg_len + [1] * second_seg_len

  model_scores = bert_model(torch.tensor([input_ids]), 
  token_type_ids=torch.tensor([seg_embedding]))
  ans_start_loc, ans_end_loc = torch.argmax(model_scores[0]), torch.argmax(model_scores[1])
  result = ' '.join(input_tokens[ans_start_loc:ans_end_loc + 1])

  result = result.replace('#', '')
  return result

后跟以下代码:

reference_text = 'Mukesh Dhirubhai Ambani was born on 19 April 1957 in the British Crown colony of Aden (present-day Yemen) to Dhirubhai Ambani and Kokilaben Ambani. He has a younger brother Anil Ambani and two sisters, Nina Bhadrashyam Kothari and Dipti Dattaraj Salgaonkar. Ambani lived only briefly in Yemen, because his father decided to move back to India in 1958 to start a trading business that focused on spices and textiles. The latter was originally named Vimal but later changed to Only Vimal His family lived in a modest two-bedroom apartment in Bhuleshwar, Mumbai until the 1970s. The family financial status slightly improved when they moved to India but Ambani still lived in a communal society, used public transportation, and never received an allowance. Dhirubhai later purchased a 14-floor apartment block called Sea Wind in Colaba, where, until recently, Ambani and his brother lived with their families on different floors.'
question = 'What is the name of mukesh ambani brother?'

get_answer_using_bert(question, reference_text)

输出为:

'an il am ban i'

谁能帮我解决这个问题。这真的很有帮助。

您可以只使用分词器 decode 函数:

bert_tokenizer.decode(input_ids[ans_start_loc:ans_end_loc +1])

输出:

'anil ambani'

如果你不想使用解码,你可以使用:

result.replace(' ##', '')