How to convert tokenized words back to the original ones after inference?
I am writing an inference script for an already-trained NER model, but I am having trouble converting the encoded tokens (their IDs) back into the original words.
# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks out there!']})
# calling method that handles inference:
ner_model = NER()
ner_model.recognize_from_df(df, 'body')
# here is only part of larger NER class that handles the inference:
def recognize_from_df(self, df: pd.DataFrame, input_col: str):
    predictions = []
    df = df[['_id', input_col]].copy()
    dataset = Dataset.from_pandas(df)
    # tokenization, padding, truncation:
    encoded_dataset = dataset.map(lambda examples: self.bert_tokenizer(examples[input_col],
                                  padding='max_length', truncation=True, max_length=512), batched=True)
    encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'], device=device)
    dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=32)
    encoded_dataset_ids = encoded_dataset['_id']
    for batch in dataloader:
        output = self.model(**batch)
        # decoding predictions and tokens
        for i in range(batch['input_ids'].shape[0]):
            tags = [self.unique_labels[label_id] for label_id in output[i]]
            tokens = [t for t in self.bert_tokenizer.convert_ids_to_tokens(batch['input_ids'][i]) if t != '[PAD]']
            ...
The result is close to what I need:
# tokens:
['[CLS]', 'am', '##az', '##on', 'and', 'te', '##sla', 'are', 'currently', 'the', 'best', 'picks', 'out', 'there', ...]
# tags:
['X', 'B-COMPANY', 'X', 'X', 'O', 'B-COMPANY', 'X', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ...]
How can I merge 'am', '##az', '##on' and 'B-COMPANY', 'X', 'X' into a single token/tag pair? I know the tokenizer has a method called convert_tokens_to_string, but it returns just one big string, which is hard to map back onto the tags.

Best regards
If you only want to "merge" the company names, you can do it in linear time with plain Python.

For brevity, skip the beginning-of-sentence token [CLS]:
tokens = tokens[1:]
tags = tags[1:]
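If your sequences may also contain other BERT special tokens (e.g. [SEP], or [PAD] if you have not already filtered it), it can be safer to strip them all in one pass rather than slicing off only the first position. A minimal sketch, assuming the standard BERT special-token strings:

```python
# Drop every BERT special token together with its tag, keeping the two
# lists aligned. The token set below is an assumption based on the
# standard BERT vocabulary.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def strip_special(tokens, tags):
    pairs = [(tok, tag) for tok, tag in zip(tokens, tags)
             if tok not in SPECIAL_TOKENS]
    if not pairs:
        return [], []
    kept_tokens, kept_tags = zip(*pairs)
    return list(kept_tokens), list(kept_tags)

tokens = ['[CLS]', 'am', '##az', '##on', 'and', '[SEP]']
tags = ['X', 'B-COMPANY', 'X', 'X', 'O', 'X']
tokens, tags = strip_special(tokens, tags)
# tokens -> ['am', '##az', '##on', 'and']
# tags   -> ['B-COMPANY', 'X', 'X', 'O']
```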
The function below merges the company tokens and advances the index accordingly:
def merge_company(tokens, tags):
    generated_tokens = []
    i = 0
    while i < len(tags):
        if tags[i] == "B-COMPANY":
            company_token = [tokens[i]]
            i += 1
            # absorb the following "X"-tagged subword pieces
            while i < len(tags) and tags[i] == "X":
                company_token.append(tokens[i][2:])  # strip the "##" prefix
                i += 1
            generated_tokens.append("".join(company_token))
        else:
            generated_tokens.append(tokens[i])
            i += 1
    return generated_tokens
Usage is straightforward; note that the tags also need their X entries removed:
tokens = merge_company(tokens, tags)
tags = [tag for tag in tags if tag != "X"]
This gives you:
['amazon', 'and', 'tesla', 'are', 'currently', 'the', 'best', 'picks', 'out', 'there']
['B-COMPANY', 'O', 'B-COMPANY', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
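If you later need to handle more entity types than B-COMPANY, the same linear scan generalizes: merge every run of "X"-tagged pieces into the preceding word and keep its tag, so tokens and tags come back already aligned and no separate tag filter is needed. A sketch assuming WordPiece "##" continuations and "X" continuation tags, as in the question:

```python
# Merge "##"-prefixed WordPiece continuations (tagged "X") into the
# preceding word, regardless of which entity tag that word carries.
def merge_entities(tokens, tags):
    merged_tokens, merged_tags = [], []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i], tags[i]
        i += 1
        # absorb subword pieces that belong to the current word
        while i < len(tokens) and tags[i] == "X":
            word += tokens[i][2:]  # drop the "##" prefix
            i += 1
        merged_tokens.append(word)
        merged_tags.append(tag)
    return merged_tokens, merged_tags

tokens = ['am', '##az', '##on', 'and', 'te', '##sla', 'are']
tags = ['B-COMPANY', 'X', 'X', 'O', 'B-COMPANY', 'X', 'O']
words, labels = merge_entities(tokens, tags)
# words  -> ['amazon', 'and', 'tesla', 'are']
# labels -> ['B-COMPANY', 'O', 'B-COMPANY', 'O']
```

Returning both lists from one pass keeps them aligned by construction, which avoids subtle off-by-one errors when filtering tags separately.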