How to apply a pretrained transformer model from huggingface?
I am interested in using a pretrained model from Huggingface for a named entity recognition (NER) task, without further training or testing the model.
On the model page of HuggingFace, the only information for reusing the model is the following:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
I tried the following code, but I get a tensor output rather than a class label for each named entity.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

text = "my text for named entity recognition here."
input_ids = torch.tensor(tokenizer.encode(text, padding=True, truncation=True, max_length=50, add_special_tokens=True)).unsqueeze(0)

with torch.no_grad():
    output = model(input_ids, output_attentions=True)
Any suggestions on how to apply the model to text for NER?
What you are looking for is a named entity recognition pipeline (token classification):
from transformers import AutoTokenizer, pipeline, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForTokenClassification.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
nerpipeline = pipeline('ner', model=model, tokenizer=tokenizer)
text = "my text for named entity recognition here."
nerpipeline(text)
Output:
[{'word': 'my',
'score': 0.5209763050079346,
'entity': 'LABEL_0',
'index': 1,
'start': 0,
'end': 2},
{'word': 'text',
'score': 0.5161970257759094,
'entity': 'LABEL_0',
'index': 2,
'start': 3,
'end': 7},
{'word': 'for',
'score': 0.5297629237174988,
'entity': 'LABEL_1',
'index': 3,
'start': 8,
'end': 11},
{'word': 'named',
'score': 0.5258920788764954,
'entity': 'LABEL_1',
'index': 4,
'start': 12,
'end': 17},
{'word': 'entity',
'score': 0.5415489673614502,
'entity': 'LABEL_1',
'index': 5,
'start': 18,
'end': 24},
{'word': 'recognition',
'score': 0.5396601557731628,
'entity': 'LABEL_1',
'index': 6,
'start': 25,
'end': 36},
{'word': 'here',
'score': 0.5165827870368958,
'entity': 'LABEL_0',
'index': 7,
'start': 37,
'end': 41},
{'word': '.',
'score': 0.5266348123550415,
'entity': 'LABEL_0',
'index': 8,
'start': 41,
'end': 42}]
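The pipeline output above is per-token, so multi-token entities come back as separate dicts. As a minimal sketch of post-processing (the `group_entities` helper and the sample `tokens` list are illustrative, built from the output shown above), consecutive tokens with the same label can be merged into spans:

```python
def group_entities(token_results, text):
    """Merge consecutive tokens that share a label into entity spans."""
    spans = []
    for tok in token_results:
        # Extend the previous span if the label matches and the token is adjacent.
        if spans and tok["entity"] == spans[-1]["entity"] and tok["index"] == spans[-1]["index"] + 1:
            spans[-1]["end"] = tok["end"]
            spans[-1]["index"] = tok["index"]
        else:
            spans.append({"entity": tok["entity"], "start": tok["start"],
                          "end": tok["end"], "index": tok["index"]})
    # Recover the surface text of each span from the character offsets.
    return [{"entity": s["entity"], "text": text[s["start"]:s["end"]]} for s in spans]

text = "my text for named entity recognition here."
# A subset of the pipeline output shown above:
tokens = [
    {"word": "my",     "entity": "LABEL_0", "index": 1, "start": 0,  "end": 2},
    {"word": "text",   "entity": "LABEL_0", "index": 2, "start": 3,  "end": 7},
    {"word": "for",    "entity": "LABEL_1", "index": 3, "start": 8,  "end": 11},
    {"word": "named",  "entity": "LABEL_1", "index": 4, "start": 12, "end": 17},
    {"word": "entity", "entity": "LABEL_1", "index": 5, "start": 18, "end": 24},
]
print(group_entities(tokens, text))
# → [{'entity': 'LABEL_0', 'text': 'my text'}, {'entity': 'LABEL_1', 'text': 'for named entity'}]
```

In recent versions of transformers, the pipeline can do this merging itself via `pipeline('ner', ..., aggregation_strategy="simple")` (formerly `grouped_entities=True`).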
Note that you need to use AutoModelForTokenClassification rather than AutoModel, and that not all models have a trained token classification head (i.e., you will get random weights for the token classification head).
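One quick way to spot an untrained head is to look at the label names in the model config: a fine-tuned token classification model usually carries real tag names (e.g. `B-PER`, `I-LOC`), while a freshly initialized head keeps the generic `LABEL_0`, `LABEL_1`, ... placeholders that also appear in the output above. A minimal sketch (the `looks_untrained` helper is hypothetical; `model.config.id2label` is the real attribute you would pass in):

```python
import re

def looks_untrained(id2label):
    # Freshly initialized classification heads keep the default
    # placeholder names "LABEL_0", "LABEL_1", ... in config.id2label.
    return all(re.fullmatch(r"LABEL_\d+", name) for name in id2label.values())

# What loading Bio_ClinicalBERT with a fresh 2-label head yields:
print(looks_untrained({0: "LABEL_0", 1: "LABEL_1"}))      # → True
# What a model actually fine-tuned for NER might yield:
print(looks_untrained({0: "O", 1: "B-PER", 2: "I-PER"}))  # → False
```

If `model.config.id2label` is all placeholders, scores hovering around 0.5, as in the output above, are exactly what random weights on a two-label head produce.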