How to truncate input in the Huggingface pipeline?
I currently use a Huggingface pipeline for sentiment analysis, like so:
from transformers import pipeline
classifier = pipeline('sentiment-analysis', device=0)
The problem is that when I pass texts longer than 512 tokens, it crashes with an error saying the input is too long. Is there any way to pass the max_length and truncation parameters from the tokenizer directly to the pipeline?
My workaround is:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=0)
and then, when I call the tokenizer:
pt_batch = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")
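Since this workaround tokenizes outside the pipeline, the batch then has to be run through the model by hand. A minimal sketch of that remaining step (the softmax/argmax decoding is my assumption about how to read off this model's 1-5 star labels):
import torch
# move the batch to wherever the model lives (the pipeline may have put it on GPU)
pt_batch = pt_batch.to(model.device)
with torch.no_grad():
    outputs = model(**pt_batch)
# turn logits into probabilities and pick the top class per input
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
stars = probs.argmax(dim=-1) + 1  # this model's labels are 1-5 star ratings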
But it would be much nicer to be able to call the pipeline directly, like so:
classifier(text, padding=True, truncation=True, max_length=512)
This way should work:
classifier(text, padding=True, truncation=True)
If it doesn't, try loading the tokenizer as:
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)
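Putting that suggestion together, a minimal end-to-end sketch, assuming the same nlptown model as in the question (note that model_max_length is the parameter's spelling in current transformers versions):
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# capping model_max_length tells truncation=True where to cut
tokenizer = AutoTokenizer.from_pretrained(model_name, model_max_length=512)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, device=0)

long_text = "This movie was great. " * 400  # far beyond 512 tokens
result = classifier(long_text, truncation=True)  # truncation is forwarded to the tokenizer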
You can use tokenizer_kwargs at inference time:
model_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0, return_all_scores=True)
tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512, 'return_tensors': 'pt'}
prediction = model_pipeline('sample text to predict', **tokenizer_kwargs)
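For example, calling it on an input that would otherwise blow past the 512-token limit (long_text is a made-up stand-in; also, recent transformers versions set return_tensors internally in the pipeline, so that key can usually be dropped from tokenizer_kwargs):
long_text = "This movie was great. " * 400  # well past 512 tokens once tokenized
tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512}
prediction = model_pipeline(long_text, **tokenizer_kwargs)
print(prediction)  # with return_all_scores=True: one list of label/score dicts per input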
For more details, you can check this link.