Using Hugging Face transformers with arguments in a pipeline

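I am trying to use transformers.pipeline to get BERT embeddings for my input. Without the pipeline I can get output of a constant size, but not with the pipeline, because I could not find a way to pass arguments to it.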

How can I pass tokenizer-related arguments to my pipeline?

from transformers import AutoTokenizer, AutoModel, pipeline

# These are BERT and tokenizer definitions
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = ['hello world']

# Normally I would do something like this to initialize the tokenizer and get the result with constant output
tokens = tokenizer(inputs,padding='max_length', truncation=True, max_length = 500, return_tensors="pt")
model(**tokens)[0].detach().numpy().shape


# using the pipeline 
pipeline("feature-extraction", model=model, tokenizer=tokenizer, device=0)

# or other option
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT",padding='max_length', truncation=True, max_length = 500, return_tensors="pt")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

nlp=pipeline("feature-extraction", model=model, tokenizer=tokenizer, device=0)

# to call the pipeline
nlp("hello world")

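I have tried several approaches, such as the options listed above, but could not get results with a constant output size. A constant output size can be achieved by setting the tokenizer arguments, but I do not know how to give those arguments to the pipeline.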

Any ideas?

The max_length tokenization argument is not supported by default (i.e. no padding to max_length is applied), but you can create your own class and override this behavior:

from transformers import AutoTokenizer, AutoModel
from transformers import FeatureExtractionPipeline
from transformers.tokenization_utils import TruncationStrategy

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = ['hello world']

class MyFeatureExtractionPipeline(FeatureExtractionPipeline):
    def _parse_and_tokenize(
        self, inputs, max_length, padding=True, add_special_tokens=True, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs
    ):
        """
        Parse arguments and tokenize
        """
        # Parse arguments
        if getattr(self.tokenizer, "pad_token", None) is None:
            padding = False
        inputs = self.tokenizer(
            inputs,
            add_special_tokens=add_special_tokens,
            return_tensors=self.framework,
            padding=padding,
            truncation=truncation,
            max_length=max_length
        )
        return inputs

mynlp = MyFeatureExtractionPipeline(model=model, tokenizer=tokenizer)
o = mynlp("hello world", max_length = 500, padding='max_length', truncation=True)

Let's check the size of the output:

print(len(o))
print(len(o[0]))
print(len(o[0][0]))

Output:

1
500
768
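
The 500 in the second dimension comes from padding='max_length' combined with truncation=True: every input ends up with exactly max_length token positions, and the model returns one 768-dimensional vector per position. The following is only a rough sketch of that idea in plain Python. pad_or_truncate is a hypothetical helper, the token ids are made up, and the real tokenizer does more (for example, it keeps the special tokens in place when truncating).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or right-pad a list of token ids to exactly max_length entries.

    Hypothetical illustration only; not the tokenizer's actual implementation.
    """
    clipped = token_ids[:max_length]  # truncation: drop everything past max_length
    return clipped + [pad_id] * (max_length - len(clipped))  # padding: fill the rest


# Made-up token ids for a short and a long input.
short_input = pad_or_truncate([101, 5, 6, 102], max_length=500)
long_input = pad_or_truncate(list(range(600)), max_length=500)

# Both inputs now have the same number of positions.
print(len(short_input), len(long_input))
```

Because both inputs have the same length after this step, the embeddings computed from them have the same shape as well.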

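Please note: this only works with transformers 4.10.X and earlier versions. The team is currently refactoring the pipeline classes, and future versions will require different adjustments (i.e. this will not work once the refactored pipelines are released).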
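
For later releases, a possible alternative is sketched below. It assumes a transformers version in which the feature-extraction pipeline accepts a tokenize_kwargs argument that is forwarded to the tokenizer; please check the documentation of your installed version before relying on it. The helper names fixed_length_tokenize_kwargs and extract_fixed_length_features are hypothetical and only used for illustration.

```python
def fixed_length_tokenize_kwargs(max_length):
    """Tokenizer arguments that pad and truncate every input to max_length."""
    return {"padding": "max_length", "truncation": True, "max_length": max_length}


def extract_fixed_length_features(
    text, max_length=500, model_name="emilyalsentzer/Bio_ClinicalBERT"
):
    """Run the feature-extraction pipeline with fixed-length tokenization.

    Assumption: the installed transformers version supports `tokenize_kwargs`
    on the feature-extraction pipeline.
    """
    # Imported lazily so the helper above can be used without transformers installed.
    from transformers import pipeline

    nlp = pipeline("feature-extraction", model=model_name, tokenizer=model_name)
    return nlp(text, tokenize_kwargs=fixed_length_tokenize_kwargs(max_length))
```

Calling extract_fixed_length_features("hello world") downloads the model on first use. If the assumption above holds, the sequence dimension of the result should match max_length, as in the output of the subclass approach.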