How to specify a forced_bos_token_id when using Facebook's M2M-100 HuggingFace model through AWS SageMaker?

The model page provides a code snippet showing how to use the model:

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")

# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."

# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a box of chocolate."

It also provides a snippet showing how to deploy and use the model with AWS SageMaker:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()
# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID':'facebook/m2m100_1.2B',
    'HF_TASK':'text2text-generation'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.m5.xlarge' # ec2 instance type
)

predictor.predict({
    'inputs': "The answer to the universe is"
})

However, it is not clear how to specify the source or target language with the AWS setup. I tried:

predictor.predict({
    'inputs': "The answer to the universe is",
    'forced_bos_token_id': "fr"
})

but my parameter was ignored.

I also haven't found any API documentation that explains the expected format.

In any case, the tokenizer needs to be installed and imported:

pip install transformers
pip install sentencepiece

The language ID from the tokenizer then needs to be passed under the `parameters` key:

from transformers import M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
predictor.predict({
    'inputs': "The answer to the universe is",
    'parameters': {
        'forced_bos_token_id': tokenizer.get_lang_id("it")
    }
})
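In other words, generation keyword arguments must be nested under a `parameters` key, and `forced_bos_token_id` must be the integer token ID that `tokenizer.get_lang_id` returns, not a language-code string like `"fr"`. A minimal sketch of the expected payload shape (the helper function and the placeholder ID below are my own illustration, not part of the SageMaker SDK):

```python
# Hypothetical helper illustrating the payload shape the HuggingFace
# inference toolkit expects; the name is mine, not part of any SDK.
def build_translation_payload(text, target_lang_id):
    """Nest generate() kwargs under 'parameters'; the target language
    must be an integer token ID, not a language-code string."""
    if isinstance(target_lang_id, str) or not isinstance(target_lang_id, int):
        raise TypeError(
            "forced_bos_token_id must be an integer token ID "
            "(e.g. from tokenizer.get_lang_id), not a language code string"
        )
    return {
        'inputs': text,
        'parameters': {'forced_bos_token_id': target_lang_id}
    }

# 42 is a placeholder; in practice use tokenizer.get_lang_id("it").
payload = build_translation_payload("The answer to the universe is", 42)
# predictor.predict(payload)
```

This also makes the earlier failure mode explicit: passing `'forced_bos_token_id': "fr"` at the top level of the request (outside `parameters`, and as a string) is silently ignored by the endpoint.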