How to specify a forced_bos_token_id when using Facebook's M2M-100 HuggingFace model through AWS SageMaker?
The model page provides a code snippet showing how to use the model:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
hi_text = "जीवन एक चॉकलेट बॉक्स की तरह है।"
chinese_text = "生活就像一盒巧克力。"
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_1.2B")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")
# translate Hindi to French
tokenizer.src_lang = "hi"
encoded_hi = tokenizer(hi_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.get_lang_id("fr"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "La vie est comme une boîte de chocolat."
# translate Chinese to English
tokenizer.src_lang = "zh"
encoded_zh = tokenizer(chinese_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_zh, forced_bos_token_id=tokenizer.get_lang_id("en"))
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Life is like a box of chocolate."
It also provides a snippet showing how to deploy and use it with AWS SageMaker:
from sagemaker.huggingface import HuggingFaceModel
import sagemaker

role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'facebook/m2m100_1.2B',
    'HF_TASK': 'text2text-generation'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type='ml.m5.xlarge'  # ec2 instance type
)

predictor.predict({
    'inputs': "The answer to the universe is"
})
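For context, predict returns the deserialized JSON from the endpoint; for the 'text2text-generation' task this is typically a list of dicts with a 'generated_text' field. A minimal sketch of reading such a response (the response value below is made up for illustration):

```python
# Hypothetical response shape for the 'text2text-generation' task:
# the Hugging Face inference toolkit typically returns a JSON list
# of {'generated_text': ...} dicts.
example_response = [{'generated_text': 'Life is like a box of chocolate.'}]

# Take the first (and usually only) generation
text = example_response[0]['generated_text']
print(text)
```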
However, it is unclear how to specify the source or target language with this AWS setup. I tried:
predictor.predict({
    'inputs': "The answer to the universe is",
    'forced_bos_token_id': "fr"
})
but my parameter was ignored.
I have not found any API documentation that explains the expected format.
In any case, the tokenizer needs to be installed and imported:
pip install transformers
pip install sentencepiece
Then the language token id obtained from the tokenizer needs to be passed under the 'parameters' key:
from transformers import M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")

predictor.predict({
    'inputs': "The answer to the universe is",
    'parameters': {
        'forced_bos_token_id': tokenizer.get_lang_id("it")
    }
})
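This also explains why passing 'forced_bos_token_id': "fr" at the top level was ignored: generate() expects an integer token id, not a language code string, and get_lang_id performs that lookup. A minimal sketch of the idea, using a toy vocabulary (the "__<code>__" token format follows M2M-100's convention, but the ids below are made up for illustration; the real ids come from the tokenizer):

```python
# Minimal sketch of the lookup get_lang_id performs, with an
# illustrative toy vocabulary; real ids come from M2M100Tokenizer.
def lang_code_to_token(lang_code: str) -> str:
    # M2M-100 represents languages as special tokens like "__fr__"
    return f"__{lang_code}__"

# Illustrative ids only; not the model's actual vocabulary values.
toy_vocab = {"__en__": 128022, "__fr__": 128028, "__it__": 128040}

def get_lang_id(lang_code: str, vocab=toy_vocab) -> int:
    return vocab[lang_code_to_token(lang_code)]

print(get_lang_id("it"))  # an integer token id, not the string "it"
```

Because the id is a plain integer, it serializes cleanly into the JSON payload sent to the endpoint, which is why this form works where the raw language code did not.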