Huggingface saving tokenizer
I am trying to save a Huggingface tokenizer so that I can later load it from a container that has no internet access.
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_vocabulary("./models/tokenizer/")
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
However, the last line gives the error:
OSError: Can't load config for './models/tokenizer3/'. Make sure that:
- './models/tokenizer3/' is a correct model identifier listed on 'https://huggingface.co/models'
- or './models/tokenizer3/' is the correct path to a directory containing a config.json file
Transformers version: 3.1.0
Unfortunately, that did not help.
Edit 1
Thanks to @ashwin's answer below, I tried save_pretrained instead, but I get the following error:
OSError: Can't load config for './models/tokenizer/'. Make sure that:
- './models/tokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
- or './models/tokenizer/' is the correct path to a directory containing a config.json file
The contents of the tokenizer folder are shown below:
I tried renaming tokenizer_config.json to config.json, and then I got the error:
ValueError: Unrecognized model in ./models/tokenizer/. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: retribert, t5, mobilebert, distilbert, albert, camembert, xlm-roberta, pegasus, marian, mbart, bart, reformer, longformer, roberta, flaubert, bert, openai-gpt, gpt2, transfo-xl, xlnet, xlm, ctrl, electra, encoder-decoder
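As far as I can tell, this ValueError means the renamed file is missing a "model_type" key, which is what AutoConfig dispatches on. A minimal, purely illustrative sketch of a config.json that would resolve for this base model (the temp directory stands in for ./models/tokenizer/):

```python
import json
import os
import tempfile

# Stand-in for ./models/tokenizer/ (illustrative path only).
tokenizer_dir = tempfile.mkdtemp()

# "model_type" is the key AutoConfig uses to pick the architecture;
# "distilbert" matches the base model in the question.
config = {"model_type": "distilbert"}

with open(os.path.join(tokenizer_dir, "config.json"), "w") as f:
    json.dump(config, f)

# Round-trip check that the key is present on disk.
with open(os.path.join(tokenizer_dir, "config.json")) as f:
    print(json.load(f)["model_type"])  # prints: distilbert
```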
save_vocabulary() only saves the tokenizer's vocabulary file (the list of BPE tokens).
To save the whole tokenizer, you should use save_pretrained().
Thus, as follows:
BASE_MODEL = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.save_pretrained("./models/tokenizer/")
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")
Edit:
For some unknown reason, instead of
tokenizer2 = AutoTokenizer.from_pretrained("./models/tokenizer/")
using
tokenizer2 = DistilBertTokenizer.from_pretrained("./models/tokenizer/")
works.
Renaming the "tokenizer_config.json" file created by save_pretrained() to "config.json" solved the same problem in my environment.
You need to save both your model and tokenizer in the same directory. HuggingFace is actually looking for the config.json file of your model, so renaming tokenizer_config.json will not solve the problem.
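The save_pretrained() round trip can be sanity-checked without any network access. A minimal sketch, assuming only that the transformers package is installed; the tiny handwritten vocab and the temp paths are illustrative, not from the question:

```python
import os
import tempfile

from transformers import DistilBertTokenizer

work_dir = tempfile.mkdtemp()

# Build a tiny WordPiece vocab by hand so nothing has to be downloaded.
vocab_path = os.path.join(work_dir, "vocab.txt")
with open(vocab_path, "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "world"]))

tokenizer = DistilBertTokenizer(vocab_path)

# save_pretrained() writes vocab.txt plus tokenizer_config.json and
# special_tokens_map.json into the directory.
save_dir = os.path.join(work_dir, "tokenizer")
os.makedirs(save_dir, exist_ok=True)
tokenizer.save_pretrained(save_dir)

# Reload from disk with the concrete tokenizer class, as in the answer above.
tokenizer2 = DistilBertTokenizer.from_pretrained(save_dir)
print(tokenizer2.tokenize("hello world"))  # prints: ['hello', 'world']
```

Using the concrete DistilBertTokenizer class here side-steps the AutoTokenizer lookup entirely, since it does not need a config.json to decide which class to instantiate.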