Error : 'utf-8' codec can't decode bytes in position 7526-7527: invalid continuation byte
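I have connection problems when downloading the BERT model directly (company privacy policy).
So I downloaded BertTokenizer from https://github.com/huggingface/transformers/blob/master/src/transformers/tokenization_bert.py
and got the txt vocabulary file for my model's tokenizer: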
"bert-base-multilingual-cased": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased-vocab.txt",
But when I load the tokenizer, I get an error.
My code:
from transformers import BertTokenizer

# `sentences` is a list of input strings defined earlier
tokenizer = BertTokenizer.from_pretrained("My BERT MODEL DIRECTORY", do_lower_case=False)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print(sentences[0])
print(tokenized_texts[0])
Error message:
'utf-8' codec can't decode bytes in position 7526-7527: invalid continuation byte
I tried adding encoding = 'utf-8' and encoding = 'cp949', like this:
tokenizer = BertTokenizer.from_pretrained("My BERT MODEL DIRECTORY", encoding = 'utf-8', do_lower_case=False)
But it did not work.
Thanks in advance for any comments.
Your string cannot be decoded because it has been truncated. Either you handle the errors manually:
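# Note: .decode() exists only on bytes objects, so this applies to raw bytes, not to str
print(sentences[0].decode('utf-8', 'replace'))  # Replace the invalid bytes with U+FFFD (the replacement character)
print(tokenized_texts[0].decode('utf-8', 'ignore'))  # Completely remove the invalid bytes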
Or you register a handler globally:
import codecs
codecs.register_error('strict', codecs.lookup_error('surrogateescape'))
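To see what the 'replace' and 'ignore' handlers actually do, and to check whether a vocab file is valid UTF-8 at all, something like the following may help. It is only a sketch using the standard library: the helper name `find_decode_error`, the temporary file, and the sample bytes are made up for illustration, and it assumes the file might have been saved in another encoding such as cp949, which the question does not confirm.

```python
import os
import tempfile


def find_decode_error(path, encoding="utf-8"):
    """Return (start, end, bad_bytes) for the first undecodable span, or None."""
    with open(path, "rb") as fh:
        raw = fh.read()
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start, exc.end, raw[exc.start:exc.end]
    return None


# Hypothetical stand-in for a vocab file saved in the wrong encoding:
# the cp949 bytes for a Korean character are not valid UTF-8.
sample = "[PAD]\n[UNK]\n".encode("utf-8") + "한".encode("cp949") + b"\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    vocab_path = tmp.name

try:
    # Byte offsets of the first problem, comparable to the positions in the traceback
    problem = find_decode_error(vocab_path)
    print("first undecodable span:", problem)
finally:
    os.remove(vocab_path)

# The two manual strategies, applied to raw bytes
replaced = sample.decode("utf-8", errors="replace")  # bad bytes become U+FFFD
ignored = sample.decode("utf-8", errors="ignore")    # bad bytes are dropped
print(repr(replaced))
print(repr(ignored))

# If the file really is in another encoding, decoding with it loses nothing
print(repr(sample.decode("cp949")))
```

Both 'replace' and 'ignore' discard information, so for a vocabulary file it is probably safer to find out which encoding the file was saved in (or to download it again in binary mode) than to suppress the error.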