BertWordPieceTokenizer vs BertTokenizer from HuggingFace
I have the following pieces of code and I am trying to understand the difference between BertWordPieceTokenizer and BertTokenizer.
BertWordPieceTokenizer (Rust-based)
from tokenizers import BertWordPieceTokenizer
sequence = "Hello, y'all! How are you Tokenizer ?"
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt")
tokenized_sequence = tokenizer.encode(sequence)
print(tokenized_sequence)
>>>Encoding(num_tokens=15, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
print(tokenized_sequence.tokens)
>>>['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', 'token', '##izer', '[UNK]', '?', '[SEP]']
BertTokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer("bert-base-cased-vocab.txt")
tokenized_sequence = tokenizer.encode(sequence)
print(tokenized_sequence)
#Output: [19082, 117, 194, 112, 1155, 106, 1293, 1132, 1128, 22559, 17260, 100, 136]
- Why does encoding work differently in the two? In BertWordPieceTokenizer it gives an Encoding object, while in BertTokenizer it gives the ids from the vocabulary.
- What is the fundamental difference between BertWordPieceTokenizer and BertTokenizer, since, as I understand it, BertTokenizer also uses WordPiece under the hood?
Thanks
They should produce the same output when you use the same vocabulary (in your example you used bert-base-uncased-vocab.txt and bert-base-cased-vocab.txt, which differ). The main difference is that the tokenizers from the tokenizers package are faster than the tokenizers from transformers, because they are implemented in Rust.
When you modify your example so that both use the same vocabulary, you will see that they generate the same ids. The tokenizers tokenizer also produces other attributes (an Encoding object), while the transformers tokenizer only produces a list of ids:
from tokenizers import BertWordPieceTokenizer
sequence = "Hello, y'all! How are you Tokenizer ?"
tokenizerBW = BertWordPieceTokenizer("/content/bert-base-uncased-vocab.txt")
tokenized_sequenceBW = tokenizerBW.encode(sequence)
print(tokenized_sequenceBW)
print(type(tokenized_sequenceBW))
print(tokenized_sequenceBW.ids)
Output:
Encoding(num_tokens=15, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
<class 'Encoding'>
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]
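To see the extra attributes the Encoding object carries, here is a small self-contained sketch. The toy vocabulary file toy-vocab.txt and its contents are made up for illustration (a real run would use bert-base-uncased-vocab.txt), and it assumes the tokenizers package is installed:

```python
from tokenizers import BertWordPieceTokenizer

# Write a tiny made-up vocabulary to disk; the first five entries are the
# special tokens BERT expects, the rest are whole words.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "how", "are", "you"]
with open("toy-vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = BertWordPieceTokenizer("toy-vocab.txt")
encoding = tokenizer.encode("Hello how are you")

# The Encoding object bundles several aligned attributes, not just ids.
print(encoding.tokens)          # ['[CLS]', 'hello', 'how', 'are', 'you', '[SEP]']
print(encoding.ids)             # [2, 5, 6, 7, 8, 3]
print(encoding.offsets)         # character span of each token in the input
print(encoding.attention_mask)  # [1, 1, 1, 1, 1, 1]
```

Because lowercasing is on by default, "Hello" maps to the vocabulary entry "hello", and the special tokens [CLS]/[SEP] are added automatically.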
from transformers import BertTokenizer
tokenizerBT = BertTokenizer("/content/bert-base-uncased-vocab.txt")
tokenized_sequenceBT = tokenizerBT.encode(sequence)
print(tokenized_sequenceBT)
print(type(tokenized_sequenceBT))
Output:
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]
<class 'list'>
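The transformers tokenizer can produce those extra attributes too; encode() is just a shortcut that returns only the ids, while calling the tokenizer directly returns a dict-like BatchEncoding. A sketch with a made-up toy vocabulary file, toy-vocab.txt, written inline for illustration (a real run would use bert-base-uncased-vocab.txt):

```python
from transformers import BertTokenizer

# Made-up toy vocabulary: five BERT special tokens plus a few whole words.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "hello", "how", "are", "you"]
with open("toy-vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = BertTokenizer("toy-vocab.txt")

# encode() returns only a list of ids ...
ids = tokenizer.encode("Hello how are you")
print(ids)  # [2, 5, 6, 7, 8, 3]

# ... while calling the tokenizer returns a BatchEncoding with the ids
# plus aligned attributes, similar to the tokenizers Encoding object.
batch = tokenizer("Hello how are you")
print(batch["input_ids"])
print(batch["token_type_ids"])
print(batch["attention_mask"])
```

So the difference the question points at is one of default return type, not of what the tokenizer is able to compute.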
You mentioned in the comments that your question is more about why the produced output is different. As far as I can tell, this was a design decision made by the developers, and there is no specific reason for it. It is also not the case that the BertWordPieceTokenizer from tokenizers is a drop-in replacement for the BertTokenizer from transformers. They still use a wrapper to make it compatible with the transformers tokenizer API: there is a BertTokenizerFast class which has a "clean up" method _convert_encoding to make the BertWordPieceTokenizer fully compatible. Therefore, you have to compare the BertTokenizer example above with the following:
from transformers import BertTokenizerFast
sequence = "Hello, y'all! How are you Tokenizer ?"
tokenizerBW = BertTokenizerFast.from_pretrained("bert-base-uncased")
tokenized_sequenceBW = tokenizerBW.encode(sequence)
print(tokenized_sequenceBW)
print(type(tokenized_sequenceBW))
Output:
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 19204, 17629, 100, 1029, 102]
<class 'list'>
From my perspective, they built the tokenizers library independently from the transformers library, with the objective of being both fast and useful.
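As for the second question: both classes ultimately run the same WordPiece algorithm, which is greedy longest-match-first against the vocabulary, with "##" marking word-internal pieces and an unknown token when no match exists. A minimal pure-Python sketch of that matching step (toy vocabulary, not the actual HuggingFace implementation):

```python
def wordpiece(word, vocab, unk="[UNK]", max_chars=100):
    """Greedy longest-match-first WordPiece tokenization of a single word."""
    if len(word) > max_chars:
        return [unk]
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest substring first, shrinking from the right.
        end = len(word)
        cur = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are prefixed with ##
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            # No piece of the remaining word is in the vocabulary.
            return [unk]
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece("tokenizer", {"token", "##izer"}))  # ['token', '##izer']
print(wordpiece("xyz", {"token", "##izer"}))        # ['[UNK]']
```

This is why both tokenizers split "Tokenizer" into 'token' and '##izer' in your examples: given the same vocabulary, the algorithm is deterministic, and only the wrapper around it (and its speed) differs.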