How to encode multiple sentences using transformers.BertTokenizer?

Question

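I would like to create a mini-batch by encoding multiple sentences with transformers.BertTokenizer. It seems to work for a single sentence. How can I make it work for several sentences?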

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# tokenizing a single sentence seems to work
tokenizer.encode('this is the first sentence')
>>> [2023, 2003, 1996, 2034, 6251]

# tokenize two sentences
tokenizer.encode(['this is the first sentence', 'another sentence'])
>>> [100, 100] # expecting 7 tokens

Answer 1

transformers >= 4.0.0:
Use the tokenizer's __call__ method. It returns a dictionary containing input_ids, token_type_ids, and attention_mask, each as a list with one entry per input sentence:

tokenizer(['this is the first sentence', 'another setence'])

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}
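
The two id lists above have different lengths (7 and 5), so they cannot be stacked into a rectangular mini-batch as they are. In recent transformers versions, passing padding=True to the tokenizer call takes care of this. The pure-Python sketch below only illustrates what right-padding amounts to; pad_batch is a hypothetical helper, not part of the library, and it assumes a pad id of 0 (the [PAD] id in the bert-base-uncased vocabulary).

```python
def pad_batch(input_ids, pad_id=0):
    """Right-pad every id list to the longest length and build attention masks.

    pad_batch is an illustrative helper, not a transformers API.
    pad_id=0 is an assumption matching [PAD] in bert-base-uncased.
    """
    max_len = max(len(ids) for ids in input_ids)
    padded, masks = [], []
    for ids in input_ids:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        masks.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": padded, "attention_mask": masks}


# The id lists are copied from the tokenizer output shown above.
batch = pad_batch([
    [101, 2023, 2003, 1996, 2034, 6251, 102],
    [101, 2178, 2275, 10127, 102],
])
print(batch["input_ids"][1])       # [101, 2178, 2275, 10127, 102, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 0, 0]
```

After padding, every row has the same length, and the attention mask tells the model which positions to ignore.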

transformers < 4.0.0:
Use tokenizer.batch_encode_plus (documentation). It returns a dictionary containing input_ids, token_type_ids, and attention_mask, each as a list with one entry per input sentence:

tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'])

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

Applies to both __call__ and batch_encode_plus:
If you only want input_ids, you have to set return_token_type_ids and return_attention_mask to False:

tokenizer.batch_encode_plus(['this is the first sentence', 'another setence'], return_token_type_ids=False, return_attention_mask=False)

Output:

{'input_ids': [[101, 2023, 2003, 1996, 2034, 6251, 102], [101, 2178, 2275, 10127, 102]]}
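
As background on the [100, 100] result in the question: as far as I can tell, tokenizer.encode treats a list of strings as a sequence of already-split tokens and looks each one up in the vocabulary directly. A whole sentence is not a vocabulary entry, so each one maps to the unknown token, whose id is 100 in bert-base-uncased. The sketch below mimics that lookup with a toy vocabulary; toy_vocab and tokens_to_ids are hypothetical names for illustration, and the ids are copied from the outputs in this thread.

```python
UNK_ID = 100  # id of [UNK] in bert-base-uncased

# Toy vocabulary (illustrative only); ids taken from the outputs above.
toy_vocab = {
    "this": 2023, "is": 2003, "the": 1996,
    "first": 2034, "sentence": 6251, "another": 2178,
}


def tokens_to_ids(tokens):
    """Look up each token as-is; anything not in the vocabulary becomes UNK_ID."""
    return [toy_vocab.get(tok, UNK_ID) for tok in tokens]


# Each full sentence is treated as one "token" and is not in the vocabulary.
whole = tokens_to_ids(["this is the first sentence", "another sentence"])
print(whole)  # [100, 100]

# Splitting into words first gives the expected ids.
split = tokens_to_ids("this is the first sentence".split())
print(split)  # [2023, 2003, 1996, 2034, 6251]
```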

word-embedding

huggingface-transformers

huggingface-tokenizers