attention_mask is missing in the returned dict from tokenizer.encode_plus

Question

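I have a codebase that was working fine, but when I ran it today I noticed that tokenizer.encode_plus had stopped returning attention_mask. Was it removed in the latest release, or do I need to do something differently?

The following call used to work for me.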

encoded_dict = tokenizer.encode_plus(
    truncated_query,
    span_doc_tokens,
    max_length=max_seq_length,
    return_overflowing_tokens=True,
    pad_to_max_length=True,
    stride=max_seq_length - doc_stride - len(truncated_query) - sequence_pair_added_tokens,
    truncation_strategy="only_second",
    return_token_type_ids=True,
    return_attention_mask=True,
)

But now encode_plus only gives me dict_keys(['input_ids', 'token_type_ids']). I also noticed that the returned input_ids are not padded to max_length.

Answer 1

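I figured out the issue. I had upgraded the tokenizers package to the latest version, 0.7.0, but the latest version of transformers (2.8.0 in my environment) works with tokenizers 0.5.2. After rolling back to 0.5.2, the problem went away. This is what pip show reports now.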

Name: transformers
Version: 2.8.0
Summary: State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch
Home-page: https://github.com/huggingface/transformers

Name: tokenizers
Version: 0.5.2
Summary: Fast and Customizable Tokenizers
Home-page: https://github.com/huggingface/tokenizers
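
To catch this kind of mismatch before it shows up as missing keys, the installed versions can be checked from Python. The following is a minimal sketch, not an official compatibility check: the helper names installed_versions and find_mismatches are hypothetical, and the EXPECTED pair is just the combination from the pip show output above, so adjust it for your environment. It assumes Python 3.8+ for importlib.metadata.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: the known-good pair reported in this answer, not an official matrix.
EXPECTED = {"transformers": "2.8.0", "tokenizers": "0.5.2"}


def installed_versions(names):
    """Return {package: version string, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # package is absent from this environment
    return found


def find_mismatches(installed, expected=EXPECTED):
    """Return {package: (installed, expected)} for every package that differs."""
    return {
        name: (installed.get(name), wanted)
        for name, wanted in expected.items()
        if installed.get(name) != wanted
    }


if __name__ == "__main__":
    mismatches = find_mismatches(installed_versions(EXPECTED))
    for name, (have, want) in mismatches.items():
        print(f"{name}: installed {have}, expected {want}")
    if not mismatches:
        print("Installed versions match the known-good pair.")
```

If tokenizers is reported as a mismatch, pinning it with pip install tokenizers==0.5.2 restores the combination that worked here.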

huggingface-transformers

huggingface-tokenizers