How to change AllenNLP BERT based Semantic Role Labeling to RoBERTa in AllenNLP
Currently I am able to train a Semantic Role Labeling model using the config file below. This config file is based on the one provided by AllenNLP and works for the default bert-base-uncased model as well as GroNLP/bert-base-dutch-cased.
{
  "dataset_reader": {
    "type": "srl_custom",
    "bert_model_name": "GroNLP/bert-base-dutch-cased"
  },
  "data_loader": {
    "batch_sampler": {
      "type": "bucket",
      "batch_size": 32
    }
  },
  "train_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "validation_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "model": {
    "type": "srl_bert",
    "embedding_dropout": 0.1,
    "bert_model": "GroNLP/bert-base-dutch-cased"
  },
  "trainer": {
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 5e-5,
      "correct_bias": false,
      "weight_decay": 0.01,
      "parameter_groups": [
        [
          ["bias", "LayerNorm.bias", "LayerNorm.weight", "layer_norm.weight"],
          {"weight_decay": 0.0}
        ]
      ]
    },
    "learning_rate_scheduler": {
      "type": "slanted_triangular"
    },
    "checkpointer": {
      "keep_most_recent_by_count": 2
    },
    "grad_norm": 1.0,
    "num_epochs": 3,
    "validation_metric": "+f1-measure-overall"
  }
}
Swapping the values of the bert_model_name and bert_model parameters from GroNLP/bert-base-dutch-cased to roberta-base doesn't work out of the box, because the SRL dataset reader only supports the BertTokenizer and not the RobertaTokenizer. So I changed the config file to the following:
{
  "dataset_reader": {
    "type": "srl_custom",
    "token_indexers": {
      "tokens": {
        "type": "pretrained_transformer",
        "model_name": "roberta-base"
      }
    }
  },
  "data_loader": {
    "batch_sampler": {
      "type": "bucket",
      "batch_size": 32
    }
  },
  "train_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "validation_data_path": "./data/SRL/SONAR_1_SRL/MANUAL500/",
  "model": {
    "type": "srl_bert",
    "embedding_dropout": 0.1,
    "bert_model": "roberta-base"
  },
  "trainer": {
    "optimizer": {
      "type": "huggingface_adamw",
      "lr": 5e-5,
      "correct_bias": false,
      "weight_decay": 0.01,
      "parameter_groups": [
        [
          ["bias", "LayerNorm.bias", "LayerNorm.weight", "layer_norm.weight"],
          {"weight_decay": 0.0}
        ]
      ]
    },
    "learning_rate_scheduler": {
      "type": "slanted_triangular"
    },
    "checkpointer": {
      "keep_most_recent_by_count": 2
    },
    "grad_norm": 1.0,
    "num_epochs": 15,
    "validation_metric": "+f1-measure-overall"
  }
}
However, this still doesn't work. I get the following error:
2022-02-22 16:19:34,122 - INFO - allennlp.training.gradient_descent_trainer - Training
0%| | 0/1546 [00:00<?, ?it/s]2022-02-22 16:19:34,142 - INFO - allennlp.data.samplers.bucket_batch_sampler - No sorting keys given; trying to guess a good one
2022-02-22 16:19:34,142 - INFO - allennlp.data.samplers.bucket_batch_sampler - Using ['tokens'] as the sorting keys
0%| | 0/1546 [00:00<?, ?it/s]
2022-02-22 16:19:34,526 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\Scripts\allennlp.exe\__main__.py", line 7, in <module>
sys.exit(run())
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\__main__.py", line 39, in run
main(prog="allennlp")
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\__init__.py", line 119, in main
args.func(args)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 111, in train_model_from_args
train_model_from_file(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 177, in train_model_from_file
return train_model(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 258, in train_model
model = _train_worker(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 508, in _train_worker
metrics = train_loop.run()
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\commands\train.py", line 581, in run
return self.trainer.train()
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 771, in train
metrics, epoch = self._try_train()
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 793, in _try_train
train_metrics = self._train_epoch(epoch)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 510, in _train_epoch
batch_outputs = self.batch_outputs(batch, for_training=True)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp\training\gradient_descent_trainer.py", line 403, in batch_outputs
output_dict = self._pytorch_model(**batch)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\models\srl_bert.py", line 141, in forward
bert_embeddings, _ = self.bert_model(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\transformers\models\bert\modeling_bert.py", line 989, in forward
embedding_output = self.embeddings(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\transformers\models\bert\modeling_bert.py", line 215, in forward
token_type_embeddings = self.token_type_embeddings(token_type_ids)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\sparse.py", line 156, in forward
return F.embedding(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\torch\nn\functional.py", line 1916, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
I don't fully understand what is going wrong, and I couldn't find any documentation on how to change the config file to load a 'custom' BERT/RoBERTa model (one not mentioned here). I'm running the default allennlp train config.jsonnet command to start training. Running allennlp train config.jsonnet --dry-run, however, doesn't produce any errors.
Thanks in advance!
Thijs
EDIT:
I have now replaced "srl_bert" with a custom "srl_roberta" class that uses RobertaModel. However, this still produces the same error.
EDIT2: I'm now using AutoTokenizer, as suggested by Dirk Groeneveld. It looks like changing the SrlReader class to support RoBERTa-based models involves more changes, such as swapping BERT's wordpiece tokenizer for RoBERTa's BPE tokenizer. Is there an easy way to adapt the SrlReader class, or is it better to write a new RobertaSrlReader from scratch?
I subclassed the SrlReader class and changed this line to the following:
self.bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)
Since RoBERTa tokenization differs from BERT's, it produces the following error:
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\dataset_readers\srl.py", line 255, in text_to_instance
wordpieces, offsets, start_offsets = self._wordpiece_tokenize_input(
File "C:\Users\denbe\AppData\Roaming\Python\Python39\site-packages\allennlp_models\structured_prediction\dataset_readers\srl.py", line 196, in _wordpiece_tokenize_input
word_pieces = self.bert_tokenizer.wordpiece_tokenizer.tokenize(token)
AttributeError: 'RobertaTokenizerFast' object has no attribute 'wordpiece_tokenizer'
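The underlying tokenizer difference is easy to reproduce outside the reader. A quick check (my own snippet, not part of SrlReader) shows that only the slow BertTokenizer exposes the wordpiece_tokenizer attribute that srl.py relies on:

from transformers import AutoTokenizer, BertTokenizer

# The slow BertTokenizer is what SrlReader instantiates; it exposes a
# wordpiece_tokenizer sub-tokenizer.
slow_bert = BertTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")
# AutoTokenizer returns RobertaTokenizerFast for roberta-base, which doesn't.
roberta = AutoTokenizer.from_pretrained("roberta-base")

print(hasattr(slow_bert, "wordpiece_tokenizer"))  # True
print(hasattr(roberta, "wordpiece_tokenizer"))    # False -> the AttributeError above
print(roberta.tokenize("labeling"))               # BPE pieces rather than WordPieces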
The easiest way to fix this is to patch SrlReader so that it uses PretrainedTransformerTokenizer (from AllenNLP) or AutoTokenizer (from Hugging Face) instead of BertTokenizer. SrlReader is an old class, and it was written against an old version of the Hugging Face tokenizer API, so upgrading it is not trivial.
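To make that concrete, here is a minimal sketch of what such a patched reader could look like. The class name RobertaSrlReader, the registered name "srl_roberta", and the per-word re-tokenization are my own assumptions for illustration, not existing AllenNLP code, and it only addresses the tokenization step:

# A minimal sketch of a patched reader; RobertaSrlReader and "srl_roberta" are
# hypothetical names, and this is a suggestion rather than AllenNLP's implementation.
from typing import List, Tuple

from transformers import AutoTokenizer

from allennlp.data import DatasetReader
from allennlp_models.structured_prediction.dataset_readers.srl import SrlReader


@DatasetReader.register("srl_roberta")
class RobertaSrlReader(SrlReader):
    def __init__(self, bert_model_name: str = "roberta-base", **kwargs) -> None:
        # Don't pass bert_model_name to the parent, so it doesn't try to build a
        # BertTokenizer for a RoBERTa checkpoint; build an AutoTokenizer instead.
        super().__init__(**kwargs)
        self.bert_tokenizer = AutoTokenizer.from_pretrained(bert_model_name)

    def _wordpiece_tokenize_input(
        self, tokens: List[str]
    ) -> Tuple[List[str], List[int], List[int]]:
        word_piece_tokens: List[str] = []
        start_offsets: List[int] = []
        end_offsets: List[int] = []
        cumulative = 0
        for token in tokens:
            # tokenize() exists on both slow and fast tokenizers, unlike the
            # wordpiece_tokenizer attribute. RoBERTa's BPE is whitespace-sensitive,
            # so tokenizing words one at a time drops the leading-space marker;
            # acceptable as a first approximation, but worth keeping in mind.
            word_pieces = self.bert_tokenizer.tokenize(token)
            start_offsets.append(cumulative + 1)
            cumulative += len(word_pieces)
            end_offsets.append(cumulative)
            word_piece_tokens.extend(word_pieces)

        # Use the tokenizer's own special tokens instead of hard-coded
        # "[CLS]" / "[SEP]" (RoBERTa uses "<s>" / "</s>").
        wordpieces = (
            [self.bert_tokenizer.cls_token]
            + word_piece_tokens
            + [self.bert_tokenizer.sep_token]
        )
        return wordpieces, end_offsets, start_offsets

The dataset_reader section of the config would then point at the new reader, for example:

"dataset_reader": {
  "type": "srl_roberta",
  "bert_model_name": "roberta-base"
},

One more caveat: srl_bert feeds the verb indicator to the encoder as token_type_ids, and roberta-base only has a single token type embedding, which is most likely what caused the IndexError in your first attempt. So the model side (your custom srl_roberta class) needs a matching change as well before training runs end to end.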
If you want to submit a pull request to the AllenNLP project, I'd be happy to help you get it merged into AllenNLP!