Training CamelBERT model for token classification

I am trying to use a Hugging Face model (CamelBERT) for token classification with the ANERCorp dataset. I fed the ANERCorp training set to the training script, but I get the following error.

Error:

Some weights of the model checkpoint at CAMeL-Lab/bert-base-arabic-camelbert-ca were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at CAMeL-Lab/bert-base-arabic-camelbert-ca and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
03/16/2022 07:31:01 - INFO - utils -   Creating features from dataset file at /content/drive/MyDrive/ANERcorp-CamelLabSplits
03/16/2022 07:31:01 - INFO - utils -   Writing example 0 of 3973
Traceback (most recent call last):
  File "/content/CAMeLBERT/token-classification/run_token_classification.py", line 381, in <module>
    main()
  File "/content/CAMeLBERT/token-classification/run_token_classification.py", line 226, in main
    if training_args.do_train
  File "/content/CAMeLBERT/token-classification/utils.py", line 132, in __init__
    pad_token_label_id=self.pad_token_label_id,
  File "/content/CAMeLBERT/token-classification/utils.py", line 210, in convert_examples_to_features
    label_ids.extend([label_map[label]] +
KeyError: 'B-LOC'

Note: I am using Google Colab to train the model. Code:

DATA_DIR="/content/drive/MyDrive/ANERcorp-CamelLabSplits"
MAX_LENGTH=512
BERT_MODEL="CAMeL-Lab/bert-base-arabic-camelbert-ca"
OUTPUT_DIR="/content/Output"
BATCH_SIZE=32
NUM_EPOCHS=3
SAVE_STEPS=750
SEED=12345

!python /content/CAMeLBERT/token-classification/run_token_classification.py \
--data_dir $DATA_DIR \
--task_type ner \
--labels $DATA_DIR/train.txt \
--model_name_or_path $BERT_MODEL \
--output_dir $OUTPUT_DIR \
--max_seq_length $MAX_LENGTH \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $BATCH_SIZE \
--save_steps $SAVE_STEPS \
--seed $SEED \
--overwrite_output_dir \
--overwrite_cache \
--do_train \
--do_predict

The script you are using loads the labels from $DATA_DIR/train.txt.

See https://github.com/CAMeL-Lab/CAMeLBERT/blob/master/token-classification/run_token_classification.py#L105 for what the model expects.

It then tries to load the label list from that file as its first step (even before loading the training data), see https://github.com/CAMeL-Lab/CAMeLBERT/blob/master/token-classification/run_token_classification.py#L183, and puts it into label_map.
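To make the failing lookup concrete, here is a minimal sketch. It assumes label_map is a plain dict from label string to integer index, which is what the `label_map[label]` line in the traceback suggests. The function names and the label lists are hypothetical and are not taken from the script.

```python
def build_label_map(labels):
    """Map each label string to an integer id, in list order."""
    return {label: i for i, label in enumerate(labels)}


def lookup(label_map, label):
    """Return the id for a label, or None if the label is unknown."""
    try:
        return label_map[label]
    except KeyError:
        # This is the situation reported in the traceback.
        return None


# Hypothetical label lists, for illustration only.
good_map = build_label_map(["O", "B-LOC", "I-LOC"])
empty_map = build_label_map([])

print(lookup(good_map, "B-LOC"))   # found: an integer id
print(lookup(empty_map, "B-LOC"))  # not found: None instead of a KeyError
```

If the label list handed to the script does not contain the exact string 'B-LOC', the first sentence that uses that label will raise the KeyError shown above.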

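But this fails for some reason. My hypothesis is that it is not finding anything and label_map is an empty dict, so the first attempt to look up a label in it fails with a KeyError. Your input data may not exist or may not be at the expected path (check that you have the right files and the right value of $DATA_DIR). In my experience, relative paths in Google Drive can be tricky. Try something simple to see whether it works, for example os.listdir(DATA_DIR) in a Python cell, to check that the directory really contains what you expect.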
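The directory check suggested above could look roughly like this. It is only a sketch: the helper name `check_data_dir` and the default list of expected files are assumptions, and only train.txt is known from the command in the question.

```python
import os


def check_data_dir(data_dir, expected=("train.txt",)):
    """Report whether data_dir exists, what it contains, and what is missing."""
    if not os.path.isdir(data_dir):
        return {"exists": False, "files": [], "missing": list(expected)}
    files = sorted(os.listdir(data_dir))
    missing = [name for name in expected if name not in files]
    return {"exists": True, "files": files, "missing": missing}


if __name__ == "__main__":
    # Path copied from the question; adjust it to your own Drive layout.
    print(check_data_dir("/content/drive/MyDrive/ANERcorp-CamelLabSplits"))
```

If "exists" is False, the Drive is probably not mounted or the path is wrong. If train.txt shows up under "missing", the script has nothing to read its labels from.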

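If that is not the problem, then there may really be something wrong with the labels. Does ANERCorp write its labels in exactly this way (B-LOC etc.)? If it differs (for example B-Location or similar), the lookup would fail as well.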
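One way to answer that question is to list the distinct labels that actually occur in the data. The sketch below assumes a CoNLL-style layout: one token per line, whitespace-separated columns, the label in the last column, and blank lines between sentences. That layout is an assumption about the files, the function name is hypothetical, and the sample lines are made up.

```python
from collections import Counter


def collect_labels(lines):
    """Count the labels found in the last column of CoNLL-style lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank sentence separator or a line without a label
        counts[parts[-1]] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = ["token1 B-LOC", "token2 O", "", "token3 B-LOC"]
print(sorted(collect_labels(sample)))

# For the real data, pass an open file instead, e.g.:
# with open(path_to_train_file, encoding="utf-8") as f:
#     print(sorted(collect_labels(f)))
```

Comparing this set with the label list the script loaded shows directly whether 'B-LOC' is spelled the same way in both places.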