Huggingface Load_dataset() function throws "ValueError: Couldn't cast"

My goal is to train a classifier that can do sentiment analysis in Slovak, using the loaded SlovakBert model and the HuggingFace library. The code is executed on Google Colaboratory.

My test dataset is read from this csv file: https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv

and my training dataset from this csv file: https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv

The data has two columns: a column of Slovak sentences and a second column with labels indicating the sentiment of each sentence. The labels take the values -1, 0, or 1.
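For a quick sanity check before calling load_dataset(), the raw files can be previewed with pandas. This is only a sketch, assuming the files are comma-separated and have no header row (which matches the description above); the column names 'sentence' and 'label' are chosen purely for illustration:

import pandas as pd

#read the test file as a headerless CSV and name the columns ourselves
preview = pd.read_csv(
    'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv',
    header=None,
    names=['sentence', 'label'])

print(preview.head())
print(preview['label'].unique())   #expected to contain -1, 0, 1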

The load_dataset() function throws this error:

ValueError: Couldn't cast Vrtuľník je veľmi zraniteľný pri dobre mierenej streľbe zo zeme. Brániť sa, unikať, alebo vedieť zneškodniť nepriateľa je vecou sekúnd, ak nie stotín, kedy ide život. : string -1: int64 -- schema metadata -- pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 954 to {'Priestorovo a vybavenim OK.': Value(dtype='string', id=None), '1': Value(dtype='int64', id=None)} because column names don't match

Code:

!pip install transformers==4.10.0 -qqq
!pip install datasets -qqq

from re import M
import numpy as np
from datasets import load_metric, load_dataset, Dataset
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import pandas as pd
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

#links to dataset
test = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
train = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv'


model_name = 'gerulata/slovakbert'


#Load data
dataset = load_dataset('csv', data_files={'train': train, 'test': test})

What went wrong when loading the dataset?

The cause is that the delimiter character also appears inside the first column, so the parser could not reliably determine the number of columns on its own (it sometimes split a sentence across several columns because it could not tell whether a character was the delimiter or part of the sentence). In addition, without explicit column names the first row of each CSV is treated as the header, which is why the error complains that the column names of the train and test files don't match.

However, the solution is simple: just add the column names explicitly.

dataset = load_dataset('csv', data_files={'train': train, 'test': test}, column_names=['sentence', 'label'])

Output:

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label'],
        num_rows: 89
    })
    test: Dataset({
        features: ['sentence', 'label'],
        num_rows: 91
    })
})
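With the dataset loading fixed, a rough continuation toward the original goal (fine-tuning gerulata/slovakbert for sentiment classification) could look like the sketch below. It is only an outline: it assumes the -1/0/1 labels are remapped to 0/1/2, since AutoModelForSequenceClassification expects non-negative class indices, and the training arguments (output directory, epochs, batch size) are placeholder values rather than settings from the question.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer, DataCollatorWithPadding)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def preprocess(batch):
    #tokenize the sentences and shift labels -1/0/1 -> 0/1/2
    enc = tokenizer(batch['sentence'], truncation=True)
    enc['labels'] = [label + 1 for label in batch['label']]
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=['sentence', 'label'])

training_args = TrainingArguments(
    output_dir='slovakbert-sentiment',   #placeholder directory
    num_train_epochs=3,                  #placeholder hyperparameters
    per_device_train_batch_size=16)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['test'],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer))

#trainer.train()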