Huggingface load_dataset() function throws "ValueError: Couldn't cast"
My goal is to train a classifier that can do sentiment analysis in Slovak, using the loaded SlovakBert model and the HuggingFace libraries. The code is executed in Google Colaboratory.
My test dataset is read from this csv file:
https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv
The data has two columns: a column of Slovak sentences and a second column with labels describing the sentiment of each sentence. The labels take the values -1, 0 or 1.
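A quick way to sanity-check the raw file from Colab is to read it with pandas (a sketch; it assumes the file parses cleanly with the default comma handling, which the question itself does not confirm):

import pandas as pd

url = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
# header=None: judging by the error below, the file has no header row,
# so the two columns are just the sentence and its label.
df = pd.read_csv(url, header=None, names=['sentence', 'label'])
print(df.head())
print(df['label'].value_counts())  # should only contain -1, 0 and 1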
The load_dataset() function throws this error:
ValueError: Couldn't cast
Vrtuľník je veľmi zraniteľný pri dobre mierenej streľbe zo zeme. Brániť sa, unikať, alebo vedieť zneškodniť nepriateľa je vecou sekúnd, ak nie stotín, kedy ide život. : string
-1: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 954
to
{'Priestorovo a vybavenim OK.': Value(dtype='string', id=None), '1': Value(dtype='int64', id=None)}
because column names don't match
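The cast target in the error is revealing: the "column names" on both sides are actually the first data rows of the two csv files. A small diagnostic (a sketch, reusing the train/test URLs defined in the code below) makes this visible by loading each file on its own:

from datasets import load_dataset

# Each file loads fine in isolation, but its first data row is used as the header,
# so the two splits end up with different "column names"; when both are loaded together,
# the test split apparently cannot be cast to the schema inferred from the train split.
train_only = load_dataset('csv', data_files={'train': train})['train']
test_only = load_dataset('csv', data_files={'test': test})['test']
print(train_only.column_names)  # first row of kinit_golden_accomodation.csv
print(test_only.column_names)   # first row of kinit_golden_games.csv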
Code:
!pip install transformers==4.10.0 -qqq
!pip install datasets -qqq
from re import M
import numpy as np
from datasets import load_metric, load_dataset, Dataset
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import pandas as pd
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
#links to dataset
test = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_games.csv'
train = 'https://raw.githubusercontent.com/kinit-sk/slovakbert-auxiliary/main/sentiment_reviews/kinit_golden_accomodation.csv'
model_name = 'gerulata/slovakbert'
#Load data
dataset = load_dataset('csv', data_files={'train': train, 'test': test})
What is going wrong when loading the dataset?
The underlying cause is that the csv files have no header row, so load_dataset() takes the first data row of each file as its column names; the separator also appears repeatedly inside the sentence column, so the number of columns cannot be inferred reliably (a sentence can be split across several columns when the parser cannot tell whether a comma is a delimiter or part of the sentence). Since the first rows of the train and test files differ, the two inferred schemas cannot be cast to each other, which is exactly what the error message shows.
The solution, however, is simple: just pass the column names explicitly:
dataset = load_dataset('csv', data_files={'train': train, 'test': test}, column_names=['sentence', 'label'])
Output:
DatasetDict({
train: Dataset({
features: ['sentence', 'label'],
num_rows: 89
})
test: Dataset({
features: ['sentence', 'label'],
num_rows: 91
})
})
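From here, a possible next step toward the stated goal is to tokenize the sentences and move the labels into the 0..2 range that AutoModelForSequenceClassification expects (a sketch only; the label remapping and preprocessing choices are assumptions, not part of the original question):

from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def preprocess(batch):
    enc = tokenizer(batch['sentence'], truncation=True)
    enc['labels'] = [label + 1 for label in batch['label']]  # map -1/0/1 -> 0/1/2
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=['sentence', 'label'])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads each batch at training time

The shifted ids can be mapped back (0/1/2 -> -1/0/1) when reporting predictions.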