Torchtext TabularDataset() reads in Datafields incorrectly
Goal: I want to build a text classifier based on my custom dataset, similar to (and following) this (now deleted) tutorial from mlexplained.
What happened
I successfully formatted my data, created a training, validation and test dataset, and formatted them to match the "toxic tweet" dataset they use (one column per tag, containing 1/0 for whether the tag applies). Most of the other parts work as expected too, but I get an error when iterating:
The `device` argument should be set by using `torch.device` or passing a string as an argument.
This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.
0%| | 0/25517 [00:01<?, ?it/s]
Traceback (most recent call last):
... (trace back messages)
AttributeError: 'Example' object has no attribute 'text'
The line the traceback points to:
opt = optim.Adam(model.parameters(), lr=1e-2)
loss_func = nn.BCEWithLogitsLoss()
epochs = 2

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train()  # turn on training mode
    for x, y in tqdm.tqdm(train_dl):  # **THIS LINE CONTAINS THE ERROR**
        opt.zero_grad()
        preds = model(x)
        loss = loss_func(y, preds)
        loss.backward()
        opt.step()
        running_loss += loss.data[0] * x.size(0)
    epoch_loss = running_loss / len(trn)

    # calculate the validation loss for this epoch
    val_loss = 0.0
    model.eval()  # turn on evaluation mode
    for x, y in valid_dl:
        preds = model(x)
        loss = loss_func(y, preds)
        val_loss += loss.data[0] * x.size(0)
    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {:.4f}, Validation Loss: {:.4f}'.format(epoch, epoch_loss, val_loss))
What I already tried to fix the problem, and what I think the reason is:
I know this problem has happened to others; there are even two questions about it on here, both of which had problems with columns or lines being skipped in the dataset (I checked for empty lines/columns and found none). Another suggested solution was that the fields passed in had to be in the same order as in the .csv file (none are missing).
However, here is the relevant code (loading and creation of the tst, trn and vld sets):
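For what it's worth, the empty-line/column check mentioned above can be sketched with the standard csv module (the file name and delimiter here are placeholders, not my actual data):

```python
import csv

def find_empty(path, delimiter=','):
    """Return (row indices that are entirely empty,
               row indices that contain at least one empty cell)."""
    empty_rows, empty_cells = [], []
    with open(path, newline='') as f:
        for i, row in enumerate(csv.reader(f, delimiter=delimiter)):
            if not any(cell.strip() for cell in row):
                # every cell is blank -> whole row is empty
                empty_rows.append(i)
            elif any(not cell.strip() for cell in row):
                # at least one blank cell in an otherwise filled row
                empty_cells.append(i)
    return empty_rows, empty_cells
```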
def createTestTrain():
    # Create a Tokenizer
    tokenize = lambda x: x.split()
    # Defining Tag and Text
    TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
    LABEL = Field(sequential=False, use_vocab=False)
    # Our Datafield
    tv_datafields = [("ID", None),
                     ("text", TEXT)]
    # Loading our Additional columns we added earlier
    with open(PATH + 'columnList.pickle', 'rb') as handle:
        addColumns = pickle.load(handle)
    # Adding the extra columns, no way we are defining 1000 tags by hand
    for column in addColumns:
        tv_datafields.append((column, LABEL))
    #tv_datafields.append(("split", None))
    # Loading Train/Test Split we created
    trn = TabularDataset(
        path=PATH + 'train.csv',
        format='csv',
        skip_header=True,
        fields=tv_datafields)
    vld = TabularDataset(
        path=PATH + 'train.csv',
        format='csv',
        skip_header=True,
        fields=tv_datafields)
    # Creating Test Datafield
    tst_datafields = [("id", None),
                      ("text", TEXT)]
    # Using TabularDataset, as we want to Analyse Text on it
    tst = TabularDataset(
        path=PATH + "test.csv",  # the file path
        format='csv',
        skip_header=True,
        fields=tst_datafields)
    return trn, vld, tst
It uses the same list and order as my csv. tv_datafields is structured exactly like the file. Furthermore, since an Example object is just a dict holding the data points, I read out the keys of that dict, like the tutorial does, via:
trn[0].__dict__.keys()
What should happen:
The Example in the tutorial behaves like this:
trn[0]
torchtext.data.example.Example at 0x10d3ed3c8
trn[0].__dict__.keys()
dict_keys(['comment_text', 'toxic', 'severe_toxic', 'threat', 'obscene', 'insult', 'identity_hate'])
My result:
trn[0].__dict__.keys()
Out[19]: dict_keys([])
trn[1].__dict__.keys()
Out[20]: dict_keys([])
trn[2].__dict__.keys()
Out[21]: dict_keys([])
trn[3].__dict__.keys()
Out[22]: dict_keys(['text'])
While it is true that trn[0] contains nothing, this pattern continues from index 3 to 15, and the number of columns there should normally be is far greater than that.
Now I am at a loss as to what I did wrong. The data fits, the function apparently works, yet TabularDataset() seems to read in my columns the wrong way (if at all). Did I define
# Defining Tag and Text
TEXT = Field(sequential=True, tokenize=tokenize, lower=True)
LABEL = Field(sequential=False, use_vocab=False)
the wrong way? At least that is what my debugging seems to indicate.
Since the documentation on Torchtext is sparse, I had a hard time finding things, but when I look at the definitions of Data or Fields I cannot see what would be wrong with them.
Thanks for your help.
I found where my problem was: apparently Torchtext only accepts data in quotes and only "," as a delimiter. My data was not inside quotes and used ";" as a delimiter.
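One way around this (a sketch; the file names are placeholders) is to rewrite the file with Python's csv module so it uses "," as the delimiter and quotes every field before handing it to TabularDataset, which then parses it with Python's default csv dialect:

```python
import csv

def convert_csv(src, dst, src_delimiter=';'):
    """Rewrite a delimited file as a standard comma-separated, quoted CSV."""
    with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
        reader = csv.reader(fin, delimiter=src_delimiter)
        # QUOTE_ALL wraps every field in quotes, matching what torchtext expects
        writer = csv.writer(fout, delimiter=',', quoting=csv.QUOTE_ALL)
        for row in reader:
            writer.writerow(row)
```

Depending on the torchtext version, TabularDataset also accepts a csv_reader_params dict that is forwarded to csv.reader (e.g. csv_reader_params={'delimiter': ';'}), which avoids rewriting the file altogether.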