PyTorch DataLoader shows odd behavior with string dataset
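I am working on an NLP problem using PyTorch. For some reason, my DataLoader is returning malformed batches. My input data consists of sentences and integer labels.
The sentences can be given either as a list of sentences or as a list of lists of tokens. I convert the tokens to integers later, in a downstream component.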
list_labels = [0, 1, 0]
# List of sentences.
list_sentences = ['the movie is terrible',
                  'The Film was great.',
                  'It was just awful.']
# Or list of lists of tokens.
list_sentences = [['the', 'movie', 'is', 'terrible'],
                  ['The', 'Film', 'was', 'great.'],
                  ['It', 'was', 'just', 'awful.']]
I created the following custom dataset:
import torch
from torch.utils.data import DataLoader, Dataset

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels

    def __getitem__(self, i):
        result = {}
        result['sentences'] = self.sentences[i]
        result['label'] = self.labels[i]
        return result

    def __len__(self):
        return len(self.labels)
When I provide the input as a list of sentences, the DataLoader correctly returns batches of complete sentences. Note that batch_size=2:
list_sentences = [ 'the movie is terrible', 'The Film was great.', 'It was just awful.']
list_labels = [ 0, 1, 0]
dataset = MyDataset(list_sentences, list_labels)
dataloader = DataLoader(dataset, batch_size=2)
batch = next(iter(dataloader))
print(batch)
# {'sentences': ['the movie is terrible', 'The Film was great.'], <-- Great! 2 sentences in batch!
# 'label': tensor([0, 1])}
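The batch correctly contains two sentences and two labels because batch_size=2.
However, when I pass the sentences in as pre-tokenized lists of tokens, I get a strange result: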
list_sentences = [['the', 'movie', 'is', 'terrible'], ['The', 'Film', 'was', 'great.'], ['It', 'was', 'just', 'awful.']]
list_labels = [ 0, 1, 0]
dataset = MyDataset(list_sentences, list_labels)
dataloader = DataLoader(dataset, batch_size=2)
batch = next(iter(dataloader))
print(batch)
# {'sentences': [('the', 'The'), ('movie', 'Film'), ('is', 'was'), ('terrible', 'great.')], <-- WHAT?
# 'label': tensor([0, 1])}
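Note that sentences in this batch is a single list of tuples of word pairs. I expected sentences to be a list of two lists, like this:
{'sentences': [['the', 'movie', 'is', 'terrible'], ['The', 'Film', 'was', 'great.']],
 'label': tensor([0, 1])}
What is going on here?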
This behavior occurs because the default collate_fn does the following when it has to collate lists (which is the case for ['sentences']):
    # [...]
    elif isinstance(elem, container_abcs.Sequence):
        # check to make sure that the elements in batch have consistent size
        it = iter(batch)
        elem_size = len(next(it))
        if not all(len(elem) == elem_size for elem in it):
            raise RuntimeError('each element in list of batch should be of equal size')
        transposed = zip(*batch)
        return [default_collate(samples) for samples in transposed]
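The "problem" arises because, in the last two lines, it calls zip(*batch) and then recurses on each resulting group whenever the batch elements are a container_abcs.Sequence (which a list is), and zip transposes its inputs.
As you can see: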
batch = [['the', 'movie', 'is', 'terrible'], ['The', 'Film', 'was', 'great.']]
list(zip(*batch))
# [('the', 'The'), ('movie', 'Film'), ('is', 'was'), ('terrible', 'great.')]
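To make the difference between the two inputs concrete, here is a simplified, pure-Python stand-in for the collate logic described above. The name toy_collate is hypothetical and this is not the library's code: it only handles strings, dicts, sequences, and plain values, and it assumes (as in the real function) that strings are checked before the generic sequence branch.

```python
from collections.abc import Mapping, Sequence


def toy_collate(batch):
    """Simplified stand-in for the default collate logic (illustration only).

    The real default_collate also handles tensors, NumPy arrays, numbers,
    namedtuples, and more.
    """
    elem = batch[0]
    if isinstance(elem, (str, bytes)):
        # Strings are matched before the Sequence branch,
        # so a batch of strings is returned unchanged.
        return batch
    if isinstance(elem, Mapping):
        # Dicts are collated key by key.
        return {key: toy_collate([d[key] for d in batch]) for key in elem}
    if isinstance(elem, Sequence):
        # Lists are transposed with zip, then each column is collated.
        return [toy_collate(column) for column in zip(*batch)]
    # Anything else (e.g. int labels) stays a plain list here;
    # the real function would build a tensor instead.
    return list(batch)


whole = toy_collate([{"sentences": "the movie is terrible", "label": 0},
                     {"sentences": "The Film was great.", "label": 1}])
tokens = toy_collate([{"sentences": ["the", "movie"], "label": 0},
                      {"sentences": ["The", "Film"], "label": 1}])
print(whole["sentences"])   # two whole sentences
print(tokens["sentences"])  # word pairs, one tuple per token position
```

A whole sentence is a single str, so it never reaches the zip step; a token list is a Sequence of str, so it is transposed first and only then hits the string branch.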
I don't see a workaround for your case other than implementing a new collator and passing it to DataLoader(..., collate_fn=mycollator). For example, a simple, ugly one could be:
def mycollator(batch):
    assert all('sentences' in x for x in batch)
    assert all('label' in x for x in batch)
    return {
        'sentences': [x['sentences'] for x in batch],
        'label': torch.tensor([x['label'] for x in batch])
    }
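As a quick check, the sketch below wires such a collator into a DataLoader using the data from the question. TokenDataset is a hypothetical name for a condensed copy of the question's dataset class, included only so the snippet runs on its own.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TokenDataset(Dataset):
    """Condensed copy of the question's dataset: one dict per sample."""

    def __init__(self, sentences, labels):
        self.sentences = sentences
        self.labels = labels

    def __getitem__(self, i):
        return {"sentences": self.sentences[i], "label": self.labels[i]}

    def __len__(self):
        return len(self.labels)


def mycollator(batch):
    # Keep each token list intact instead of transposing it.
    return {
        "sentences": [x["sentences"] for x in batch],
        "label": torch.tensor([x["label"] for x in batch]),
    }


list_sentences = [["the", "movie", "is", "terrible"],
                  ["The", "Film", "was", "great."],
                  ["It", "was", "just", "awful."]]
list_labels = [0, 1, 0]

dataloader = DataLoader(TokenDataset(list_sentences, list_labels),
                        batch_size=2, collate_fn=mycollator)
batch = next(iter(dataloader))
print(batch["sentences"])  # one token list per sample
print(batch["label"])      # labels stacked into a tensor
```

Because this collator never zips the token lists together, it also does not apply the equal-size check quoted earlier, so sentences with different token counts should pass through unchanged.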
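Another solution is to encode the strings as bytes in your Dataset and then decode them in the forward pass. This is useful if you want to include strings as metadata (for example, the file path the data came from) but don't actually need to pass that data into your model.
For example: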
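class MyDataset(torch.utils.data.Dataset):
    def __getitem__(self, i):
        # Return the sentence as bytes rather than str.
        return "this is a sentence".encode("ascii")
    def __len__(self):
        return 1  # placeholder length for this one-sentence example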
Then in your forward pass you would do:
from typing import List

sentences: List[str] = []
for sentence in batch:
    sentences.append(sentence.decode("ascii"))
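The encode/decode round trip can be sketched without PyTorch. The helper names and file paths below are made up for illustration, and the sketch uses UTF-8 instead of ASCII on the assumption that metadata such as file paths may contain non-ASCII characters.

```python
from typing import List


def encode_strings(strings: List[str]) -> List[bytes]:
    # UTF-8 can represent any str, unlike ASCII.
    return [s.encode("utf-8") for s in strings]


def decode_strings(items: List[bytes]) -> List[str]:
    # Must use the same codec that was used for encoding.
    return [b.decode("utf-8") for b in items]


paths = ["data/train/001.txt", "data/café/002.txt"]  # made-up example paths
encoded = encode_strings(paths)
decoded = decode_strings(encoded)
print(decoded == paths)

# The ASCII codec rejects the accented character in the second path.
try:
    paths[1].encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

Whichever codec is chosen, the Dataset and the decoding side have to agree on it.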