Running through a dataloader in Pytorch using Google Colab
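I am trying to run classification on a cats-and-dogs image dataset with PyTorch. So far my code downloads the data and goes into the folder train, which contains two folders named "cats" and "dogs". I then try to load this data into a data loader and iterate over it in batches, but the iteration step gives me an error I do not understand.
Because this is Google Colab, I have included the code that downloads the data and installs the libraries. Any other suggestions about my code so far would also be much appreciated.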
!pip install torch
!pip install torchvision
from __future__ import print_function, division
import os
import torch
import pandas as pd
import numpy as np
# For showing and formatting images
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# For importing datasets into pytorch
import torchvision.datasets as dataset
# Used for dataloaders
import torch.utils.data as data
# For pretrained resnet34 model
import torchvision.models as models
# For optimisation function
import torch.nn as nn
import torch.optim as optim
!wget http://files.fast.ai/data/dogscats.zip
!unzip dogscats.zip
batch_size = 256
train_raw = dataset.ImageFolder(PATH+"train", transform=transforms.ToTensor())
train_loader = data.DataLoader(train_raw, batch_size=batch_size, shuffle=True)
for batch_idx, (data, target) in enumerate(train_loader):
    print("Data: ", batch_idx)
The error occurs in the last few lines and looks like this:
RuntimeErrorTraceback (most recent call last)
<ipython-input-66-c32dd0c1b880> in <module>()
----> 1 for batch_idx, (data, target) in enumerate(train_loader):
2 print("Data: ", batch_idx)
3
/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.pyc in __next__(self)
257 if self.num_workers == 0: # same-process loading
258 indices = next(self.sample_iter) # may raise StopIteration
--> 259 batch = self.collate_fn([self.dataset[i] for i in indices])
260 if self.pin_memory:
261 batch = pin_memory_batch(batch)
/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.pyc in default_collate(batch)
133 elif isinstance(batch[0], collections.Sequence):
134 transposed = zip(*batch)
--> 135 return [default_collate(samples) for samples in transposed]
136
137 raise TypeError((error_msg.format(type(batch[0]))))
/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.pyc in default_collate(batch)
110 storage = batch[0].storage()._new_shared(numel)
111 out = batch[0].new(storage)
--> 112 return torch.stack(batch, 0, out=out)
113 elif elem_type.__module__ == 'numpy' and elem_type.__name__ != 'str_' \
114 and elem_type.__name__ != 'string_':
/usr/local/lib/python2.7/dist-packages/torch/functional.pyc in stack(sequence, dim, out)
62 inputs = [t.unsqueeze(dim) for t in sequence]
63 if out is None:
---> 64 return torch.cat(inputs, dim)
65 else:
66 return torch.cat(inputs, dim, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 400 and 487 in dimension 2 at /pytorch/torch/lib/TH/generic/THTensorMath.c:2897
Thanks
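I see two problems in your code. First, you import torch.utils.data as data and then reuse the name data as the loop variable when iterating over the data loader, which replaces the module. Please keep imported modules and your own variable names separate. Second, I think the error may occur because the data returned by the dataloader (the images) and the labels differ in size. As you can see, the error is raised during concatenation, because the first dimension, i.e. the number of labels and the number of images in the folder, does not match. Hope this helps.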
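The name clash described above can be reproduced with a small sketch. The toy dataset here is hypothetical and only stands in for the image folder; the point is that the loop variable rebinds the alias `data`, so the module is no longer reachable under that name afterwards.

```python
import torch
import torch.utils.data as data  # module alias, as in the question

# Hypothetical toy dataset: 4 samples with 3 features each, plus 4 labels.
toy_dataset = data.TensorDataset(torch.zeros(4, 3), torch.zeros(4))
loader = data.DataLoader(toy_dataset, batch_size=2)

print(type(data).__name__)  # at this point the alias still names the module

for data, target in loader:  # rebinding `data` hides the module alias
    pass

# After the loop, `data` is the last batch tensor, not torch.utils.data.
alias_is_shadowed = isinstance(data, torch.Tensor)
print(alias_is_shadowed)
```

Importing the class directly (from torch.utils.data import DataLoader) or naming the loop variable something like images avoids the clash.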
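I think I was wrong in my comment to Manoj Acharya: the problem comes from passing batch_size to the data loader. I read the following source, and it seems you cannot batch images of different sizes: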
https://medium.com/@yvanscher/pytorch-tip-yielding-image-sizes-6a776eb4115b
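A minimal sketch of why mixed sizes cannot be batched, assuming the default collate step stacks the per-sample tensors with torch.stack as the traceback suggests. The tensor shapes are made up for illustration and are not taken from the dataset.

```python
import torch

# Hypothetical image tensors (channels, height, width) with different heights.
img_a = torch.zeros(3, 400, 500)
img_b = torch.zeros(3, 487, 500)

try:
    torch.stack([img_a, img_b], 0)
    stack_failed = False
except RuntimeError as err:
    # Stacking requires every tensor to have exactly the same shape.
    stack_failed = True
    print("stack failed:", err)

# Cutting both tensors to a common size (here the top-left 224x224 patch)
# gives equal shapes, so stacking into one batch works.
size = 224
batch = torch.stack([img_a[:, :size, :size], img_b[:, :size, :size]], 0)
print(batch.shape)
```

With batch_size=1 there is only one tensor per batch, so no shape mismatch can occur, which is consistent with the program no longer failing in that setting.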
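So, after renaming the data variable in my code as Manoj pointed out, I changed batch_size to 1 and the program stopped failing. I still want to process the data in batches, though, so I added a further transform, CenterCrop(), to bring all images to the same size. Here is my new code: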
!pip install torch
!pip install torchvision
from __future__ import print_function, division
import os
import torch
import pandas as pd
import numpy as np
# For showing and formatting images
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
# For importing datasets into pytorch
import torchvision.datasets as dataset
# Used for dataloaders
from torch.utils.data import DataLoader
# For pretrained resnet34 model
import torchvision.models as models
# For optimisation function
import torch.nn as nn
import torch.optim as optim
# For turning data into tensors
import torchvision.transforms as transforms
!wget http://files.fast.ai/data/dogscats.zip
!unzip dogscats.zip
batch_size = 256
sz = 224
train_raw = dataset.ImageFolder(PATH+"train", transform=transforms.Compose([transforms.CenterCrop(sz),transforms.ToTensor()]))
train_loader = DataLoader(train_raw,batch_size=batch_size, shuffle=True)
for batch_idx, (data, target) in enumerate(train_loader):
    print("Data: ", batch_idx)
Thanks
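I think the main problem is that the images have different sizes. I may be misunderstanding ImageFolder, but I believe that if the directory is structured the way PyTorch specifies, you do not need to supply image labels; PyTorch works out the labels for you from the folder names.
I would also add more to your transforms so that every image in the folder is resized automatically, for example: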
normalize = transforms.Normalize(
    mean=[0.485, 0.456, 0.406],
    std=[0.229, 0.224, 0.225]
)
# Resize the PIL image first, then convert it to a tensor and normalize
transform = transforms.Compose(
    [transforms.Resize((224, 224)), transforms.ToTensor(),
     normalize])
There are other tricks for making your DataLoader faster, such as setting batch_size and the number of CPU workers, for example:
testloader = DataLoader(testset, batch_size=16,
                        shuffle=False, num_workers=4)
I think this will make your pipeline faster.
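A small sketch of how these parameters shape the iteration, using a hypothetical in-memory testset and a hypothetical helper make_loader. The helper defaults to num_workers=0, which loads batches in the main process; raising it starts worker processes, and whether that helps depends on the machine and on how expensive each sample is to load.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size=16, num_workers=0):
    """Hypothetical helper: build a non-shuffling loader for evaluation."""
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=False, num_workers=num_workers)


# Hypothetical test set: 40 tiny "images" with integer labels.
testset = TensorDataset(torch.zeros(40, 3, 8, 8),
                        torch.zeros(40, dtype=torch.long))
testloader = make_loader(testset, batch_size=16, num_workers=0)

# 40 samples in batches of 16: the last batch holds the remainder.
batch_sizes = [images.shape[0] for images, labels in testloader]
print(batch_sizes)
print(len(testloader) == math.ceil(40 / 16))
```

On Colab, trying a few values such as 2 or 4 for num_workers and timing one pass over the loader is a reasonable way to pick a setting.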