How am I incorrectly using SubsetRandomSampler?

Question

I have a custom dataset: rcvdataset = rcvLSTMDataSet('foo.csv', 'foolabels.csv')

I have also defined the following:

```
import numpy as np
import torch
from torch.utils.data import SubsetRandomSampler

batch_size = 50
validation_split = .2
shuffle_rcvdataset = True
random_seed = 42

rcvdataset_size = len(rcvdataset)
indices = list(range(rcvdataset_size))
split = int(np.floor(validation_split * rcvdataset_size))
if shuffle_rcvdataset :
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]


train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(val_indices)

train_loader = torch.utils.data.DataLoader(rcvdataset, batch_size=batch_size, 
                                           sampler=train_sampler)
test_loader = torch.utils.data.DataLoader(rcvdataset, batch_size=batch_size,
                                                sampler=test_sampler)
```

And I use this training function:

```
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
```

But when I try to run it, I get:

    Epoch 1
    -------------------------------
    Traceback (most recent call last):
      File "lstmTrainer.py", line 94, in <module>
        train(train_sampler, model, loss_fn, optimizer)
      File "lstmTrainer.py", line 58, in train
        size = len(dataloader.dataset)
    AttributeError: 'SubsetRandomSampler' object has no attribute 'dataset'

If I instead load the dataset indirectly, through the loader:

    train(train_loader, model, loss_fn, optimizer)

it tells me:

    TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'pandas.core.series.Series'>

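I am completely unclear on what the first error means. Is the second error trying to tell me that something in the dataset is not a tensor?

Thanks.

As requested, here is rcvDataSet.py: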

from __future__ import print_function, division
import os
import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader
class rcvLSTMDataSet(Dataset):
    """rcv dataset."""
    TIMESTEPS = 10

    def __init__(self, csv_data_file, annotations_file):
        """
        Args:
            csv_data_file (string): Path to the csv file with the training data
            annotations_file (string): Path to the file with the annotations
            
        """
        
        self.csv_data_file = csv_data_file
        self.annotations_file = annotations_file
        self.labels = pd.read_csv(annotations_file)
        self.data = pd.read_csv(csv_data_file)
        

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        """
        pytorch expects whatever data is returned is in the form of a tensor.  Included, it expects the label for the data.
        Together, they make a tuple.          
        """
        
        # convert every ten indexes and label into one observation
        Observation = []
        counter = 0
        start_pos = self.TIMESTEPS *idx
        avg_1 = 0
        avg_2 = 0
        avg_3 = 0
        while counter < self.TIMESTEPS:
            Observation.append(self.data.iloc[idx + counter])            
            avg_1 += self.labels.iloc[idx + counter][2]
            avg_2 += self.labels.iloc[idx + counter][1]
            avg_3 += self.labels.iloc[idx + counter][0]
            counter += 1        
        
        avg_1 = avg_1 / self.TIMESTEPS
        avg_2 = avg_2 / self.TIMESTEPS
        avg_3 = avg_3 / self.TIMESTEPS
        current_labels = [avg_1, avg_2, avg_3]
        print(current_labels)
        return Observation, current_labels
        

def main():
        loader = rcvLSTMDataSet('foo1.csv','foo2.csv')
        j = 0
        while j < len(loader.data % loader.TIMESTEPS):
            print(loader.__getitem__(j))
            j += 1

if "__main__" == __name__:
    main()

Answer 1

Cause: if you look at the error message, you will see that you are calling the train function like this:

train(train_sampler, model, loss_fn, optimizer)

This is incorrect. When calling train(), you should pass train_loader, not train_sampler.

Solution: change the call to:

train(train_loader, model, loss_fn, optimizer)

Error message:

    Epoch 1
    -------------------------------
    Traceback (most recent call last):
      File "lstmTrainer.py", line 94, in <module>
        train(train_sampler, model, loss_fn, optimizer)   <------ look here 
      File "lstmTrainer.py", line 58, in train
        size = len(dataloader.dataset)
    AttributeError: 'SubsetRandomSampler' object has no attribute 'dataset'
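
To make the difference concrete, here is a minimal sketch using a toy TensorDataset (an assumption for illustration, not the dataset from the question). It shows that a sampler only yields indices and has no `dataset` attribute, while the DataLoader exposes both `dataset` and `sampler`. It also shows a side effect worth knowing: with a subset sampler, `len(loader.dataset)` is the size of the full dataset, so a progress counter based on it overstates the number of training samples; `len(loader.sampler)` gives the subset size.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy stand-in for the real dataset: 10 samples, 3 features, 1 target each.
dataset = TensorDataset(torch.randn(10, 3), torch.randn(10, 1))

# The sampler only knows about indices, not about the data itself.
sampler = SubsetRandomSampler(list(range(8)))
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

has_dataset_attr = hasattr(sampler, "dataset")  # the cause of the AttributeError
full_size = len(loader.dataset)                 # size of the whole dataset
subset_size = len(loader.sampler)               # number of samples actually drawn

# Iterating the sampler yields indices; iterating the loader yields batches.
sampled_indices = sorted(sampler)
batch_sizes = [len(X) for X, y in loader]

print(has_dataset_attr, full_size, subset_size, batch_sizes)
```

Inside train(), replacing `len(dataloader.dataset)` with `len(dataloader.sampler)` would make the printed total match the training subset.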

Second error message:

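If you look at your dataset class rcvLSTMDataSet, you will see that every item appended to the Observation list is a pandas.core.series.Series, not a plain number or array, because self.data.iloc[idx + counter] returns an entire row of the CSV file with all of its columns. default_collate does not know how to batch Series objects, which is exactly what the TypeError says. Use .iloc[...].values instead of .iloc[...]. The list will then hold NumPy arrays (numeric ones, provided the CSV columns are numeric), and these can be converted to tensors without error.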
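
As a rough sketch of that idea, the class below converts the DataFrames to float32 tensors once in `__init__`, so `__getitem__` never returns pandas objects. The name `WindowDataset` is hypothetical, and several things are assumptions rather than facts from the question: all CSV columns are numeric, each sample is a non-overlapping window of TIMESTEPS rows (as the unused `start_pos` in the original suggests), and the averaged labels stay in file column order (the original reverses them). It takes in-memory DataFrames instead of file paths so that it can be run on its own.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class WindowDataset(Dataset):
    """Hypothetical stand-in for rcvLSTMDataSet that returns tensors."""

    TIMESTEPS = 10

    def __init__(self, data: pd.DataFrame, labels: pd.DataFrame):
        # Leave pandas behind once, up front (assumes numeric columns).
        self.data = torch.tensor(data.to_numpy(), dtype=torch.float32)
        self.labels = torch.tensor(labels.to_numpy(), dtype=torch.float32)

    def __len__(self):
        # Number of complete, non-overlapping windows.
        return len(self.data) // self.TIMESTEPS

    def __getitem__(self, idx):
        start = idx * self.TIMESTEPS
        stop = start + self.TIMESTEPS
        observation = self.data[start:stop]            # (TIMESTEPS, n_features)
        target = self.labels[start:stop].mean(dim=0)   # (n_labels,)
        return observation, target


# Toy data: 30 rows with 4 feature columns and 3 label columns.
data = pd.DataFrame(np.arange(120, dtype=float).reshape(30, 4))
labels = pd.DataFrame(np.ones((30, 3)))

dataset = WindowDataset(data, labels)
X, y = next(iter(DataLoader(dataset, batch_size=2)))
print(X.shape, y.shape)
```

Returning one stacked tensor per sample, rather than a Python list of per-row arrays, matters here: default_collate batches a list element by element, which would give a list of TIMESTEPS tensors instead of a single (batch, TIMESTEPS, n_features) tensor.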

Final remarks:

You can read more about DataLoader and Samplers in the PyTorch documentation. I have summarized some key points here:

Samplers are used to specify the sequence of indices/keys used in data loading.

Data loader combines a dataset and a sampler, and provides an iterable over the given dataset.

PyTorch provides two data primitives: torch.utils.data.DataLoader and torch.utils.data.Dataset that allow you to use pre-loaded datasets as well as your own data. Dataset stores the samples and their corresponding labels, and DataLoader wraps an iterable around the Dataset to enable easy access to the samples.
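
The points above can be illustrated with a small, self-contained sketch. It mirrors the 80/20 split from the question but runs on a toy TensorDataset whose labels equal the sample indices (an assumption made purely so the result is easy to check), and it shuffles with NumPy's `default_rng` instead of `np.random.seed`. One dataset is shared by two loaders, and the samplers decide which indices each loader visits.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy dataset: 100 samples; each label equals the index of its sample.
features = torch.arange(100, dtype=torch.float32).unsqueeze(1)
targets = torch.arange(100)
dataset = TensorDataset(features, targets)

# Shuffle the indices once, then carve off 20% for validation.
indices = np.random.default_rng(42).permutation(len(dataset)).tolist()
split = int(0.2 * len(dataset))
train_sampler = SubsetRandomSampler(indices[split:])
val_sampler = SubsetRandomSampler(indices[:split])

# Same dataset, two loaders; each sampler restricts what its loader sees.
train_loader = DataLoader(dataset, batch_size=50, sampler=train_sampler)
val_loader = DataLoader(dataset, batch_size=50, sampler=val_sampler)

seen_train = torch.cat([y for _, y in train_loader]).tolist()
seen_val = torch.cat([y for _, y in val_loader]).tolist()

print(len(seen_train), len(seen_val), set(seen_train).isdisjoint(seen_val))
```

Note that the loaders, not the samplers, are what get passed to a training or evaluation function.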

pytorch

tensor