PyTorch infinite loop in the training and validation step

The Dataset and DataLoader parts are fine; I reused them from another project I built. But this part of my code runs into an infinite loop:

import numpy as np
import torch

def train(train_loader, MLP, epoch, criterion, optimizer):
    MLP.train()
    epoch_loss = []

    for batch in train_loader:
        optimizer.zero_grad()
        sample, label = batch

        # Forward
        pred = MLP(sample)
        loss = criterion(pred, label)
        epoch_loss.append(loss.item())  # .item() extracts the scalar; avoids the deprecated .data

        # Backward
        loss.backward()
        optimizer.step()

    epoch_loss = np.asarray(epoch_loss)
    print('Epoch: {}, Loss: {:.4f} +/- {:.4f}'.format(
        epoch + 1, epoch_loss.mean(), epoch_loss.std()))


def test(test_loader, MLP, epoch, criterion):
    MLP.eval()
    with torch.no_grad():
        epoch_loss = []

        for batch in test_loader:  # iterate the test set here, not train_loader
            sample, label = batch

            # Forward
            pred = MLP(sample)
            loss = criterion(pred, label)
            epoch_loss.append(loss.item())

        epoch_loss = np.asarray(epoch_loss)
        print('Epoch: {}, Loss: {:.4f} +/- {:.4f}'.format(
            epoch + 1, epoch_loss.mean(), epoch_loss.std()))

Then I use them to iterate over the epochs:

for epoch in range(args['num_epochs']):
    train(train_loader, MLP, epoch, criterion, optimizer)
    test(test_loader, MLP, epoch, criterion)
    print('-----------------------')

Since not even the first loss value gets printed, I suspect a logic error in the training function, but I can't see where it is.

Edit: here is my MLP class; the problem could also be there:

from torch import nn

class BikeRegressor(nn.Module):

    def __init__(self, input_size, hidden_size, out_size):
        super(BikeRegressor, self).__init__()

        self.features = nn.Sequential(nn.Linear(input_size, hidden_size),
                                      nn.ReLU(),
                                      nn.Linear(hidden_size, hidden_size),
                                      nn.ReLU())

        self.out = nn.Sequential(nn.Linear(hidden_size, out_size),
                                 nn.ReLU())

    def forward(self, X):
        hidden = self.features(X)
        output = self.out(hidden)
        return output

Edit 2: Dataset and DataLoader:

from torch.utils.data import Dataset, DataLoader

class Bikes(Dataset):
    def __init__(self, data):  # data is a pandas DataFrame
        self.datas = data.to_numpy()

    def __getitem__(self, idx):
        sample = self.datas[idx][2:14]  # feature columns
        label = self.datas[idx][-1:]    # target is the last column

        sample = torch.from_numpy(sample.astype(np.float32))
        label = torch.from_numpy(label.astype(np.float32))

        return sample, label

    def __len__(self):
        return len(self.datas)


train_set = Bikes(ds_train)
test_set = Bikes(ds_test)


train_loader = DataLoader(train_set, batch_size=args['batch_size'], shuffle=True, num_workers=args['num_workers'])
test_loader = DataLoader(test_set, batch_size=args['batch_size'], shuffle=True, num_workers=args['num_workers'])

I ran into the same problem. The issue is that Jupyter notebooks may not work properly with multiprocessing, as documented here:

Note: Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the Pool examples will not work in the interactive interpreter.
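In practice this means that, outside a notebook, code that spawns worker processes is normally guarded so the __main__ module can be imported by the children. A minimal sketch using the names from the question (args, MLP, criterion, optimizer are assumed to be defined as above):

if __name__ == '__main__':
    # the epoch loop from the question, now safe to run as a script
    for epoch in range(args['num_epochs']):
        train(train_loader, MLP, epoch, criterion, optimizer)
        test(test_loader, MLP, epoch, criterion)
        print('-----------------------')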

You have three options to solve your problem:

  • Set num_workers = 0 in train_loader and test_loader. (Easiest; see the sketch after this list.)
  • Move your code to Google Colab. It worked for me with num_workers = 6, but I think it depends on how much memory your program will use, so try increasing num_workers gradually until a crash tells you that your program is out of memory.
  • Adapt your program to support multiprocessing inside Jupyter; these resources 1 can help.
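
For the first option, a minimal sketch reusing the loaders from the question (only num_workers changes; everything else is as defined above):

# num_workers=0 loads batches in the main process, so Jupyter's
# multiprocessing limitation never comes into play
train_loader = DataLoader(train_set, batch_size=args['batch_size'], shuffle=True, num_workers=0)
test_loader = DataLoader(test_set, batch_size=args['batch_size'], shuffle=True, num_workers=0)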