Training PyTorch models on different machines leads to different results

I trained the same model on two different machines, but the trained models come out different. I took the following measures to ensure reproducibility:

import random
import numpy as np
import torch
from torch.utils.data import DataLoader

# set the random number seeds
random.seed(0)
torch.cuda.manual_seed(0)  # seeds the current GPU only
np.random.seed(0)
# make cuDNN deterministic
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
# load data in the main process (no worker processes)
DataLoader(dataset, num_workers=0)

When I train the same model several times on the same machine, the resulting models are always identical. However, the models trained on two different machines are not. Is this normal? Are there any other tricks I can use?

Yes, this is expected: per the PyTorch reproducibility notes, completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Beyond what you have already set, there are several places where randomness can additionally creep in, for example:

PyTorch random number generator

You can use torch.manual_seed() to seed the RNG for all devices (both CPU and CUDA):
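import torch

# seed the RNG for all devices (both CPU and CUDA)
torch.manual_seed(0)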

CUDA convolution determinism

While disabling CUDA convolution benchmarking (discussed above) ensures that CUDA selects the same algorithm each time an application is run, that algorithm itself may be nondeterministic, unless either torch.use_deterministic_algorithms(True) or torch.backends.cudnn.deterministic = True is set. The latter setting controls only this behavior, unlike torch.use_deterministic_algorithms() which will make other PyTorch operations behave deterministically, too.
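As a minimal sketch, either option can be set once at the top of the training script:

import torch

# Option 1: force determinism for every PyTorch operation that supports it
# (raises a RuntimeError if an op has no deterministic implementation)
torch.use_deterministic_algorithms(True)

# Option 2: restrict determinism to cuDNN convolutions only
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True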

CUDA RNN and LSTM

In some versions of CUDA, RNNs and LSTM networks may have non-deterministic behavior. See torch.nn.RNN() and torch.nn.LSTM() for details and workarounds.
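As a rough sketch of the documented workaround (the exact variable and value depend on your CUDA version; check the torch.nn.LSTM docs for your setup), you set an environment variable before any CUDA work happens:

import os

# CUDA 10.2+: constrain the cuBLAS workspace so RNN/LSTM kernels run deterministically
# (on CUDA 10.1 the docs recommend CUDA_LAUNCH_BLOCKING=1 instead)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"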

DataLoader

DataLoader will reseed workers following the "Randomness in multi-process data loading" algorithm. Use worker_init_fn() and a seeded generator to preserve reproducibility:
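import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # derive a per-worker seed from the main process's torch seed
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

loader = DataLoader(
    dataset,                     # placeholder: your Dataset instance
    batch_size=16,               # placeholder value
    num_workers=4,               # placeholder value
    worker_init_fn=seed_worker,
    generator=g,
)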