Lists of PyTorch Lightning sub-models don't get transferred to GPU

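When using PyTorch Lightning on CPU, everything works fine. However, when using GPUs, I get a RuntimeError: Expected all tensors to be on the same device.

The problem seems to come from the model, which uses a list of sub-models that doesn't get passed to the GPU: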

class LambdaLayer(LightningModule):
    def __init__(self, fun):
        super(LambdaLayer, self).__init__()
        self.fun = fun

    def forward(self, x):
        return self.fun(x)

class TorchModel(LightningModule):
    def __init__(self):
        super(TorchModel, self).__init__()
        self.cat_layers = [TorchCatEmbedding(cat) for cat in columns_to_embed]
        self.num_layers = [LambdaLayer(lambda x: x[:, idx:idx+1]) for _, idx in numeric_columns]
        self.ffo = TorchFFO(len(self.num_layers) + sum([embed_dim(l) for l in self.cat_layers]), y.shape[1])
        self.softmax = torch.nn.Softmax(dim=1)

model = TorchModel()
trainer = Trainer(gpus=-1)

Before running trainer(model):

>>> model.device
device(type='cpu')

>>> model.ffo.device
device(type='cpu')

>>> model.cat_layers[0].device
device(type='cpu')

After running trainer(model):

>>> model.device
device(type='cuda', index=0) # <---- correct

>>> model.ffo.device
device(type='cuda', index=0) # <---- correct

>>> model.cat_layers[0].device
device(type='cpu') # <---- still showing 'cpu'

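Apparently PyTorch Lightning is not able to transfer the lists of sub-models to the GPU. How do I transfer the whole model, including the lists of sub-models (cat_layers and num_layers), to the GPU?

Sub-modules held in a plain Python list are not registered with the parent module, so they can't be converted as is. You need to use ModuleList instead, i.e.: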

...
from torch.nn import ModuleList
...

class TorchModel(LightningModule):
    def __init__(self):
        super(TorchModel, self).__init__()
        self.cat_layers = ModuleList([TorchCatEmbedding(cat) for cat in columns_to_embed])
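        # idx=idx binds the current index to each lambda (closures otherwise see only the last idx)
        self.num_layers = ModuleList([LambdaLayer(lambda x, idx=idx: x[:, idx:idx+1]) for _, idx in numeric_columns])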
        self.ffo = TorchFFO(len(self.num_layers) + sum([embed_dim(l) for l in self.cat_layers]), y.shape[1])
        self.softmax = torch.nn.Softmax(dim=1)
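
To see why the plain list fails, here is a minimal sketch in plain PyTorch (LightningModule subclasses torch.nn.Module, so the same registration rules should apply). The class names PlainListModel and ModuleListModel are made up for illustration, and a dtype conversion stands in for the device move so the example runs without a GPU; both go through the same registry of sub-modules.

```python
import torch
from torch import nn


class PlainListModel(nn.Module):
    """Hypothetical model that keeps its layers in a plain Python list."""

    def __init__(self):
        super().__init__()
        # A plain list is just an attribute: the layers are NOT registered.
        self.layers = [nn.Linear(4, 4) for _ in range(3)]


class ModuleListModel(nn.Module):
    """Hypothetical model that keeps its layers in an nn.ModuleList."""

    def __init__(self):
        super().__init__()
        # ModuleList registers every layer as a sub-module of this model.
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])


plain = PlainListModel()
registered = ModuleListModel()

# Unregistered layers are invisible to parameters(), and hence to optimizers.
n_plain = len(list(plain.parameters()))
n_registered = len(list(registered.parameters()))  # 3 weights + 3 biases

# .to(...) only reaches registered sub-modules; this holds for devices too.
plain.to(torch.float64)
registered.to(torch.float64)
plain_dtype = plain.layers[0].weight.dtype
registered_dtype = registered.layers[0].weight.dtype

print(n_plain, n_registered)
print(plain_dtype, registered_dtype)
```

The layers in the plain list keep their original dtype, just as cat_layers stayed on the CPU in the question. Note also that their parameters never show up in parameters(), so an optimizer built from model.parameters() would not train them either.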

EDIT: I'm not sure what the Lightning equivalent is, or whether one exists. See also: PyTorch Lightning - LightningModule for ModuleList / ModuleDict?