Can't do lazy loading with allennlp

I am currently trying to implement lazy loading with allennlp, but I can't get it to work. My code is below.

def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # Loading Datasets
    train, dev, test = reader.read('train'), reader.read('dev'), reader.read('test')
    vocab = build_vocab(train)
    vocab.extend_from_instances(dev)

    # TODO: avoid memory consumption and lazy loading
    train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test'))

    train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)

    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()

    return model

When I comment out train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test')), the iterator does not work and training runs on 0 samples.

Building the vocabulary
100it [00:00, 442.15it/s]
building vocab: 100it [00:01, 95.84it/s]
100it [00:00, 413.40it/s]
100it [00:00, 138.38it/s]
You provided a validation dataset but patience was set to None, meaning that early stopping is disabled
0it [00:00, ?it/s]
0it [00:00, ?it/s]
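
For context, here is a minimal, self-contained sketch (hypothetical code, not from my project) of plain-Python generator behaviour that may be relevant here, assuming reader.read(...) returns a generator over Instances: a generator can only be consumed once, so whatever iterates it first (for example, vocabulary building) leaves nothing for a second pass.

def numbers():
    # tiny stand-in for something that yields items lazily, like reader.read(...)
    for i in range(3):
        yield i

gen = numbers()
print(list(gen))  # [0, 1, 2]
print(list(gen))  # [] -- the generator is already exhausted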

I was wondering if there is any solution to avoid this. Thank you.


Addendum (added May 5).

Currently I am trying to avoid putting all of the sample data into memory before training the model.

So I implemented the _read method as a generator. My understanding was that by calling this method and wrapping the result with SimpleDataLoader, I could actually pass the data to the model.

In the DatasetReader, the _read method looks like the code below. As I understand it, this is a generator, which should avoid the memory consumption.

    @overrides
    def _read(self, train_dev_test_flag: str) -> Iterator[Instance]:
        '''
        :param train_dev_test_flag: 'train', 'dev', 'test'
        :return: a generator that yields Instances one at a time
        '''
        if train_dev_test_flag == 'train':
            dataset = self._train_loader()
            random.shuffle(dataset)
        elif train_dev_test_flag == 'dev':
            dataset = self._dev_loader()
        elif train_dev_test_flag == 'test':
            dataset = self._test_loader()
        else:
            raise NotImplementedError(
                "{} is not a valid flag. Choose from train, dev and test".format(train_dev_test_flag))

        if self.config.debug:
            dataset = dataset[:self.config.debug_data_num]

        for data in tqdm(enumerate(dataset)):
            data = self._one_line_parser(data=data, train_dev_test_flag=train_dev_test_flag)
            yield self.text_to_instance(data)
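
For reference, this is roughly how I expected the reader to behave (a minimal usage sketch): nothing in the generator body runs until instances are actually consumed.

instances = reader.read('train')        # a generator; no Instance has been built yet
first_instance = next(iter(instances))  # one Instance is parsed on demand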

Also, build_data_loaders actually looks like this:

def build_data_loaders(config,
    train_data: List[Instance],
    dev_data: List[Instance],
    test_data: List[Instance]) -> Tuple[DataLoader, DataLoader, DataLoader]:

    train_loader = SimpleDataLoader(train_data, config.batch_size_for_train, shuffle=False)
    dev_loader = SimpleDataLoader(dev_data, config.batch_size_for_eval, shuffle=False)
    test_loader = SimpleDataLoader(test_data, config.batch_size_for_eval, shuffle=False)

    return train_loader, dev_loader, test_loader

However, for a reason I don't understand, this code does not work:

def biencoder_training():
    params = BiEncoderExperiemntParams()
    config = params.opts
    reader = SmallJaWikiReader(config=config)

    # Loading Datasets
    train, dev, test = reader.read('train'), reader.read('dev'), reader.read('test')
    vocab = build_vocab(train)
    vocab.extend_from_instances(dev)

    train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)
    train_loader.index_with(vocab)
    dev_loader.index_with(vocab)

    embedder = emb_returner()
    mention_encoder, entity_encoder = Pooler_for_mention(word_embedder=embedder), \
                                      Pooler_for_cano_and_def(word_embedder=embedder)

    model = Biencoder(mention_encoder, entity_encoder, vocab)

    trainer = build_trainer(lr=config.lr,
                            num_epochs=config.num_epochs,
                            model=model,
                            train_loader=train_loader,
                            dev_loader=dev_loader)
    trainer.train()

    return model

In this code, SimpleDataLoader wraps the generator as it is. I want the kind of lazy loading that allennlp did in version 0.9.

But this code iterates over 0 instances during training, so for now I have added

train, dev, test = list(reader.read('train')), list(reader.read('dev')), list(reader.read('test'))

before

train_loader, dev_loader, test_loader = build_data_loaders(config, train, dev, test)

and it works. But this means I cannot train or evaluate the model until I have stored all the instances in memory. Instead, I would like each batch to be loaded into memory only when it is needed for training.

SimpleDataLoader can't do lazy loading. You should use MultiProcessDataLoader instead. Setting max_instances_in_memory to a non-zero integer (typically some multiple of your batch size) will trigger lazy loading.
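
As a rough sketch of what that could look like with the code above (the batch-size fields and the 'train'/'dev' flags are reused from the question, and it assumes those flags are acceptable as the reader's data_path argument; the max_instances_in_memory values are just an example, a few times the batch size):

from allennlp.data import Vocabulary
from allennlp.data.data_loaders import MultiProcessDataLoader

train_loader = MultiProcessDataLoader(reader, 'train',
                                      batch_size=config.batch_size_for_train,
                                      shuffle=True,
                                      max_instances_in_memory=config.batch_size_for_train * 4)
dev_loader = MultiProcessDataLoader(reader, 'dev',
                                    batch_size=config.batch_size_for_eval,
                                    max_instances_in_memory=config.batch_size_for_eval * 4)

# The vocabulary can still be built without holding every Instance at once:
# iter_instances() streams them from the reader.
vocab = Vocabulary.from_instances(train_loader.iter_instances())
train_loader.index_with(vocab)
dev_loader.index_with(vocab)

Note that with max_instances_in_memory set, shuffling happens within each in-memory chunk rather than over the whole dataset, which is the usual trade-off for streaming the data.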