Wait. BoW and Contextual Embeddings have different sizes

Using the OCTIS package, I am running the CTM topic model on the BBC (default) dataset.

import octis
from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM

# OCTIS benchmark datasets to evaluate on.
datasets = ['BBC_news', '20NewsGroup', 'DBLP', 'M10']
# Topic counts to sweep: 5, 10, ..., 100.
num_topics = [i for i in range(5, 101, 5)]

ALGORITHM = CTM

def create_topic_dict(algorithm):
    # Train one model per (dataset, topic count) pair and return a nested
    # dict of the form {dataset: {num_topics: trained_model}}.
    run_dict = dict()
    for data in datasets:
        data_dict = dict()
        for top in num_topics:
            dataset = Dataset()
            dataset.fetch_dataset(data)
            model = algorithm(num_topics=top)
            trained_model = model.train_model(dataset)
            data_dict[top] = trained_model
        run_dict[data] = data_dict
    return run_dict

topic_dict = dict()
n_runs = 5  # illustrative; the snippet does not show the actual number of runs
for run in range(n_runs):  # per the traceback: topic_dict[run] = create_topic_dict(ALGORITHM)
    print("Run:", run)
    topic_dict[run] = create_topic_dict(ALGORITHM)

I call this function multiple times so that my results are more robust. However, already on the first call I get the following exception:

Run: 0

Batches: 100%|██████████| 16/16 [07:44<00:00, 29.04s/it]

Batches: 100%|██████████| 4/4 [01:46<00:00, 26.67s/it]

Batches: 100%|██████████| 4/4 [01:40<00:00, 25.07s/it]
Traceback (most recent call last):
  File "/hpc/uu_ics_ads/erijcken/ACL/CTM_ACL_HPC.py", line 38, in <module>
    topic_dict[run] = create_topic_dict(ALGORITHM)
  File "/hpc/uu_ics_ads/erijcken/ACL/CTM_ACL_HPC.py", line 28, in create_topic_dict
    trained_model = model.train_model(dataset)
  File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/CTM.py", line 95, in train_model
    x_train, x_test, x_valid, input_size = self.preprocess(
  File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/CTM.py", line 177, in preprocess
    train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
  File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/contextualized_topic_models/datasets/dataset.py", line 17, in __init__
    raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents. 

Why am I getting this exception, and what can I do to resolve it? As far as I can tell, I have taken all the steps needed to run the model.

I am one of the developers of OCTIS.

Short answer: if I understand your problem correctly, you can solve this issue by setting CTM's "bert_path" parameter and making it dataset-specific, e.g. CTM(bert_path="path/to/store/the/files/" + data).
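Applied to the inner loop of create_topic_dict from the question, the change would look something like this (the directory path is only a placeholder; any writable, dataset-specific location works):

        for top in num_topics:
            dataset = Dataset()
            dataset.fetch_dataset(data)
            # Dataset-specific cache location: embeddings generated for one
            # corpus can no longer be picked up while training on another.
            model = algorithm(num_topics=top,
                              bert_path="path/to/store/the/files/" + data)
            trained_model = model.train_model(dataset)
            data_dict[top] = trained_model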

Longer answer: I think the problem is that CTM generates the contextualized document representations and stores them in files with default names. If those files already exist, it reuses them instead of generating new representations, even if the dataset has changed in the meantime. CTM then raises this exception because it is combining the BoW representation of one dataset with the contextualized representations of another, so the two representations have different dimensions. Naming the files after the dataset lets the model retrieve the correct representations.
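To make the failure mode concrete, here is a minimal sketch of that caching pattern. This is not OCTIS's actual implementation: the encode stand-in, the file name, and the pickle format are all placeholders for illustration.

import os
import pickle

def encode(texts):
    # Stand-in for the real sentence encoder (OCTIS delegates this step to
    # sentence-transformers); here each text becomes a dummy 1-d vector.
    return [[float(len(t))] for t in texts]

def load_or_generate_embeddings(texts, path):
    # If a file with this name already exists it is reused as-is, even when
    # `texts` now comes from a different corpus. A stale cache built from a
    # corpus with a different number of documents yields an embedding matrix
    # whose row count no longer matches the new BoW matrix, which is exactly
    # the size mismatch the exception complains about.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    embeddings = encode(texts)
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings

# With a fixed default path, the second call silently returns the embeddings
# cached for the first corpus:
docs_a = ["first doc", "second doc"]
docs_b = ["a", "bb", "ccc"]
emb_a = load_or_generate_embeddings(docs_a, "train_embeddings.pkl")
emb_b = load_or_generate_embeddings(docs_b, "train_embeddings.pkl")
assert len(emb_b) == len(docs_a)  # 2 rows, but the new BoW has 3 documents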

If you have other problems, feel free to open a GitHub issue in the repo. I found out about this question by chance.