Wait. BoW and Contextual Embeddings have different sizes
Using the OCTIS package, I am running the CTM topic model on the BBC (default) dataset.
import octis
from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM

# Datasets bundled with OCTIS, and the range of topic counts to sweep over
datasets = ['BBC_news', '20NewsGroup', 'DBLP', 'M10']
num_topics = [i for i in range(5, 101, 5)]
ALGORITHM = CTM

def create_topic_dict(algorithm):
    """Train one model per (dataset, num_topics) pair and collect the results."""
    run_dict = dict()
    for data in datasets:
        data_dict = dict()
        for top in num_topics:
            dataset = Dataset()
            dataset.fetch_dataset(data)
            model = algorithm(num_topics=top)
            trained_model = model.train_model(dataset)
            data_dict[top] = trained_model
        run_dict[data] = data_dict
    return run_dict

topic_dict = dict()
I call this function several times so that my results are more robust.
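The driver loop is made in a form along these lines (a minimal sketch reconstructed from the line topic_dict[run] = create_topic_dict(ALGORITHM) in the traceback below; n_runs is a hypothetical placeholder):

n_runs = 5  # hypothetical placeholder; the actual number of runs is not shown
for run in range(n_runs):
    topic_dict[run] = create_topic_dict(ALGORITHM)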
However, already on the first call I get the following exception:
Run: 0
Batches: 0%| | 0/16 [00:00<?, ?it/s]
Batches: 6%|▋ | 1/16 [00:37<09:21, 37.42s/it]
Batches: 12%|█▎ | 2/16 [01:13<08:36, 36.87s/it]
Batches: 19%|█▉ | 3/16 [01:50<07:56, 36.66s/it]
Batches: 25%|██▌ | 4/16 [02:26<07:19, 36.65s/it]
Batches: 31%|███▏ | 5/16 [03:03<06:43, 36.65s/it]
Batches: 38%|███▊ | 6/16 [03:40<06:05, 36.59s/it]
Batches: 44%|████▍ | 7/16 [04:16<05:28, 36.55s/it]
Batches: 50%|█████ | 8/16 [04:52<04:51, 36.44s/it]
Batches: 56%|█████▋ | 9/16 [05:26<04:09, 35.65s/it]
Batches: 62%|██████▎ | 10/16 [05:53<03:17, 32.94s/it]
Batches: 69%|██████▉ | 11/16 [06:19<02:33, 30.71s/it]
Batches: 75%|███████▌ | 12/16 [06:41<01:52, 28.12s/it]
Batches: 81%|████████▏ | 13/16 [07:02<01:18, 26.09s/it]
Batches: 88%|████████▊ | 14/16 [07:21<00:47, 23.87s/it]
Batches: 94%|█████████▍| 15/16 [07:37<00:21, 21.57s/it]
Batches: 100%|██████████| 16/16 [07:44<00:00, 17.16s/it]
Batches: 100%|██████████| 16/16 [07:44<00:00, 29.04s/it]
Batches: 0%| | 0/4 [00:00<?, ?it/s]
Batches: 25%|██▌ | 1/4 [00:36<01:50, 36.80s/it]
Batches: 50%|█████ | 2/4 [01:14<01:14, 37.15s/it]
Batches: 75%|███████▌ | 3/4 [01:41<00:32, 32.63s/it]
Batches: 100%|██████████| 4/4 [01:46<00:00, 21.82s/it]
Batches: 100%|██████████| 4/4 [01:46<00:00, 26.67s/it]
Batches: 0%| | 0/4 [00:00<?, ?it/s]
Batches: 25%|██▌ | 1/4 [00:34<01:42, 34.31s/it]
Batches: 50%|█████ | 2/4 [01:08<01:08, 34.46s/it]
Batches: 75%|███████▌ | 3/4 [01:35<00:31, 31.02s/it]
Batches: 100%|██████████| 4/4 [01:40<00:00, 20.53s/it]
Batches: 100%|██████████| 4/4 [01:40<00:00, 25.07s/it]
Traceback (most recent call last):
File "/hpc/uu_ics_ads/erijcken/ACL/CTM_ACL_HPC.py", line 38, in <module>
topic_dict[run] = create_topic_dict(ALGORITHM)
File "/hpc/uu_ics_ads/erijcken/ACL/CTM_ACL_HPC.py", line 28, in create_topic_dict
trained_model = model.train_model(dataset)
File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/CTM.py", line 95, in train_model
x_train, x_test, x_valid, input_size = self.preprocess(
File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/CTM.py", line 177, in preprocess
train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/contextualized_topic_models/datasets/dataset.py", line 17, in __init__
raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents.
Why am I getting this exception, and what can I do to fix it? It seems to me that I have taken the steps needed to run the model.
I am one of the developers of OCTIS.
Short answer:
If I understand your problem correctly, you can solve it by modifying CTM's parameter "bert_path" and making it dataset-specific, e.g. CTM(bert_path="path/to/store/the/files/" + data).
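Applied to the loop in your question, the change would look roughly like this (a sketch; the directory path is just an example, and everything else is unchanged from your code):

def create_topic_dict(algorithm):
    run_dict = dict()
    for data in datasets:
        data_dict = dict()
        for top in num_topics:
            dataset = Dataset()
            dataset.fetch_dataset(data)
            # Dataset-specific bert_path: cached contextual embeddings generated
            # for one dataset can no longer be reused while training on another.
            model = algorithm(num_topics=top,
                              bert_path="path/to/store/the/files/" + data)
            trained_model = model.train_model(dataset)
            data_dict[top] = trained_model
        run_dict[data] = data_dict
    return run_dict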
Longer explanation:
I think the problem is related to the fact that CTM generates the document representations and stores them in files with default names. If those files already exist, it reuses them instead of generating new representations, even if the dataset has changed in the meantime. CTM then raises this exception because it is using the BoW representation of one dataset but the contextualized representation of another, resulting in two representations with different dimensions. Changing the file names according to the name of the dataset allows the model to retrieve the correct representations.
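Alternatively, if you prefer to keep a single path, deleting the stale cached files before switching datasets should have the same effect. The sketch below assumes the embeddings are stored under the bert_path prefix with suffixes such as "_train.pkl", "_test.pkl" and "_val.pkl"; that naming scheme is an assumption, so verify it against the preprocess method in your installed octis/models/CTM.py:

import os

bert_path = "path/to/store/the/files/"  # same prefix passed to CTM(bert_path=...)
for suffix in ["_train.pkl", "_test.pkl", "_val.pkl"]:
    cached = bert_path + suffix  # assumed cache-file naming; check CTM.preprocess
    if os.path.exists(cached):
        os.remove(cached)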
If you have any other questions, please open a GitHub issue in the repo; I only stumbled upon this question by chance.