NonMatchingSplitsSizesError loading huggingface BookCorpus
I want to load bookcorpus like this:
train_ds, test_ds = load_dataset('bookcorpus', split=['train', 'test'])
However, I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/marcelbraasch/.local/lib/python3.8/site-packages/datasets/load.py", line 1627, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/marcelbraasch/.local/lib/python3.8/site-packages/datasets/builder.py", line 607, in download_and_prepare
    self._download_and_prepare(
  File "/home/marcelbraasch/.local/lib/python3.8/site-packages/datasets/builder.py", line 709, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/marcelbraasch/.local/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 74, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=4853859824, num_examples=74004228, dataset_name='bookcorpus'), 'recorded': SplitInfo(name='train', num_bytes=2982081448, num_examples=45726619, dataset_name='bookcorpus')}]
I then want to save it to disk, since I don't want to download it every time I use it. What causes this error?
BookCorpus is no longer publicly available.
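The last line of the traceback shows what the check objected to: the dataset metadata expects 74004228 training examples, but only 45726619 were recorded after the download was prepared. As a rough illustration of that comparison, here is a minimal sketch. It is not the library's implementation: the `SplitInfo` and exception classes below are simplified stand-ins, `find_bad_splits` is a hypothetical helper, and the sketch assumes the check compares example counts per split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitInfo:
    """Simplified stand-in for the library's SplitInfo (assumption: only these fields matter here)."""
    name: str
    num_bytes: int
    num_examples: int


class NonMatchingSplitsSizesError(ValueError):
    """Illustrative stand-in for the exception named in the traceback."""


def find_bad_splits(expected, recorded):
    """Hypothetical helper: collect splits whose example counts disagree."""
    return [
        {"expected": expected[name], "recorded": recorded[name]}
        for name in expected
        if name in recorded
        and expected[name].num_examples != recorded[name].num_examples
    ]


def verify_splits(expected, recorded):
    """Raise if any split was prepared with a different example count than expected."""
    bad_splits = find_bad_splits(expected, recorded)
    if bad_splits:
        raise NonMatchingSplitsSizesError(str(bad_splits))


# Figures copied from the traceback in the question.
expected = {"train": SplitInfo("train", 4853859824, 74004228)}
recorded = {"train": SplitInfo("train", 2982081448, 45726619)}

# How many examples the prepared data is missing relative to the metadata.
missing = expected["train"].num_examples - recorded["train"].num_examples

try:
    verify_splits(expected, recorded)
except NonMatchingSplitsSizesError:
    print(f"train split is short by {missing} examples")
```

In other words, the error is a sign that the data actually obtained is incomplete relative to what the dataset script describes, which fits with the original source no longer being available.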
Here is a workaround: