How to load a dataset in streaming mode on Google Colab?

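I'm trying to save some disk space so that I can use the CommonVoice French dataset (19G) on Google Colab, because my notebook keeps crashing when it runs out of disk space. I saw in the HuggingFace documentation that a dataset can be loaded in streaming mode, so that we can iterate over it directly without having to download the entire dataset. I tried to use that mode in Google Colab but could not get it to work, and I haven't found any information about this issue.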

!pip install datasets
!pip install 'datasets[streaming]'
!pip install aiohttp

from datasets import load_dataset

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

Then I get the following error:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-24-489f8a0ca4e4> in <module>()
----> 1 common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

/usr/local/lib/python3.7/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, ignore_verifications, keep_in_memory, save_infos, script_version, use_auth_token, task, streaming, **config_kwargs)
    811         if not config.AIOHTTP_AVAILABLE:
    812             raise ImportError(
--> 813                 f"To be able to use dataset streaming, you need to install dependencies like aiohttp "
    814                 f'using "pip install \'datasets[streaming]\'" or "pip install aiohttp" for instance'
    815             )

ImportError: To be able to use dataset streaming, you need to install dependencies like aiohttp using "pip install 'datasets[streaming]'" or "pip install aiohttp" for instance

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

Is there a reason why Google Colab doesn't allow loading a dataset in streaming mode?

If not, what am I missing?
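A possible first check for the ImportError above: the traceback shows that `load_dataset` consults `config.AIOHTTP_AVAILABLE`, so it can help to confirm whether `aiohttp` is actually visible to the running interpreter. The sketch below uses only the standard library; `is_installed` is a hypothetical helper name, not part of `datasets`. It is an assumption (not confirmed by the post) that `datasets` evaluates this flag once at import time, in which case restarting the Colab runtime after installing `aiohttp` may be needed.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Report whether the packages involved in the error are importable right now.
for name in ("aiohttp", "datasets"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `aiohttp` is reported as found but the ImportError persists, the flag was probably computed before the install, which is consistent with the answer below where installing `aiohttp` first leads to a different error.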

Writing up an answer for future reference. Based on @kkgarg's comment, the streaming feature does not seem to be implemented yet.

!pip install aiohttp
!pip install datasets
from datasets import load_dataset, load_metric

common_voice_train = load_dataset("common_voice", "fr", split="train", streaming=True)

This triggers the following error:

/usr/local/lib/python3.7/dist-packages/datasets/utils/streaming_download_manager.py in _get_extraction_protocol(self, urlpath)
    137         elif path.endswith(".zip"):
    138             return "zip"
--> 139         raise NotImplementedError(f"Extraction protocol for file at {urlpath} is not implemented yet")
    140 
    141     def download_and_extract(self, url_or_urls):

NotImplementedError: Extraction protocol for file at https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/tr.tar.gz is not implemented yet

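This suggests that streaming is not implemented or not supported here. It may be because using common_voice means the files have to be extracted, which streaming does not support (?). The feature itself must be implemented, since it is in the documentation...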
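To illustrate why a `.tar.gz` URL ends in `NotImplementedError`, here is a minimal sketch of an extension-based dispatch shaped like the fragment visible in the traceback. Only the `.zip` branch and the final `raise` come from the traceback; the function name `get_extraction_protocol`, the `PLAIN_EXTENSIONS` tuple and the example URLs are assumptions made for illustration, not the library's actual code.

```python
from typing import Optional

# Hypothetical set of extensions that need no extraction (not taken from the library).
PLAIN_EXTENSIONS = ("csv", "json", "jsonl", "txt")


def get_extraction_protocol(urlpath: str) -> Optional[str]:
    """Pick an extraction protocol from the file extension, or fail if unknown."""
    path = urlpath.lower()
    if path.rsplit(".", 1)[-1] in PLAIN_EXTENSIONS:
        return None  # plain file, can be read as-is
    if path.endswith(".zip"):
        return "zip"
    # Anything else, including .tar.gz, falls through to the error seen above.
    raise NotImplementedError(
        f"Extraction protocol for file at {urlpath} is not implemented yet"
    )


for url in ("https://example.com/data.zip", "https://example.com/corpus.tar.gz"):
    try:
        print(url, "->", get_extraction_protocol(url))
    except NotImplementedError as err:
        print(url, "->", err)
```

Under this reading, the error depends on the archive format of the dataset files rather than on Google Colab itself.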