How to load just one chosen file of a way too large Kaggle dataset from Kaggle into Colab

Question

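If I want to switch from a Kaggle notebook to a Colab notebook, I can download the notebook from Kaggle and open it in Google Colab. The problem is that you normally also have to download the Kaggle dataset and upload it again, which is quite an effort.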

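If you have a small dataset, or if you need just one smaller file of a dataset, you can put the data into the same folder structure that the Kaggle notebook expects. You create that structure in Google Colab, for example kaggle/input/ or whatever the notebook uses, and upload the file there. That is not the issue.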
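The folder setup described above can be sketched in Python. This is only an illustration: the helper name place_in_kaggle_layout, the default root /kaggle/input and the dataset folder name are assumptions, so adjust them to whatever paths your notebook actually reads.

```python
import shutil
from pathlib import Path


def place_in_kaggle_layout(uploaded_file, dataset_slug, root="/kaggle/input"):
    """Copy an uploaded file to <root>/<dataset_slug>/<file name>.

    `root` and `dataset_slug` are assumptions for this sketch; use the
    folder names your Kaggle notebook expects.
    """
    target_dir = Path(root) / dataset_slug
    # Create the nested folders; does nothing if they already exist.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(uploaded_file).name
    shutil.copy2(uploaded_file, target)
    return target
```

In Colab you would call it after files.upload(), for example place_in_kaggle_layout("metadata.csv", "CORD-19-research-challenge"), so that the notebook's original read paths keep working.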

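If you have a large dataset, though, you can either:

mount your Google Drive and use the dataset / file from there,

or download the Kaggle dataset from Kaggle into Colab, following the official Colab guide at Easiest way to download kaggle data in Google Colab (please use the link for more details):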

Please follow the steps below to download and use kaggle data within Google Colab:

Go to your Kaggle account, Scroll to API section and Click Expire API Token to remove previous tokens

Click on Create New API Token - It will download kaggle.json file on your machine.

Go to your Google Colab project file and run the following commands:

   ! pip install -q kaggle

Choose the kaggle.json file that you downloaded:

   from google.colab import files
   files.upload()

Make directory named kaggle and copy kaggle.json file there:

   ! mkdir ~/.kaggle
   ! cp kaggle.json ~/.kaggle/

Change the permissions of the file:

   ! chmod 600 ~/.kaggle/kaggle.json

That's all! You can check if everything's okay by running this command:

   ! kaggle datasets list

Download Data:

   ! kaggle competitions download -c 'name-of-competition'

Or, if you want to download a dataset (taken from the comments):

! kaggle datasets download -d USERNAME/DATASET_NAME
You can get these dataset names (if unclear) from "copy API command" in the "three-dots drop down" next to "New Notebook" button on the Kaggle dataset page.
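The token steps from the quoted guide (mkdir, cp, chmod 600) can also be done in one Python cell. The following is a minimal sketch, not part of the guide; the function name install_kaggle_token and its home parameter are made up for illustration.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(token_file="kaggle.json", home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    `home` defaults to the current user's home directory; the parameter
    exists only so the sketch can be tried against another folder.
    """
    home_dir = Path(home) if home is not None else Path.home()
    config_dir = home_dir / ".kaggle"
    # Like `mkdir -p`: safe to run again in the same session.
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.copyfile(token_file, target)
    # Same effect as `chmod 600 ~/.kaggle/kaggle.json`.
    os.chmod(target, 0o600)
    return target
```

Unlike a plain mkdir, this does not fail when the cell is re-run and the .kaggle directory already exists.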

Here is the problem: this seems to work only for smaller datasets. I tried

kaggle datasets download -d allen-institute-for-ai/CORD-19-research-challenge

and the API returned 404 - Not Found, probably because downloading 40 GB of data is restricted.

In that case you can only download the file you need and use the mounted Google Drive, or you have to use Kaggle instead of Colab.

Is there a way to download into Colab just the 800 MB metadata.csv file of the 40 GB CORD-19 Kaggle dataset? Here is the link to the file's information page:

https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge?select=metadata.csv

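I have now loaded the file into Google Drive, and I am curious whether that is already the best approach. By comparison, on Kaggle the whole dataset is already available without any download and loads quickly.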

PS: After downloading the zip file from Kaggle to Colab, it needs to be extracted. Quoting the guide again:

Use unzip command to unzip the data:

For example, create a directory named train,
   ! mkdir train
unzip train data there,
   ! unzip train.zip -d train
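If you prefer to stay in Python rather than call the shell, the same extraction can be sketched with the standard zipfile module. The helper name unzip_to is an assumption for this example; train.zip and train are the names from the quoted guide.

```python
import zipfile
from pathlib import Path


def unzip_to(archive, dest):
    """Extract every member of `archive` into `dest` and return their names."""
    dest_dir = Path(dest)
    # Equivalent of `mkdir train`, but tolerant of an existing folder.
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

Calling unzip_to("train.zip", "train") corresponds to the two shell commands above.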

Update: I recommend mounting Google Drive

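After trying both approaches (mounting Google Drive or loading directly from Kaggle), I recommend mounting Google Drive if your architecture allows it. The advantage is that the file needs to be uploaded only once: Google Colab and Google Drive are directly connected. Mounting Google Drive costs extra steps: you have to download the file from Kaggle, unzip it and upload it to Google Drive, and in each Python session you have to get and activate a token to mount the Drive, but activating the token is done quickly. With Kaggle, you have to transfer the file from Kaggle to Google Colab in every session, which takes more time and traffic.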
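For reference, a minimal sketch of the Drive approach. drive.mount from google.colab is the standard Colab call; the helper names and the Drive folder cord19/ used in the comment are hypothetical, so substitute the location where you uploaded the file.

```python
from pathlib import Path


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive inside Colab and return the 'MyDrive' folder.

    Returns None when not running in Colab, where google.colab is missing.
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)  # prompts for authorization in each new session
    return Path(mount_point) / "MyDrive"


def resolve_on_drive(drive_root, relative_path):
    """Return the full path of a file on the mounted Drive, or raise if absent."""
    path = Path(drive_root) / relative_path
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; upload it to Drive first")
    return path


# Hypothetical usage in a Colab cell:
#   root = mount_drive()
#   metadata_path = resolve_on_drive(root, "cord19/metadata.csv")
```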

Answer 1

You can write a script that downloads only certain files, or downloads the files one after another:

import os

os.environ['KAGGLE_USERNAME'] = "YOUR_USERNAME_HERE"
os.environ['KAGGLE_KEY'] = "YOUR_TOKEN_HERE"

!kaggle datasets files allen-institute-for-ai/CORD-19-research-challenge

!kaggle datasets download allen-institute-for-ai/CORD-19-research-challenge -f metadata.csv
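The "one file after another" idea can be sketched as a small wrapper around the same CLI call. The helper names below are made up for illustration; the -f and -p options are the CLI's file-name and output-path flags. The sketch assumes the kaggle package is installed and the credentials are set as shown above.

```python
import subprocess


def build_download_command(dataset, file_name=None, dest="."):
    """Build the `kaggle datasets download` argument list for one file."""
    cmd = ["kaggle", "datasets", "download", dataset, "-p", dest]
    if file_name is not None:
        cmd += ["-f", file_name]  # restrict the download to a single file
    return cmd


def download_files(dataset, file_names, dest="."):
    """Download the listed files one by one; raises if a call fails."""
    for name in file_names:
        subprocess.run(build_download_command(dataset, name, dest), check=True)


# Hypothetical usage:
#   download_files("allen-institute-for-ai/CORD-19-research-challenge",
#                  ["metadata.csv"], dest="data")
```

Depending on the CLI version, a large single file may arrive as a .zip archive, in which case it still needs to be extracted afterwards.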
