自动下载 public GDrive 文件夹中的大文件

Question

我的最终目标是使用 python（例如 gdown）自动下载 public GDrive 文件夹中的所有文件（每个文件大到 3G）。经过大量尝试，我终于找到了一种使用 Google 工作表中的 Google 脚本从文件夹中提取所有 link 的方法，所以我确实拥有所有 link 的所有内容我需要以这种格式下载的文件：

https://drive.google.com/file/d/IDA/view?usp=drivesdk&resourcekey=otherIDA
https://drive.google.com/file/d/IDB/view?usp=drivesdk&resourcekey=otherIDB
https://drive.google.com/file/d/IDC/view?usp=drivesdk&resourcekey=otherIDC
...
https://drive.google.com/file/d/IDZ/view?usp=drivesdk&resourcekey=otherIDZ

然后我想用 for 循环遍历 links 以下载所有文件：

import gdown
import re
regex = "([\w-]){33}|([\w-]){19}"
download_url_basename = "https://drive.google.com/uc?export=download&id="
for i, l in enumerate(links_to_download):
    file_id = re.search(regex, url)[0]
    gdown.download(download_url_basename + file_id, f"file_{i}")

但是我遇到了：

Permission denied: https://drive.google.com/uc?id=ID
Maybe you need to change permission over 'Anyone with the link'?

这是一个 public 存储库，所以虽然我可以访问它并且有足够的权限手动下载每个文件，但我只能在查看模式下获得可共享的 links。

有没有办法自动将 link 转换为可以自动下载的内容？是故意屏蔽的吗？有什么方法可以自动完成而不是手动下载 400 个文件吗？

编辑： 略有相关，但此问题并非源于同一个问题，也没有提供自动处理任何问题的方法。

编辑 2： 我使用 google 驱动器 API python SDK，使用 Google 控制台生成服务帐户，激活 OAuth2 并生成 OAuth2 json 凭据以构建 drive_service 对象：

from google_auth_oauthlib.flow import Flow, InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload
from google.auth.transport.requests import Request
import io
import re
SCOPES = ['https://www.googleapis.com/auth/drive']
CLIENT_SECRET_FILE = "myjson.json"
authorized_port = 6006 # authorize URI redirect on the console
flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRET_FILE, SCOPES)
cred = flow.run_local_server(port=authorized_port)
drive_service = build("drive", "v3", credentials=cred)
download_url_basename = "https://drive.google.com/uc?id="
regex = "([\w-]){33}|([\w-]){19}"
for i, l in enumerate(links_to_download):
    url = l
    file_id = re.search(regex, url)[0]
    request = drive_service.files().get_media(fileId=file_id)
    fh = io.BytesIO()
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))

但是我现在得到：

googleapiclient.errors.HttpError: <HttpError 404 when requesting https://www.googleapis.com/drive/v3/files/fileId?alt=media returned "File not found: fileID.". Details: "[{'domain': 'global', 'reason': 'notFound', 'message': 'File not found: fileId.', 'locationType': 'parameter', 'location': 'fileId'}]">

找到一个相关的有什么想法吗？

Answer 1

好的，多亏了 Google API，我终于能够让它工作了！

从获取文件夹内的链接列表到下载它们的整个过程太麻烦了我可能有一天会写一篇博客post:

from google_auth_oauthlib.flow import Flow, InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload
from google.auth.transport.requests import Request
import io
import re
SCOPES = ['https://www.googleapis.com/auth/drive']
CLIENT_SECRET_FILE = "myjson.json"
authorized_port = 6006 # authorize URI redirect on the console
flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRET_FILE, SCOPES)
cred = flow.run_local_server(port=authorized_port)
drive_service = build("drive", "v3", credentials=cred)
regex = "(?<=https://drive.google.com/file/d/)[a-zA-Z0-9]+"
for i, l in enumerate(links_to_download):
    url = l
    file_id = re.search(regex, url)[0]
    request = drive_service.files().get_media(fileId=file_id)
    fh = io.FileIO(f"file_{i}", mode='wb')
    downloader = MediaIoBaseDownload(fh, request)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
        print("Download %d%%." % int(status.progress() * 100))

自动下载 public GDrive 文件夹中的大文件

Automatically download large files in public GDrive folder

python

web-scraping

google-drive-api

gdrive