YouTube Data API Page Token Question (Python)

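I'm trying to download the metadata of videos from 2019. Every time I run my code, it exceeds the quota limit, even though I have fewer than 100 videos from that period. Can anyone suggest a better way to write this code?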

try:
    request = youtube.search().list(
        part = 'id, snippet',
        type = 'video',
        publishedAfter = '2018-12-31T23:59:59Z',
        publishedBefore = '2020-01-01T00:00:00Z',
        order = 'date',
        fields = 'nextPageToken,items(id,snippet)',
        pageToken = None,
        maxResults = 50
    )
    response = request.execute()
    nextPageToken = None

    while True:
        request = youtube.search().list(
        pageToken = nextPageToken,
        part = 'id, snippet',
        type = 'video',
        fields = 'nextPageToken,items(id,snippet)',
        maxResults = 50
        )

        response = request.execute()
        nextPageToken = response['nextPageToken']
        items = response['items']
        if response['nextPageToken'] == None:
            break
        for each_item in items:
            video_id = each_item['id']['videoId']
            sub_items = each_item['snippet']
            for sub_item in sub_items:
                video_item[sub_item] = sub_items[sub_item ]

            video_data[video_id] = video_item
except Exception as e:
    print('Error in get_video_data: {0}'.format(e))

Thanks!

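Please note that your API call to the Search.list endpoint runs against the entire set of YouTube videos published in that year. The call specifies no other filtering criteria, which means that your query (through pagination) will potentially return millions of video entries.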
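For a sense of scale, here is a back-of-the-envelope sketch of the quota arithmetic. It assumes the figures given in the YouTube Data API quota documentation at the time of writing (100 units per Search.list call and a default allocation of 10,000 units per day); your project's numbers may differ, and the helper name search_quota_cost is made up for this illustration.

```python
import math

# Assumed figures, taken from the quota documentation; verify them for your project.
SEARCH_LIST_COST = 100   # quota units charged per Search.list call
DAILY_QUOTA = 10_000     # default daily quota allocation


def search_quota_cost(total_results, page_size=50):
    """Estimate the quota units needed to page through `total_results` entries."""
    pages = max(1, math.ceil(total_results / page_size))
    return pages * SEARCH_LIST_COST


# Fewer than 100 videos at 50 per page needs about two calls.
print(search_quota_cost(99))            # 200

# An unfiltered query is cut off once the daily quota is spent.
print(DAILY_QUOTA // SEARCH_LIST_COST)  # 100
```

Under these assumptions, a properly filtered query for fewer than 100 videos costs a few hundred units, while an unfiltered one exhausts the daily quota after about 100 pages.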

If you are actually looking for your own videos, then your call to the Search.list endpoint should contain either the forMine or the channelId request parameter:

  • When you construct the youtube object from the discovery.build method using its parameter credentials (that is, you're issuing an authorized request), use the request parameter forMine, as follows:
request = youtube.search().list(
    forMine = True,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)

Note that this alternative turned out not to be viable, according to the findings recorded in the Update and Fix section below.

  • When you construct the youtube object from the discovery.build method using its parameter developerKey (that is, you're not issuing an authorized request), use the request parameter channelId, as follows:
request = youtube.search().list(
    channelId = CHANNEL_ID,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)

Note that CHANNEL_ID is the ID of your channel (or of any other channel, for that matter).

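The difference between the two API calls above is the following: when issuing an authorized request (the first bullet above), you'll get all the videos of your channel, including the non-public ones (i.e. those that have their privacyStatus set to private or unlisted); on the other hand, when using an API key (the second bullet above), you'll get only the public videos (i.e. those that have their privacyStatus set to public), even if CHANNEL_ID is the ID of your own channel.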


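Now, unfortunately, your code above has another issue: your two calls to the Search.list endpoint are not identical modulo the pageToken request parameter. That's because the second call does not carry the request parameters publishedAfter and publishedBefore (it omits order as well).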

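This difference means that you are not properly paginating the result set of the first API call, even though the parameter pageToken is passed to the second API call.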

Fortunately, the Google APIs Client Library for Python that you're using implements API result set pagination in a simple pythonic way (I'll illustrate it below on the second bullet above):

request = youtube.search().list(
    channelId = CHANNEL_ID,
    part = 'id,snippet',
    type = 'video',
    publishedAfter = '2018-12-31T23:59:59Z',
    publishedBefore = '2020-01-01T00:00:00Z',
    order = 'date',
    fields = 'nextPageToken,items(id,snippet)',
    maxResults = 50
)
video_data = {}

while request:
    response = request.execute()

    for item in response['items']:
        video_id = item['id']['videoId']
        video_item = item['snippet']
        video_data[video_id] = video_item

    request = youtube.search().list_next(
        request, response)

The code above shows that there is no need to repeat the first API call in full with a pageToken parameter added; a much simpler statement suffices:

    request = youtube.search().list_next(
        request, response)

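This statement uses the value of the response object's nextPageToken attribute to construct, from the old request object, a new request whose pageToken attribute is set correctly. When the response carries no nextPageToken, list_next returns None, which is what ends the while loop.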
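To try out the loop without spending any quota, here is a minimal offline sketch. FakeRequest and FakeSearch are hypothetical stand-ins that I made up for the objects returned by youtube.search(); they only mimic the behavior relevant here, namely that list_next yields None once a response has no nextPageToken. The helper name collect_videos is likewise an assumption, not part of the client library.

```python
class FakeRequest:
    """Hypothetical stand-in for the request built by youtube.search().list()."""

    def __init__(self, pages, index=0):
        self.pages = pages
        self.index = index

    def execute(self):
        # Return the canned response page this request points at.
        return self.pages[self.index]


class FakeSearch:
    """Hypothetical stand-in for youtube.search(); serves canned pages."""

    def __init__(self, pages):
        self.pages = pages
        self.calls = 0  # requests created; each real Search.list call costs quota

    def list(self, **params):
        self.calls += 1
        return FakeRequest(self.pages)

    def list_next(self, request, response):
        # Mimics the client library: no nextPageToken means no further request.
        if 'nextPageToken' not in response:
            return None
        self.calls += 1
        return FakeRequest(self.pages, request.index + 1)


def collect_videos(search, **params):
    """Page through a Search.list result set and key the snippets by video ID."""
    video_data = {}
    request = search.list(**params)
    while request is not None:
        response = request.execute()
        for item in response.get('items', []):
            video_data[item['id']['videoId']] = item['snippet']
        request = search.list_next(request, response)
    return video_data


pages = [
    {'nextPageToken': 'p2',
     'items': [{'id': {'videoId': 'a'}, 'snippet': {'title': 'First'}}]},
    {'items': [{'id': {'videoId': 'b'}, 'snippet': {'title': 'Second'}}]},
]
search = FakeSearch(pages)
videos = collect_videos(search, channelId='CHANNEL_ID', type='video')
print(sorted(videos), search.calls)  # ['a', 'b'] 2
```

With a real client, the same helper would presumably be called as collect_videos(youtube.search(), channelId=CHANNEL_ID, ...), passing the full set of request parameters only once.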


Update and Fix

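After further testing and investigation of calling Search.list with the request parameters forMine, publishedAfter and publishedBefore, I've reached the following conclusions:

  • The parameter forMine=True without either of the parameters publishedAfter and publishedBefore makes the API call work as expected;

  • The parameter forMine=True given together with either of the parameters publishedAfter and publishedBefore, or with both, produces an HTTP error 400 Bad Request along with the JSON error response: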

{
  "error": {
    "code": 400,
    "message": "Request contains an invalid argument.",
    "errors": [
      {
        "message": "Request contains an invalid argument.",
        "domain": "global",
        "reason": "badRequest"
      }
    ],
    "status": "INVALID_ARGUMENT"
  }
}

Google's own issue tracker records a very recent bug report that describes precisely the behavior above. The official response from Google's staff is as follows:

Status: Won't Fix (Intended Behavior)

This is working as intended. Basically you can only set one of the resource filters if it's a for_content_owner request, but both channel ID and published after are resource filters. This requirement doesn't seem to be specified on the developer website: https://developers.google.com/youtube/v3/docs/search/list.
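Given this restriction, one possible workaround (my own assumption, not something suggested in the bug report) is to call Search.list with forMine=True and no date parameters, then keep only the items whose snippet.publishedAt falls in the wanted year. The sketch below shows only the client-side filter, on made-up sample items; the helper name published_in_year is hypothetical.

```python
from datetime import datetime, timezone


def published_in_year(item, year):
    """Return True if the item's snippet.publishedAt falls within `year` (UTC)."""
    stamp = item['snippet']['publishedAt']
    # datetime.fromisoformat() rejects a trailing 'Z' before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    published = datetime.fromisoformat(stamp.replace('Z', '+00:00'))
    return published.astimezone(timezone.utc).year == year


# Made-up items shaped like Search.list results.
items = [
    {'id': {'videoId': 'a'}, 'snippet': {'publishedAt': '2018-12-31T23:59:59Z'}},
    {'id': {'videoId': 'b'}, 'snippet': {'publishedAt': '2019-06-15T08:30:00Z'}},
    {'id': {'videoId': 'c'}, 'snippet': {'publishedAt': '2020-01-01T00:00:00Z'}},
]

videos_2019 = {
    item['id']['videoId']: item['snippet']
    for item in items
    if published_in_year(item, 2019)
}
print(list(videos_2019))  # ['b']
```

The trade-off is that every video of the channel is paged through, so the quota cost grows with the total number of videos rather than with the number published in 2019.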