YouTube comments extractor loops infinitely if there are too many comments

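I wrote a script that extracts the comments of a YouTube video, given its video ID, and stores them in a file. If the video has fewer than 10-15 comments, the script works fine, but with more comments it gets stuck in an infinite loop, and I don't understand why.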

from googleapiclient.discovery import build 
import os
api_key = '...'

def video_comments(video_id): 
    # empty file for storing comments
    outputFile = open("comments_"+video_id+".txt", "w", encoding='utf-8')

    # empty list to store the data (one dict per comment thread)
    commentsDict = []

    # empty list for storing reply 
    replies = [] 

    # creating youtube resource object 
    youtube = build('youtube', 'v3', 
                    developerKey=api_key) 

    # retrieve youtube video results 
    video_response=youtube.commentThreads().list( 
    part='snippet,replies', 
    videoId=video_id 
    ).execute() 

    # iterate video response 
    while video_response: 
        
        # extracting required info 
        # from each result object 
        for item in video_response['items']: 
            # Extracting comments 
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay'] 
            commentEntrie = {"comment": comment, 'replies': []}
            
            # counting number of reply of comment 
            replycount = item['snippet']['totalReplyCount'] 

            # if reply is there 
            if replycount>0: 
                
                # iterate through all reply 
                for reply in item['replies']['comments']: 
                    
                    # Extract reply 
                    reply = reply['snippet']['textDisplay'] 
                    
                    # Store reply in list
                    replies.append(reply) 
                    commentEntrie['replies'].append(reply)
                    
            # print comment with list of reply 
            print(comment, replies, end = '\n\n')
            outputFile.write("%s" % comment)
            outputFile.write("%s\n" % replies)
            commentsDict.append(commentEntrie)
            # empty reply list 
            replies = [] 

        # Again repeat 
        if 'nextPageToken' in video_response: 
            video_response = youtube.commentThreads().list( 
                    part = 'snippet,replies', 
                    videoId = video_id 
                ).execute() 
        else: 
            break
    outputFile.close()
    print(commentsDict)

# Enter video id 
video_id = "aDHYbM9OqUc" 

# Call function 
video_comments(video_id)  

I can provide two video IDs: LVgKlfw4DHc works fine, but aDHYbM9OqUc ends in an infinite loop. Any ideas?

[Edit] It looks like nextPageToken is always present in the response, so the while loop keeps going forever.

Your loop while video_response: is infinite because of this piece of code:

if 'nextPageToken' in video_response: 
    video_response = youtube.commentThreads().list( 
        part = 'snippet,replies', 
        videoId = video_id 
    ).execute() 
else: 
    break

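If the first video_response contains the property nextPageToken, then the call to CommentThreads.list inside the loop is identical to the one outside the loop. The second call therefore returns exactly the same video_response as the previous one, which again contains nextPageToken, and so on without end.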

The correct implementation is:

if 'nextPageToken' in video_response: 
    video_response = youtube.commentThreads().list( 
        pageToken = video_response['nextPageToken'],
        part = 'snippet,replies', 
        videoId = video_id 
    ).execute() 
else: 
    break
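
The difference between the two versions can be reproduced offline. The following is only a minimal sketch: fake_list and PAGES are hypothetical stand-ins for youtube.commentThreads().list(...).execute() and its paged results, not part of the Google client library, and the first loop is capped so that it terminates.

```python
# Hypothetical stand-in for youtube.commentThreads().list(...).execute().
# It serves three fixed pages and returns the first page whenever
# no pageToken is passed (an assumption made for this illustration).
PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "p2"},
    "p2": {"items": ["c3", "c4"], "nextPageToken": "p3"},
    "p3": {"items": ["c5"]},  # last page: no nextPageToken
}


def fake_list(pageToken=None):
    return PAGES[pageToken]


def fetch_without_token(max_calls):
    """Mimic the question's loop, capped at max_calls iterations."""
    seen = []
    response = fake_list()
    for _ in range(max_calls):
        seen.append(response["items"])
        if "nextPageToken" not in response:
            break
        # Same request as before, so the same first page comes back.
        response = fake_list()
    return seen


def fetch_with_token():
    """Pass the token along, so every call advances to the next page."""
    seen = []
    response = fake_list()
    while True:
        seen.append(response["items"])
        token = response.get("nextPageToken")
        if token is None:
            break
        response = fake_list(pageToken=token)
    return seen
```

Here fetch_without_token(3) returns the first page three times, while fetch_with_token() walks through all three pages once and stops.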

Since you're using Google's APIs Client Library for Python, the pythonic way of implementing result set pagination on the CommentThreads.list API endpoint looks like this:

request = youtube.commentThreads().list(
    part = 'snippet,replies', 
    videoId = video_id 
)

while request:
    response = request.execute()

    for item in response['items']:
        ...

    request = youtube.commentThreads().list_next(
        request, response)

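It is this simple because of the way the Python client library is implemented: there is no need to explicitly handle the API response object's property nextPageToken or the API request parameter pageToken at all.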
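
Putting this together with the extraction logic from the question, a fuller version might look like the sketch below. The function name collect_comment_threads and the choice to pass the youtube resource in as a parameter are assumptions made for illustration. It also assumes that the replies key may be absent for threads without replies, so it reads that key defensively.

```python
def collect_comment_threads(youtube, video_id):
    """Return a list of {"comment": str, "replies": [str, ...]} dicts.

    `youtube` is expected to be the resource object returned by
    googleapiclient.discovery.build('youtube', 'v3', developerKey=...).
    """
    threads = []
    request = youtube.commentThreads().list(
        part="snippet,replies",
        videoId=video_id,
    )

    while request is not None:
        response = request.execute()

        for item in response.get("items", []):
            top = item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
            # Assumption: "replies" may be missing when there are none.
            reply_items = item.get("replies", {}).get("comments", [])
            replies = [r["snippet"]["textDisplay"] for r in reply_items]
            threads.append({"comment": top, "replies": replies})

        # list_next returns None once there are no further pages,
        # which ends the loop.
        request = youtube.commentThreads().list_next(request, response)

    return threads
```

With a real client, the call would be collect_comment_threads(build('youtube', 'v3', developerKey=api_key), video_id); writing the result to a file can then happen after the loop instead of inside it.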