YouTube comments extractor infinite loop if there are too many comments
I wrote a script to extract the comments of a YouTube video and store them in a file for a given video ID. If the video has fewer than 10-15 comments there is no problem and the script works fine, but with more comments it ends up in an infinite loop and I don't understand why.
from googleapiclient.discovery import build
import os

api_key = '...'

def video_comments(video_id):
    # empty file for storing comments
    outputFile = open("comments_"+video_id+".txt", "w", encoding='utf-8')
    # empty dictionary to store the data
    commentsDict = []
    # empty list for storing replies
    replies = []
    # creating youtube resource object
    youtube = build('youtube', 'v3',
                    developerKey=api_key)
    # retrieve youtube video results
    video_response = youtube.commentThreads().list(
        part='snippet,replies',
        videoId=video_id
    ).execute()
    # iterate video response
    while video_response:
        # extracting required info
        # from each result object
        for item in video_response['items']:
            # Extracting comments
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            commentEntrie = {"comment": comment, 'replies': []}
            # counting number of replies of comment
            replycount = item['snippet']['totalReplyCount']
            # if reply is there
            if replycount > 0:
                # iterate through all replies
                for reply in item['replies']['comments']:
                    # Extract reply
                    reply = reply['snippet']['textDisplay']
                    # Store reply in list
                    replies.append(reply)
                    commentEntrie['replies'].append(reply)
            # print comment with list of replies
            print(comment, replies, end='\n\n')
            outputFile.write("%s" % comment)
            outputFile.write("%s\n" % replies)
            commentsDict.append(commentEntrie)
            # empty reply list
            replies = []
        # Again repeat
        if 'nextPageToken' in video_response:
            video_response = youtube.commentThreads().list(
                part='snippet,replies',
                videoId=video_id
            ).execute()
        else:
            break
    outputFile.close()
    print(commentsDict)

# Enter video id
video_id = "aDHYbM9OqUc"

# Call function
video_comments(video_id)
I can give you two video IDs: this one, LVgKlfw4DHc, works fine, but this one ends up in an infinite loop: aDHYbM9OqUc.
Any ideas?
[EDIT] I think the nextPageToken is always present, so the pagination goes on endlessly.

Your loop while video_response: becomes infinite because of this code:
if 'nextPageToken' in video_response:
    video_response = youtube.commentThreads().list(
        part = 'snippet,replies',
        videoId = video_id
    ).execute()
else:
    break
If the first video_response contains the property nextPageToken, then the invocation of CommentThreads.list inside the loop is identical to the one outside the loop. Consequently, the second invocation gives you a video_response that is identical to the one obtained by the previous invocation.
The correct implementation would be:
if 'nextPageToken' in video_response:
    video_response = youtube.commentThreads().list(
        pageToken = video_response['nextPageToken'],
        part = 'snippet,replies',
        videoId = video_id
    ).execute()
else:
    break
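Put back into the context of your function, a minimal sketch of the corrected pagination loop could look like the following (the pageToken argument is the only change; the per-comment processing shown in your question is elided with '...'):

# First page, exactly as in the question.
video_response = youtube.commentThreads().list(
    part='snippet,replies',
    videoId=video_id
).execute()

while video_response:
    for item in video_response['items']:
        ...  # extract the comment and its replies as before

    # Request the next page, or stop when there is none.
    if 'nextPageToken' in video_response:
        video_response = youtube.commentThreads().list(
            part='snippet,replies',
            videoId=video_id,
            pageToken=video_response['nextPageToken']
        ).execute()
    else:
        break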
Since you're using Google's APIs Client Library for Python, the pythonic way of implementing result set pagination on the CommentThreads.list API endpoint looks like this:
request = youtube.commentThreads().list(
    part = 'snippet,replies',
    videoId = video_id
)

while request:
    response = request.execute()

    for item in response['items']:
        ...

    request = youtube.commentThreads().list_next(
        request, response)
It's as simple as that due to the way the Python client library is implemented: there's no need to explicitly handle the API response object's nextPageToken property or the API's pageToken request parameter at all.
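For reference, here is a minimal sketch of how video_comments could be rewritten around list_next. This is an assumption about how the pieces might fit together, not a drop-in replacement; it keeps the file handling and per-comment processing from your question:

from googleapiclient.discovery import build

api_key = '...'

def video_comments(video_id):
    outputFile = open("comments_" + video_id + ".txt", "w", encoding='utf-8')
    commentsDict = []
    youtube = build('youtube', 'v3', developerKey=api_key)

    # Build the first page request; list_next() derives every following page.
    request = youtube.commentThreads().list(
        part='snippet,replies',
        videoId=video_id
    )

    while request:
        response = request.execute()

        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            replies = []
            if item['snippet']['totalReplyCount'] > 0:
                # 'replies' may be absent, hence the defensive .get() lookups
                for reply in item.get('replies', {}).get('comments', []):
                    replies.append(reply['snippet']['textDisplay'])

            outputFile.write("%s" % comment)
            outputFile.write("%s\n" % replies)
            commentsDict.append({"comment": comment, "replies": replies})

        # list_next() returns None once there is no further page, ending the loop.
        request = youtube.commentThreads().list_next(request, response)

    outputFile.close()
    print(commentsDict)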