How do I avoid getting a sporadic KeyError: 'data' when using the Reddit API in python?

Question

I have the following Python code that works with the Reddit API and looks at the front page and the rising submissions of different subreddits.

from pprint import pprint
import requests
import json
import datetime
import csv
import time

subredditsToScan = ["Arts", "AskReddit", "askscience", "aww", "books", "creepy", "dataisbeautiful", "DIY", "Documentaries", "EarthPorn", "explainlikeimfive", "food", "funny", "gaming", "gifs", "history", "jokes", "LifeProTips", "movies", "music", "pics", "science", "ShowerThoughts", "space", "sports", "tifu", "todayilearned", "videos", "worldnews"]

ofilePosts = open('posts.csv', 'wb')
writerPosts = csv.writer(ofilePosts, delimiter=',')

ofileUrls = open('urls.csv', 'wb')
writerUrls = csv.writer(ofileUrls, delimiter=',')

for subreddit in subredditsToScan:
    front = requests.get(r'http://www.reddit.com/r/' + subreddit + '/.json')
    rising = requests.get(r'http://www.reddit.com/r/' + subreddit + '/rising/.json')

    front.text
    rising.text

    risingData = rising.json()
    frontData = front.json()

    print(len(risingData['data']['children']))
    print(len(frontData['data']['children']))
    for i in range(0, len(risingData['data']['children'])):
        author = risingData['data']['children'][i]['data']['author']
        score = risingData['data']['children'][i]['data']['score']
        subreddit = risingData['data']['children'][i]['data']['subreddit']
        gilded = risingData['data']['children'][i]['data']['gilded']
        numOfComments = risingData['data']['children'][i]['data']['num_comments']
        linkUrl = risingData['data']['children'][i]['data']['permalink']
        timeCreated = risingData['data']['children'][i]['data']['created_utc']

        writerPosts.writerow([author, score, subreddit, gilded, numOfComments, linkUrl, timeCreated])
        writerUrls.writerow([linkUrl])



    for j in range(0, len(frontData['data']['children'])):
        author = frontData['data']['children'][j]['data']['author'].encode('utf-8').strip()
        score = frontData['data']['children'][j]['data']['score']
        subreddit = frontData['data']['children'][j]['data']['subreddit'].encode('utf-8').strip()
        gilded = frontData['data']['children'][j]['data']['gilded']
        numOfComments = frontData['data']['children'][j]['data']['num_comments']
        linkUrl = frontData['data']['children'][j]['data']['permalink'].encode('utf-8').strip()
        timeCreated = frontData['data']['children'][j]['data']['created_utc']

        writerPosts.writerow([author, score, subreddit, gilded, numOfComments, linkUrl, timeCreated])
        writerUrls.writerow([linkUrl])

It works well and scrapes the data accurately, but it keeps getting interrupted, seemingly at random, and crashes at run time with:

Traceback (most recent call last):
  File "dataGather1.py", line 27, in <module>
    for i in range(0, len(risingData['data']['children'])):
KeyError: 'data'

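I have no idea why this error happens intermittently rather than consistently. I thought maybe I was calling the API too often and it was blocking me, so I added a sleep to my code, but that did not help. Any ideas?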

Answer 1

When the API response contains no data, the dictionary has no 'data' key, so you get a KeyError on some subreddits. You need to use try/except.
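A minimal sketch of that try/except, assuming the response has already been parsed into a dict. The helper name extract_posts and both sample payloads are made up for illustration, and the shape of the error payload is an assumption about what Reddit sends when it refuses a request.

```python
def extract_posts(listing):
    """Return one row per post, or an empty list if the listing has no 'data'."""
    try:
        children = listing['data']['children']
    except KeyError:
        # Error payloads carry no 'data' key, so there is nothing to extract.
        return []
    rows = []
    for child in children:
        post = child['data']
        rows.append([post['author'], post['score'], post['subreddit'],
                     post['gilded'], post['num_comments'],
                     post['permalink'], post['created_utc']])
    return rows


# Hypothetical payloads: one normal listing and one assumed error response.
sample_ok = {'data': {'children': [{'data': {
    'author': 'someone', 'score': 10, 'subreddit': 'books', 'gilded': 0,
    'num_comments': 3, 'permalink': '/r/books/comments/abc/', 'created_utc': 1.0}}]}}
sample_error = {'message': 'Too Many Requests', 'error': 429}

rows_ok = extract_posts(sample_ok)
rows_error = extract_posts(sample_error)
```

With this in place the loop over subreddits can skip a failed response instead of crashing, although it still hides why the request failed.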

Answer 2

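The JSON you are parsing does not contain a 'data' element, hence the error. I think your intuition is right: it is probably rate limiting, or you are requesting hidden/deleted entries.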

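Reddit is very strict about clients that do not behave well when accessing their API. That means you should register your application and use a meaningful user agent for your requests, and you should probably use a Python library for this sort of thing: https://praw.readthedocs.io/en/latest/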
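If you stay with plain requests, a rough sketch of sending a user agent and rejecting bad responses could look like this. The function name fetch_listing, the get parameter (there only so a stand-in can replace the network call), and the 10 second timeout are my own choices, not anything Reddit prescribes.

```python
def fetch_listing(url, user_agent, get=None):
    """Fetch a listing with an explicit User-Agent; return None on failure."""
    if get is None:
        import requests  # imported lazily so a stand-in can be injected instead
        get = requests.get
    response = get(url, headers={'User-Agent': user_agent}, timeout=10)
    if response.status_code != 200:
        # For example 429 when the request was rate limited.
        return None
    payload = response.json()
    # Only hand back payloads that actually look like a listing.
    return payload if 'data' in payload else None


# Usage (placeholder user agent, replace with one describing your script):
# front = fetch_listing('http://www.reddit.com/r/books/.json',
#                       'my-reddit-scanner/0.1 (by /u/your_username)')
# if front is not None:
#     print(len(front['data']['children']))
```

Returning None keeps the caller's check simple; logging response.status_code before returning would tell you whether the failures really are rate limiting.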

In my experience, without registration the direct REST Reddit API is (was?) even stricter than their rule of 1 request every 2 seconds.
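To space requests out, something along these lines may help. It is only a sketch: the Throttle class is hypothetical, the 2 second default just mirrors the rule mentioned above, and since unregistered access may be limited more tightly you may need a larger interval.

```python
import time


class Throttle:
    """Make sure successive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be exercised without real delays
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: create one Throttle and call wait() before every requests.get(...).
# throttle = Throttle(min_interval=2.0)
# for subreddit in subredditsToScan:
#     throttle.wait()
#     front = requests.get(...)
#     throttle.wait()
#     rising = requests.get(...)
```

Note that the original loop makes two requests per subreddit, so a single sleep per iteration still sends them back to back.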

Answer 3

Python raises a KeyError whenever a dict() object is accessed (using the format a = adict[key]) and the key is not in the dictionary.

It seems that when you get this error, your data value is empty.
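A small illustration of the difference between indexing and the lookups that do not raise. The listing dict here is an assumed example of an error payload, not something taken from the question's output.

```python
# Assumed shape of a response that lacks the 'data' key.
listing = {'message': 'Too Many Requests', 'error': 429}

try:
    listing['data']            # plain indexing raises when the key is absent
    raised = False
except KeyError:
    raised = True

has_data = 'data' in listing   # membership test never raises

# dict.get returns the fallback instead of raising, so the chain ends in an empty list.
children = listing.get('data', {}).get('children', [])
```

Iterating over children is then safe even for a failed request, because the loop body simply never runs.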

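You could simply try getting the length of the dictionary before executing the for loop. If it is empty, the loop will not run. Some fancy error checking here might help.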

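# An error response has no 'data' key, so look it up with a fallback.
children = risingData.get('data', {}).get('children', [])
size = len(children)
if size:
    for i in range(0, size):
        …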

python

error-handling

runtime

reddit