Index out of range when sending requests in a loop

I'm getting an index out of range error when I try to fetch the number of contributors of a GitHub project in a loop. After a few iterations (which run fine) it just throws that exception. I have no idea why...

    import requests
    from lxml import html

    for x in range(100):
        r = requests.get('https://github.com/tipsy/profile-summary-for-github')  
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number) # prints the correct number until the exception

The exception:

----> 4     contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range

It looks like you are getting a 429 - Too Many Requests, because you are firing off requests one right after another.

You might want to modify your code like this:

import time

import requests
from lxml import html

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')  
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3) # Wait a bit before firing off another request

Even better would be this:

import time

import requests
from lxml import html

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code in [200]:  # Check if the request was successful  
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3) # Wait a bit before firing off another request

GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block too many requests. As a result, the content that is returned no longer matches your XPath query.

You should be using the REST API that GitHub provides to retrieve project stats such as the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times; contributor counts do not change that rapidly.
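
As an illustration, here is a minimal sketch (unauthenticated, so limited to 60 requests per hour): asking the contributors endpoint for one result per page lets you read the total count straight from the pagination Link header, without downloading every page:

import requests
from urllib.parse import parse_qsl, urlparse

# ask for a single contributor per page; the "last" page number then equals the total count
r = requests.get(
    'https://api.github.com/repos/tipsy/profile-summary-for-github/contributors',
    params={'per_page': 1, 'anon': 'true'},
)
r.raise_for_status()
last = r.links.get('last')
if last:
    params = dict(parse_qsl(urlparse(last['url']).query))
    print('Contributor count:', params['page'])
else:
    # no "last" relation means everything fits on a single page
    print('Contributor count:', len(r.json()))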

API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data has actually changed:

import requests
import time
from urllib.parse import parse_qsl, urlparse

owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....'   # optional Github basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'

with requests.session() as sess:
    # GitHub requests you use your username or appname in the header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)
    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'

    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break
        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of current
            # to get accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report on count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)
            # only get us a fresh response next time
            sess.headers['If-None-Match'] = r.headers['ETag']

        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        # guard against a zero remaining quota to avoid dividing by zero
        rate_remaining = max(int(r.headers['X-RateLimit-Remaining']), 1)
        # sleep long enough to honour the rate limit or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))

The above uses a requests session object to handle repeated headers and to ensure that you get to reuse connections where possible.

A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
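
As a rough sketch (the repository() helper and contributors() iterator are taken from the github3.py documentation; verify against the version you install, and note that anonymous access gets only a small quota):

import github3

# anonymous access; use github3.login() for a larger rate limit
repo = github3.repository('tipsy', 'profile-summary-for-github')
# contributors() returns an iterator; anon=True also counts anonymous contributors
contributor_count = sum(1 for _ in repo.contributors(anon=True))
print('Contributor count:', contributor_count)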

If you do want to persist in scraping the site directly, you run the risk of the site operators blocking you altogether. Try to take some responsibility by not hammering the site continually.

That means that, at the very least, you should honour the Retry-After header that GitHub gives you on a 429:

if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    ...  # parse the OK response
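
Putting that together with the original scraping loop might look roughly like this (a sketch only; the 60-second fallback delay is an arbitrary choice, not something GitHub specifies):

import time

import requests
from lxml import html

xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'

for _ in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if not r.ok:
        print("Received a response other than 200 OK:", r.status_code, r.reason)
        # honour the server's requested delay, or back off conservatively
        retry_after = r.headers.get('Retry-After')
        time.sleep(int(retry_after) if retry_after is not None else 60)
        continue
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3)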

This now works perfectly for me, using the API. It is probably the cleanest way to do it:

import requests
import json

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?per_page=100'
response = requests.get(url)
commits = json.loads(response.text)

commits_total = len(commits)
page_number = 1
while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?per_page=100&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
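
The same pagination pattern also answers the original question if you point it at the contributors endpoint instead (a sketch; anon=true includes anonymous contributors, and unauthenticated requests are limited to 60 per hour):

import requests

url = 'https://api.github.com/repos/tipsy/profile-summary-for-github/contributors?per_page=100&anon=true'
contributors_total = 0
page_number = 1
while True:
    response = requests.get(url + '&page=' + str(page_number))
    contributors = response.json()
    contributors_total += len(contributors)
    if len(contributors) < 100:
        break
    page_number += 1

print(contributors_total)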