Python download multiple files from links on pages

Question

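I'm trying to download all the PGNs from this site.

I think I have to use urlopen to open each URL and then use urlretrieve to download each PGN by accessing it from the download button near the bottom of each game. Do I have to create a new BeautifulSoup object for each game? I'm also unsure of how urlretrieve works.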

import urllib
from urllib.request import urlopen, urlretrieve, quote
from bs4 import BeautifulSoup

url = 'http://www.chessgames.com/perl/chesscollection?cid=1014492'
u = urlopen(url)
html = u.read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    urlopen('http://chessgames.com'+link.get('href'))

Answer 1

There is no short answer to your question. I will show you a complete solution and comment on the code.

First, import the necessary modules:

from bs4 import BeautifulSoup
import requests
import re

Next, fetch the index page and create a BeautifulSoup object:

req = requests.get("http://www.chessgames.com/perl/chesscollection?cid=1014492")
soup = BeautifulSoup(req.text, "lxml")

I strongly recommend using the lxml parser rather than the more common html.parser. After that, prepare the list of links to the games:

pages = soup.find_all('a', href=re.compile(r'.*chessgame\?.*'))
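
As a rough illustration of what this pattern selects, the sketch below applies it to a few made-up href values (the paths are assumptions for illustration, not taken from the site). BeautifulSoup filters against a compiled pattern with its search() method, so the leading and trailing .* are optional.

```python
import re

# Same pattern as in the answer, written as a raw string
game_link = re.compile(r'.*chessgame\?.*')

# Hypothetical href values, for illustration only
hrefs = [
    '/perl/chessgame?gid=1000001',
    '/perl/chesscollection?cid=1014492',
    '/perl/chessplayer?pid=10',
]

# search() is the method BeautifulSoup applies to attribute values
matching = [h for h in hrefs if game_link.search(h)]
print(matching)

# The shorter pattern selects the same links when used with search()
short_pattern = re.compile(r'chessgame\?')
matching_short = [h for h in hrefs if short_pattern.search(h)]
```

Only the first href contains the literal text "chessgame?", so it is the only one kept.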

You can do this by searching for links that contain the word 'chessgame'. Now, prepare a function that will download the files for you:

def download_file(url):
    path = url.split('/')[-1].split('?')[0]
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            for chunk in r:
                f.write(chunk)
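
The path line keeps the last path segment of the URL and drops the query string. A minimal sketch of the same idea using urllib.parse from the standard library follows; the helper name filename_from_url and the sample URL are assumptions for illustration.

```python
import posixpath
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last path segment of a URL, ignoring any query string."""
    # urlsplit separates the path from the query, so '?' is not handled by hand
    return posixpath.basename(urlsplit(url).path)

# Hypothetical download URL, for illustration only
example = 'http://www.chessgames.com/pgn/example_game.pgn?gid=1000001'

from_helper = filename_from_url(example)
from_split = example.split('/')[-1].split('?')[0]  # the expression used in the answer
print(from_helper)
print(from_split)
```

Both expressions give the same file name for a URL of this shape; the urlsplit version just makes the intent explicit.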

The final bit of magic is to repeat all the previous steps for each game page, preparing the links for the file downloader:

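host = 'http://www.chessgames.com'
for page in pages:
    # Fetch and parse each game page
    url = host + page.get('href')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    # Find the link whose text contains 'download' (string= is the current name of text=)
    file_link = soup.find('a', string=re.compile('.*download.*'))
    if file_link is None:
        continue  # no download link on this page, so skip it
    file_url = host + file_link.get('href')
    download_file(file_url)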

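(First, search for the link whose text contains 'download', then build the full URL by concatenating the hostname and the path, and finally download the file.)

I hope you can use this code without modification!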
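
Concatenating the host and the href works as long as every href is a root-relative path. As a hedged alternative, urllib.parse.urljoin from the standard library also copes with hrefs that are already absolute; the sample hrefs below are made up for illustration.

```python
from urllib.parse import urljoin

host = 'http://www.chessgames.com'

# Hypothetical href values, for illustration only
relative_href = '/perl/chessgame?gid=1000001'
absolute_href = 'http://www.chessgames.com/pgn/example_game.pgn?gid=1000001'

# A root-relative href is resolved against the host
joined_relative = urljoin(host, relative_href)
# An absolute href is returned unchanged instead of being prefixed a second time
joined_absolute = urljoin(host, absolute_href)

print(joined_relative)
print(joined_absolute)
```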

Answer 2

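Answer 1 is fantastic, but the task is embarrassingly parallel; there is no need to retrieve these sub-pages and files one at a time. This answer shows how to speed things up.

The first step is to use requests.Session() when sending multiple requests to a single host. Quoting Advanced Usage: Session Objects from the requests docs: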

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).

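Next, asyncio, multiprocessing, or multithreading can be used to parallelize the workload. Each has tradeoffs depending on the task at hand, and which one you choose is probably best determined by benchmarking and profiling. This page offers great examples of all three.

For the purposes of this post, I'll show multithreading. The impact of the GIL shouldn't be too much of a bottleneck because the tasks are mostly IO-bound, consisting of babysitting in-flight requests while waiting for the response. When a thread is blocked on IO, it can yield to a thread that is parsing HTML or doing other CPU-bound work.

Here's the code: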
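
To see why threads help with IO-bound work, here is a small offline sketch. The function fake_download and the 0.1 second delay are assumptions standing in for a real network request; time.sleep releases the GIL in the same way a blocking socket read does, so the waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(task_id):
    """Stand-in for a network request: sleeping releases the GIL like blocking IO."""
    time.sleep(0.1)
    return task_id

def run(max_workers, n_tasks=8):
    """Run n_tasks fake downloads and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the iterator, which also re-raises any worker exception
        results = list(pool.map(fake_download, range(n_tasks)))
    return results, time.perf_counter() - start

serial_results, serial_time = run(max_workers=1)
threaded_results, threaded_time = run(max_workers=8)
print(f"1 worker:  {serial_time:.2f}s")
print(f"8 workers: {threaded_time:.2f}s")
```

One detail worth knowing: Executor.map returns an iterator, and an exception raised in a worker is only re-raised when its result is retrieved from that iterator. If the results are never consumed, failures pass silently.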

import os
import re
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

def download_pgn(task):
    session, host, page, destination_path = task
    response = session.get(host + page)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    game_url = host + soup.find("a", string="download").get("href")
    filename = re.search(r"\w+\.pgn", game_url).group()
    path = os.path.join(destination_path, filename)
    response = session.get(game_url, stream=True)
    response.raise_for_status()

    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)

if __name__ == "__main__":
    host = "http://www.chessgames.com"
    url_to_scrape = host + "/perl/chesscollection?cid=1014492"
    destination_path = "pgns"
    max_workers = 8

    # Create the output directory if it does not already exist
    os.makedirs(destination_path, exist_ok=True)

    with requests.Session() as session:
        response = session.get(url_to_scrape)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "lxml")
        pages = soup.find_all("a", href=re.compile(r".*chessgame\?.*"))
        tasks = [
            (session, host, page.get("href"), destination_path) 
            for page in pages
        ]

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            pool.map(download_pgn, tasks)

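I used response.iter_content here, which is unnecessary for such tiny text files, but it is a generalization so that the code will handle larger files in a memory-friendly way.

Results from a rough benchmark (the first request is the bottleneck):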
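
The chunked loop can be sketched without a network connection. In the example below, the in-memory stream and the helper names iter_chunks and save_stream are assumptions standing in for the HTTP response and response.iter_content; the point is only that a single chunk is held in memory at a time.

```python
import io
import os
import tempfile

def iter_chunks(stream, chunk_size=1024):
    """Yield successive chunks of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk

def save_stream(stream, path, chunk_size=1024):
    """Write a binary stream to path one chunk at a time; return bytes written."""
    written = 0
    with open(path, "wb") as f:
        for chunk in iter_chunks(stream, chunk_size):
            f.write(chunk)
            written += len(chunk)
    return written

# Hypothetical payload standing in for a downloaded PGN file
payload = b"1. e4 e5 2. Nf3 Nc6 " * 500
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.pgn")
    size = save_stream(io.BytesIO(payload), target)
    with open(target, "rb") as f:
        saved = f.read()
print(size == len(payload))
```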

max workers	session?	seconds
1	no	126
1	yes	111
8	no	24
8	yes	22
32	yes	16
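
To make the table easier to compare, the timings can be expressed as speedups over the slowest configuration (1 worker, no session). This is plain arithmetic on the figures in the table, not an additional measurement.

```python
# (max workers, session used, seconds) copied from the benchmark table
results = [
    (1, False, 126),
    (1, True, 111),
    (8, False, 24),
    (8, True, 22),
    (32, True, 16),
]

baseline = results[0][2]  # 1 worker, no session
speedups = {
    (workers, session): baseline / seconds
    for workers, session, seconds in results
}

for (workers, session), factor in speedups.items():
    label = "session" if session else "no session"
    print(f"{workers:>2} workers, {label:<10}: {factor:.1f}x")
```

On these numbers, most of the gain comes from adding workers; the session contributes a smaller improvement at each worker count.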
