Effectively requesting and processing multiple HTML files with Python

I am writing a tool that fetches multiple HTML files and processes them as text:

import requests

for url in url_list:
    url_response = requests.get(url)
    text = url_response.text
    # Process text here (put in database, search, etc)

The problem is that this is too slow. If all I needed was a simple response I could use grequests, but since I need the content of the HTML files, that doesn't seem to be an option. How can I speed this up?

Thanks!

You need to use threads and issue each requests.get(...) call in a different thread, i.e. fetch the URLs in parallel.

See the examples and usage in these two answers on SO:

  • Python - very simple multithreading parallel URL fetching (without queue)

Using one thread per request:

import threading
import requests

url_list = ["url1", "url2"]

def fetch_url(url):
    url_response = requests.get(url)
    text = url_response.text
    # Process text here (put in database, search, etc)

# Start one thread per URL, then wait for all of them to finish
threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
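
An alternative is to fetch and process each URL in a separate worker process with multiprocessing.Pool:
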
import requests
from multiprocessing import Pool

def process_html(url):
    url_response = requests.get(url)
    text = url_response.text
    print(text[:500])
    print('-' * 30)

urls = [
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
]

with Pool(None) as p:  # None => use as many worker processes as os.cpu_count()
    p.map(process_html, urls)  # Blocks until every call to process_html() has returned
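
Since the work is I/O-bound (mostly waiting on the network), a thread pool is often a lighter-weight option than separate processes. Here is a minimal sketch using the standard-library concurrent.futures module; the max_workers value and the fetch_and_process helper are illustrative choices, not part of the answers above:

import concurrent.futures
import requests

url_list = ["url1", "url2"]

def fetch_and_process(url):
    text = requests.get(url).text
    # Process text here (put in database, search, etc)
    return len(text)

# A few dozen threads is usually plenty for network-bound fetching;
# executor.map blocks until every URL has been fetched and processed.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for url, result in zip(url_list, executor.map(fetch_and_process, url_list)):
        print(url, result)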