Effectively requesting and processing multiple HTML files with Python

I am writing a tool that fetches multiple HTML files and processes them as text:

import requests

for url in url_list:
    url_response = requests.get(url)
    text = url_response.text
    # Process text here (put in database, search, etc)

The problem is that this is too slow. If all I needed was a simple response I could use grequests, but since I need the content of the HTML files, that doesn't seem to be an option. How can I speed this up?

Thanks!

You need to use threads and issue each requests.get(...) call in a different thread, i.e. fetch the URLs in parallel.

See the examples and usage in these two answers on SO:

  • Python - very simple multithreading parallel URL fetching (without queue)

Using one thread per request:

import threading
import requests

url_list = ["url1", "url2"]

def fetch_url(url):
    url_response = requests.get(url)
    text = url_response.text
    # Process text here (put in database, search, etc)

# Start one thread per URL, then wait for all of them to finish
threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
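
An alternative is to fetch and process each URL in a separate worker process with multiprocessing.Pool:
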
import requests
from multiprocessing import Pool

def process_html(url):
    url_response = requests.get(url)
    text = url_response.text
    print(text[:500])
    print('-' * 30)

urls = [
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
]

with Pool(None) as p:  # None => use as many worker processes as os.cpu_count()
    p.map(process_html, urls)  # Blocks until every call to process_html() has returned
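
Since the work is I/O-bound (mostly waiting on the network), a thread pool is often a lighter-weight option than separate processes. Here is a minimal sketch using the standard-library concurrent.futures module; the max_workers value and the fetch_and_process helper are illustrative choices, not part of the answers above:

import concurrent.futures
import requests

url_list = ["url1", "url2"]

def fetch_and_process(url):
    text = requests.get(url).text
    # Process text here (put in database, search, etc)
    return len(text)

# A few dozen threads is usually plenty for network-bound fetching;
# executor.map blocks until every URL has been fetched and processed.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for url, result in zip(url_list, executor.map(fetch_and_process, url_list)):
        print(url, result)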