Effectively requesting and processing multiple HTML files with Python
I'm writing a tool that fetches multiple HTML files and processes them as text:
for url in url_list:
    url_response = requests.get(url)
    text = url_response.text
    # Process text here (put in database, search, etc)
The problem is that this is too slow. If all I needed was a simple response I could use grequests, but since I need the content of the HTML files, that doesn't seem to be an option. How can I speed it up?
Thanks!
You need to use threading and put each requests.get(...) call
in a different thread, so that the URLs are fetched in parallel.
See these two answers on SO for examples and usage:
- Python - very simple multithreading parallel URL fetching (without queue)
Use one thread per request:
import threading

import requests

url_list = ["url1", "url2"]

def fetch_url(url):
    url_response = requests.get(url)
    text = url_response.text

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
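Note that the snippet above throws away the downloaded text once each thread finishes. If you need the results back on the main thread, a minimal sketch using concurrent.futures can collect them in input order (the helper names fetch_text and fetch_all are illustrative, not from the original answer):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def fetch_text(url):
    # Hypothetical helper: download one URL and return its body as text
    return requests.get(url).text

def fetch_all(urls, fetch=fetch_text, max_workers=10):
    # executor.map preserves input order and blocks until every task finishes
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))

# texts = fetch_all(url_list)
# for text in texts:
#     ...  # process text here (put in database, search, etc)
```

Passing the fetch function as a parameter also makes the pool logic easy to test without hitting the network.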
import requests
from multiprocessing import Pool

def process_html(url):
    url_response = requests.get(url)
    text = url_response.text
    print(text[:500])
    print('-' * 30)

urls = [
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
]

with Pool(None) as p:  # None => uses os.cpu_count() worker processes
    p.map(process_html, urls)  # blocks until all return values from process_html() have been collected
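Since fetching URLs is I/O-bound rather than CPU-bound, worker threads are usually enough and are cheaper to start than processes. multiprocessing.dummy exposes the same Pool API backed by threads, so the code above barely changes; a sketch under that assumption (map_parallel is an illustrative name):

```python
import requests
from multiprocessing.dummy import Pool  # thread-backed Pool, same API as multiprocessing.Pool

def map_parallel(func, items, workers=8):
    # Run func over items in a pool of worker threads;
    # blocks until all results are collected, in input order
    with Pool(workers) as pool:
        return pool.map(func, items)

# texts = map_parallel(lambda url: requests.get(url).text, urls)
```

Because the workers are threads, the function passed to pool.map does not need to be picklable, so lambdas work here.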