Python requests with multithreading
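For the last two days I have been trying to build a scraper with multithreading, and somehow I still can't get it to work. At first I tried the regular multithreading approach with the threading module, but it was no faster than using a single thread. Later I learned that requests is blocking and that the multithreading approach wasn't really working. So I kept researching and found out about grequests and gevent. Now I'm running tests with gevent, and it is still no faster than using a single thread. Is my code wrong?
Here is the relevant part of my class: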
import gevent.monkey
from gevent.pool import Pool
import requests

gevent.monkey.patch_all()

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        try:
            response = self.session.get(url, headers=self.headers)
        except:
            self.logger.error('Problem: ', id, exc_info=True)

        self.doSomething(response)

    def async(self):
        for url in self.urls:
            self.pool.spawn( self.fetch, url )

        self.pool.join()

test = Test()
test.async()
Install the grequests module, which works with gevent (requests is not designed for async):
pip install grequests
Then change the code to something like this:
import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com',
            'http://www.yahoo.com',
            'http://www.whosebug.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        # Called by grequests.map for every request that raised.
        print("Problem: {}: {}".format(request.url, exception))

    # Named run_async because `async` is a reserved word in Python 3.7+.
    def run_async(self):
        # size=5 caps the number of requests in flight at once.
        results = grequests.map((grequests.get(u) for u in self.urls), exception_handler=self.exception, size=5)
        print(results)

test = Test()
test.run_async()
This is what the requests project officially recommends:
Blocking Or Non-Blocking?
With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.
If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
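The other project named in that passage, requests-futures, is built on the standard library's concurrent.futures. As a rough sketch of the same thread-pool idea using only the standard library: fetch_all and fake_fetch are hypothetical names, not part of requests or requests-futures, and fake_fetch only stands in for a real blocking call such as requests.get.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=5):
    """Call fetch(url) for every URL on a thread pool.

    Returns (results, errors), two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; each blocking call runs in a worker thread.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing URL should not stop the others.
                errors[url] = exc
    return results, errors


def fake_fetch(url):
    # Hypothetical stand-in for a real blocking request; returns a status code.
    if "bad" in url:
        raise ValueError("cannot fetch " + url)
    return 200


if __name__ == "__main__":
    ok, failed = fetch_all(["http://www.example.com", "http://bad.example"], fake_fetch)
    print(ok, failed)
```

In a real scraper, fake_fetch would be replaced by a function that performs the HTTP request. Passing the callable in keeps the pooling logic separate from the network code, which makes it easy to try out offline.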
With the grequests version I get a noticeable performance increase for 10 URLs: 0.877s versus 3.852s with your original method.
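Numbers like these depend on the network and on the sites being fetched, so it helps to time both variants the same way. The sketch below shows one way to set up such a comparison offline. It is not the benchmark used above: slow_call is a hypothetical stand-in that sleeps instead of making an HTTP request, and a thread pool is assumed as the concurrent variant.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(delay):
    # Hypothetical stand-in for one blocking HTTP request.
    time.sleep(delay)
    return delay


def timed(func):
    """Run func() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def run_sequential(delays):
    # One call after another: total time is roughly the sum of the delays.
    return [slow_call(d) for d in delays]


def run_concurrent(delays, workers=10):
    # Calls overlap while they wait: total time is roughly the longest delay.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_call, delays))


if __name__ == "__main__":
    delays = [0.1] * 10
    _, seq = timed(lambda: run_sequential(delays))
    _, con = timed(lambda: run_concurrent(delays))
    print("sequential: {:.3f}s, concurrent: {:.3f}s".format(seq, con))
```

Replacing slow_call with the real fetch function gives a like-for-like measurement on your own URLs.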