Python requests with multithreading

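I've been trying for the past two days to build a scraper with multithreading, and I still can't get it to work. At first I tried the regular multithreading approach with the threading module, but it wasn't any faster than a single thread. Later I learned that requests is blocking, so the multithreading approach wasn't really working. I kept researching and found out about grequests and gevent. Now I'm running tests with gevent, and it's still not faster than a single thread. Is my code wrong?

Here is the relevant part of my class: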

import gevent.monkey
from gevent.pool import Pool
import requests

gevent.monkey.patch_all()

class Test:
    def __init__(self):
        self.session = requests.Session()
        self.pool = Pool(20)
        self.urls = [...urls...]

    def fetch(self, url):
        try:
            response = self.session.get(url, headers=self.headers)
        except Exception:
            # Log the failing URL and skip it; `response` is unset here.
            self.logger.error('Problem: %s', url, exc_info=True)
            return

        self.doSomething(response)

    def run_async(self):  # `async` is a reserved word since Python 3.7
        for url in self.urls:
            self.pool.spawn(self.fetch, url)

        self.pool.join()

test = Test()
test.run_async()

Install the grequests module, which works with gevent (requests is not designed for async):

pip install grequests

Then change the code to something like this:

import grequests

class Test:
    def __init__(self):
        self.urls = [
            'http://www.example.com',
            'http://www.google.com', 
            'http://www.yahoo.com',
            'http://www.whosebug.com/',
            'http://www.reddit.com/'
        ]

    def exception(self, request, exception):
        print("Problem: {}: {}".format(request.url, exception))

    def run_async(self):  # `async` is a reserved word since Python 3.7
        # size=5 caps the number of requests in flight at once
        results = grequests.map((grequests.get(u) for u in self.urls), exception_handler=self.exception, size=5)
        print(results)

test = Test()
test.run_async()
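
One detail worth handling: as far as I know, grequests.map returns one entry per request, in request order, and a failed request shows up as None when the exception handler returns nothing (as the handler above does). Here is a minimal sketch of pairing the results back with their URLs. split_results is a hypothetical helper name, and it only assumes each response has a status_code attribute:

```python
def split_results(urls, results):
    """Pair each URL with its result and separate successes from failures.

    `results` is expected to look like the list returned by grequests.map():
    one entry per request, in request order, with None for a failed request.
    """
    succeeded, failed = {}, []
    for url, response in zip(urls, results):
        if response is None:
            # The request raised; the exception handler already reported it.
            failed.append(url)
        else:
            succeeded[url] = response.status_code
    return succeeded, failed
```

Inside the class this could be called as succeeded, failed = split_results(self.urls, results), which makes it easy to retry only the URLs that failed.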

This is officially recommended by the requests project:

Blocking Or Non-Blocking?

With the default Transport Adapter in place, Requests does not provide any kind of non-blocking IO. The Response.content property will block until the entire response has been downloaded. If you require more granularity, the streaming features of the library (see Streaming Requests) allow you to retrieve smaller quantities of the response at a time. However, these calls will still block.

If you are concerned about the use of blocking IO, there are lots of projects out there that combine Requests with one of Python's asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
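
requests-futures, mentioned in the quote, builds on the standard library's concurrent.futures thread pools. Here is a minimal sketch of the same idea, under some assumptions: fetch_all is a hypothetical helper, and the fetch callable is passed in so the example runs without network access. In real use you might pass requests.get or a session's get method.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=5):
    """Call fetch(url) for every URL in a thread pool.

    Returns (results, errors): two dicts keyed by URL, holding either the
    return value of fetch or the exception it raised.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be attributed.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return results, errors
```

Threads work here because a blocking socket read releases the GIL, so the waiting overlaps even though only one thread runs Python code at a time.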

Using this method gives me a noticeable performance increase with 10 URLs: 0.877s versus 3.852s with your original method.
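
Timings like these depend heavily on the network and the servers involved, so it is worth measuring with your own URLs. Below is a rough sketch of such a comparison. Note the assumptions: it uses time.sleep as a stand-in for a blocking request and a standard-library thread pool rather than gevent, so that it runs anywhere, and slow_call, timed, run_sequential and run_concurrent are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(item, delay=0.05):
    """Stand-in for a blocking HTTP request: sleeps, then returns the item."""
    time.sleep(delay)
    return item


def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def run_sequential(items):
    # One call after another: total time is roughly len(items) * delay.
    return [slow_call(item) for item in items]


def run_concurrent(items, max_workers=10):
    # Calls overlap while they wait; pool.map keeps the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slow_call, items))


if __name__ == "__main__":
    items = list(range(10))
    _, sequential_time = timed(run_sequential, items)
    _, concurrent_time = timed(run_concurrent, items)
    print("sequential: {:.3f}s, concurrent: {:.3f}s".format(
        sequential_time, concurrent_time))
```

To measure real requests, slow_call would be replaced by the actual fetch function, keeping the same URLs for both runs.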