Scrapy with Privoxy and Tor: how to renew IP

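I am working with Scrapy, Privoxy and Tor. I have everything installed and working properly. But Tor connects with the same IP every time, so I get banned easily. Is it possible to tell Tor to reconnect every X seconds or every X connections?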

Thanks!

EDIT about the configuration: For the user agent pool I followed http://tangww.com/2013/06/UsingRandomAgent/ (I had to add an __init__.py file, as mentioned in the comments), and for Privoxy and Tor I followed http://www.andrewwatters.com/privoxy/ (I had to create the private user and private group manually from the terminal). It worked :)

My spider looks like this:

from scrapy.spiders import CrawlSpider  # scrapy.contrib.* is the old, deprecated path
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "spider_name"
    start_urls = [
        'https://example.com/listviews/titles.php',
    ]
    allowed_domains = ["example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="tab7"]/article/header/h2/a/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Then follow the next-page link (the li after li.presente in ul.pagin) and start again
        next_page = response.css('ul.pagin li.presente ~ li a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('main#main'):
            yield {
                'editor': each_book.css('header.datos1 > ul > li > h5 > a::text').extract(),
            }

In settings.py I have user agent rotation and Privoxy:

DOWNLOADER_MIDDLEWARES = {
    # user agent: disable the built-in middleware, enable the rotating one
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'spider_name.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
    # privoxy: ProxyMiddleware (100) sets request.meta['proxy'] before HttpProxyMiddleware (110) runs
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'spider_name.middlewares.ProxyMiddleware': 100,
}

In middlewares.py I added:

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

I think that's all...

EDIT II ---

OK, I changed my middlewares.py file as the blog post linked by @Tomáš Linhart says:

from stem import Signal
from stem.control import Controller

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

    def set_new_ip():
        with Controller.from_port(port=9051) as controller:
            controller.authenticate(password='tor_password')
            controller.signal(Signal.NEWNYM)

But now it is really slow, and it does not seem to change the IP... Did I do it right, or is something wrong?

This blog post may help you, as it deals with the same issue.

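EDIT: Depending on your requirement (a new IP for every request, or after N requests), place the call to set_new_ip appropriately in the middleware's process_request method. Note, however, that a call to set_new_ip does not always guarantee a new IP (there is a link to the FAQ with an explanation).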

EDIT2: The module with the ProxyMiddleware class would look like this:

from stem import Signal
from stem.control import Controller

def _set_new_ip():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='tor_password')
        controller.signal(Signal.NEWNYM)

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        _set_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
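
Signalling NEWNYM on every single request is a likely reason for the slowdown described in the question, since Tor has to build fresh circuits each time and rate-limits how often it honors the signal. Below is a minimal sketch of the "after N requests" variant. NewnymThrottle, tick, every and min_interval are names made up for this illustration, and the 10 second default is an assumption about Tor's rate limit rather than a value taken from the post.

```python
import time


class NewnymThrottle:
    """Call `renew` on every `every`-th request, but never more often
    than once per `min_interval` seconds (hypothetical helper).

    `renew` is any zero-argument callable, e.g. the _set_new_ip
    function from the module above.
    """

    def __init__(self, renew, every=10, min_interval=10.0, clock=time.monotonic):
        self._renew = renew
        self._every = every
        self._min_interval = min_interval  # assumed NEWNYM rate limit
        self._clock = clock
        self._count = 0
        self._last_renewal = None

    def tick(self):
        """Register one outgoing request; return True if `renew` was called."""
        self._count += 1
        if self._count < self._every:
            return False
        now = self._clock()
        too_soon = (
            self._last_renewal is not None
            and now - self._last_renewal < self._min_interval
        )
        if too_soon:
            return False  # keep the count, try again on the next request
        self._renew()
        self._last_renewal = now
        self._count = 0
        return True


# Possible wiring inside middlewares.py (not run here):
#
# throttle = NewnymThrottle(_set_new_ip, every=10)
#
# class ProxyMiddleware(object):
#     def process_request(self, request, spider):
#         throttle.tick()
#         request.meta['proxy'] = 'http://127.0.0.1:8118'
```

Injecting renew and clock keeps the counting logic independent of Tor, so it can be checked without a running control port.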

But Tor connects with the same IP every time

That is a documented Tor feature:

An important thing to note is that a new circuit does not necessarily mean a new IP address. Paths are randomly selected based on heuristics like speed and stability. There are only so many large exits in the Tor network, so it's not uncommon to reuse an exit you have had previously.

This is why using the code below can result in the same IP address being used again.

from stem import Signal
from stem.control import Controller


with Controller.from_port(port=9051) as controller:
    controller.authenticate(password='tor_password')
    controller.signal(Signal.NEWNYM)
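
One way to cope with this is to check the exit IP after each NEWNYM and ask again if it was used recently. The following is a rough sketch of that idea only, not the implementation of any particular library. get_fresh_ip, renew, current_ip and max_attempts are hypothetical names: renew would wrap the NEWNYM snippet above, and current_ip would fetch your visible IP through the proxy. A real version would probably also wait a little between the signal and the IP check.

```python
from collections import deque


def get_fresh_ip(renew, current_ip, recent_ips, max_attempts=10):
    """Request new identities until the exit IP is not in `recent_ips`.

    renew       -- zero-argument callable that signals NEWNYM (hypothetical hook)
    current_ip  -- zero-argument callable returning the current exit IP
    recent_ips  -- deque(maxlen=N) holding the last N IPs that were used
    """
    for _ in range(max_attempts):
        renew()
        ip = current_ip()
        if ip not in recent_ips:
            # Remember it; the deque drops the oldest entry when full.
            recent_ips.append(ip)
            return ip
    raise RuntimeError("no unused exit IP after %d attempts" % max_attempts)


# Example with stand-in callables instead of a real Tor connection:
fake_ips = iter(["1.1.1.1", "1.1.1.1", "2.2.2.2"])
recent = deque(["1.1.1.1"], maxlen=3)
fresh = get_fresh_ip(lambda: None, lambda: next(fake_ips), recent)
```

The maxlen of the deque plays the role of a reuse threshold: an IP becomes acceptable again only after that many other IPs have been used.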

https://github.com/DusanMadar/TorIpChanger helps you manage this behavior. Disclaimer: I wrote TorIpChanger.

I have also put together a guide on how to use Python with Tor and Privoxy: https://gist.github.com/DusanMadar/8d11026b7ce0bce6a67f7dd87b999f6b


Here is an example of how to use TorIpChanger (pip install toripchanger) in your ProxyMiddleware.

from toripchanger import TorIpChanger


# A Tor IP will be reused only after 10 different IPs were used.
ip_changer = TorIpChanger(reuse_threshold=10)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        ip_changer.get_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])

Or, if you want to use a different IP after every 10 requests, you can do something like the following.

from toripchanger import TorIpChanger


# A Tor IP will be reused only after 10 different IPs were used.
ip_changer = TorIpChanger(reuse_threshold=10)


class ProxyMiddleware(object):
    _requests_count = 0

    def process_request(self, request, spider):
        # Renew the IP once 10 requests have gone out on the current one.
        if self._requests_count >= 10:
            self._requests_count = 0
            ip_changer.get_new_ip()
        self._requests_count += 1

        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])