Scrape different URLs with different user agents and IP addresses

I have a program that needs to scrape several different URLs with scrapy, and I need it to use a different user agent and IP address for each URL. So if I'm scraping 50 URLs, each URL should have its own unique user agent and IP address that is used only while that URL is being scraped; the IP address and user agent should change when the program moves on to the next URL.

I can already rotate user agents randomly, but now I need to pair each user agent with a specific URL so that the same user agent is used with the same URL every time. As for the IP addresses, I can't even get them to rotate randomly, let alone pair each one with a unique URL.

SplashSpider.py

from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import GameItem

class MySpider(Spider):
        name = 'splash_spider' # Name of Spider
        start_urls = [''] # url(s)
#
#
#
#......
# all the urls I need to scrape, 50+ will go in these lines
        def start_requests(self):
                for url in self.start_urls:
                        yield SplashRequest(url=url, callback=self.parse, args={"wait": 3})
        #Scraping
        def parse(self, response):
                item = GameItem()
                for game in response.css(""): #loop through the page contents until all needed info is scraped
                    # Card Name
                    item["card name"] = game.css("").extract_first() #html code corresponding to card name
                    # Price
                    item["Price"] = game.css("td.deckdbbody.search_results_9::text").extract_first() #code corresponding to price
                    yield item            

settings.py


SPIDER_MODULES = ['scrapy_javascript.spiders']
NEWSPIDER_MODULE = 'scrapy_javascript.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_javascript (http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# -----------------------------------------------------------------------------
# USER AGENT
# -----------------------------------------------------------------------------
DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}

USER_AGENTS = [
   #Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    #Internet Explorer
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]


# -----------------------------------------------------------------------------
# IP ADDRESSES
# -----------------------------------------------------------------------------
PROXY_POOL_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

ROTATING_PROXY_LIST = [
    'http://199.89.192.76:8050',
    'http://199.89.192.77:8050',
    'http://199.89.192.78:8050',
    'http://199.89.193.2:8050',
    'http://199.89.193.3:8050',
    'http://199.89.193.4:8050',
    'http://199.89.193.5:8050',
    'http://199.89.193.6:8050',
    'http://199.89.193.7:8050',
    'http://199.89.193.8:8050',
    'http://199.89.193.9:8050',
    'http://199.89.193.10:8050',
    'http://199.89.193.11:8050',
    'http://199.89.193.12:8050',
    'http://199.89.193.13:8050',
    'http://199.89.193.14:8050',
    'http://204.152.114.226:8050',
    'http://204.152.114.227:8050',
    'http://204.152.114.228:8050',
    'http://204.152.114.229:8050',
    'http://204.152.114.230:8050',
    'http://204.152.114.232:8050',
    'http://204.152.114.233:8050',
    'http://204.152.114.234:8050',
    'http://204.152.114.235:8050',
    'http://204.152.114.236:8050',
    'http://204.152.114.237:8050',
    'http://204.152.114.238:8050',
    'http://204.152.114.239:8050',
    'http://204.152.114.240:8050',
    'http://204.152.114.241:8050',
    'http://204.152.114.242:8050',
    'http://204.152.114.243:8050',
    'http://204.152.114.244:8050',
    'http://204.152.114.245:8050',
    'http://204.152.114.246:8050',
    'http://204.152.114.247:8050',
    'http://204.152.114.248:8050',
    'http://204.152.114.249:8050',
    'http://204.152.114.250:8050',
    'http://204.152.114.251:8050',
    'http://204.152.114.252:8050',
    'http://204.152.114.253:8050',
    'http://204.152.114.254:8050',
]
SPLASH_URL = 'http://199.89.192.74:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

In the end, it should simply pair each URL I need to scrape with an IP address and user agent from the lists in my settings.py file.

This goes beyond the scope of a simple Stack Overflow question.

However, the general way to customize the requests made by a scrapy crawler is to write a downloader middleware [1].

In your case, you want to write a downloader middleware that would:

1. Generate profiles on spider start by making a list of `(ip, user-agent)` tuples
2. Make a round-robin (or alternative) queue of these profiles
3. Adjust every outgoing request with a profile taken from the queue

In short, the code would look something like this:

# middlewares.py
import random
from copy import copy

from scrapy import signals

class ProfileMiddleware:

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        mw = cls(crawler, *args, **kwargs)
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        mw.settings = crawler.settings
        return mw

    def spider_opened(self, spider, **kwargs):
        proxies = self.settings.getlist('PROXIES')
        user_agents = self.settings.getlist('USER_AGENTS')
        self.profiles = list(zip(proxies, user_agents))
        self.queue = copy(self.profiles)
        random.shuffle(self.queue)

    def process_request(self, request, spider):
        if not self.queue:
            self.queue = copy(self.profiles)
            random.shuffle(self.queue)

        profile = self.queue.pop()
        request.headers['User-Agent'] = profile[1]
        request.meta['proxy'] = profile[0]

I haven't tested this; it's only meant to illustrate the general idea.
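Note that the middleware above hands out profiles from a shuffled queue, so the same URL will not necessarily get the same profile on a re-run. If you need the pairing to be stable (the same user agent and proxy for the same URL every time, as the question asks), one option is to derive the profile from a hash of the URL instead of a queue. The helper `profile_for` below is a hypothetical sketch of that idea, not part of the middleware:

```python
import hashlib

def profile_for(url, profiles):
    # Hash the URL so the same URL always maps to the same profile,
    # regardless of request order or retries.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return profiles[int(digest, 16) % len(profiles)]

profiles = [
    ("http://199.89.192.76:8050", "agent-a"),
    ("http://199.89.192.77:8050", "agent-b"),
    ("http://199.89.192.78:8050", "agent-c"),
]

# The same URL always yields the same (proxy, user-agent) pair.
first = profile_for("http://example.com/page1", profiles)
second = profile_for("http://example.com/page1", profiles)
```

Inside `process_request` you would then call `profile_for(request.url, self.profiles)` instead of popping from the queue.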

Then activate it somewhere near the end of the middleware chain:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProfileMiddleware': 900, 
}
PROXIES = ['123', '456'...]
USER_AGENTS = ['firefox', 'chrome'...]
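One caveat with building the profiles via `zip(proxies, user_agents)`: the resulting list is only as long as the shorter of the two inputs, so any extra proxies (or extra user agents) are silently dropped. A quick illustration with made-up values:

```python
proxies = [
    "http://199.89.192.76:8050",
    "http://199.89.192.77:8050",
    "http://199.89.192.78:8050",
]
user_agents = ["agent-a", "agent-b"]  # one fewer than proxies

# zip stops at the shorter sequence, so the third proxy is never used.
profiles = list(zip(proxies, user_agents))
```

If your PROXIES and USER_AGENTS lists differ in length (as in the settings above, with ~44 proxies and ~23 user agents), you may want to cycle the shorter list instead, e.g. with `itertools.cycle`.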

1 - More on scrapy's downloader middlewares: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html