How to rotate proxies and user agents

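I am writing a Scrapy program that logs in to this website and scrapes data for different playing cards: http://www.starcitygames.com/buylist/. I only scrape the ID values from that URL; then I use each ID number to request a second URL, scrape the JSON page it returns, and repeat this for all 207 card categories. That looks a little more authentic than going straight to the URLs with the JSON data. I have written Scrapy programs with multiple URLs before and was able to set those up to rotate proxies and user agents, but how would I do that in this program? Since there is technically only one URL, is there a way to make it switch to a different proxy and user agent after scraping about 5 JSON pages? I do not want the rotation to be random. I want the same JSON page to be scraped with the same proxy and user agent every time. I hope that all makes sense. This may be a bit broad for Stack Overflow, but I have no idea how to do it, so I figured I would ask anyway and see whether anyone has good ideas.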

# Import needed functions and call needed python files
import scrapy
import json
from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import DataItem

# Spider class
class LoginSpider(scrapy.Spider):
    # Name of spider
    name = "LoginSpider"

    # URL where the data is located
    start_urls = ["http://www.starcitygames.com/buylist/"]

    # Login function
    def parse(self, response):
        # Log in using email and password, then proceed to the after_login function
        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'example@email.com', 'ex_usr_pass': 'password'},
            callback=self.after_login,
        )


    # Function to parse the buylist website
    def after_login(self, response):
        # Loop through the page, get the ID number for each category of card, plug it into the end of the
        # URL below, then go to the parse_data function
        for category_id in response.xpath('//select[@id="bl-category-options"]/option/@value').getall():
            yield scrapy.Request(
                url="http://www.starcitygames.com/buylist/search?search-type=category&id={category_id}".format(category_id=category_id),
                callback=self.parse_data,
            )

    # Function to parse JSON data
    def parse_data(self, response):
        # Decode the JSON body (response.text replaces the deprecated body_as_unicode())
        json_response = json.loads(response.text)

        # Scrape category name
        category = json_response['search']
        # Loop where the other data is located
        for result in json_response['results']:
            # Each result is itself a sequence of card entries
            for card in result:
                # Create a fresh DataItem (from items.py) per card so earlier items are not overwritten
                item = DataItem()
                item['Category'] = category
                item['Card_Name'] = card['name']
                item['Condition'] = card['condition']
                item['Rarity'] = card['rarity']
                item['Foil'] = card['foil']
                item['Language'] = card['language']
                item['Buy_Price'] = card['price']
                # Return the scraped data
                yield item

I would recommend this package: Scrapy-UserAgents

pip install scrapy-useragents

In your settings.py file:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}

Example list of user agents to rotate:

More User Agents

USER_AGENTS = [
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/57.0.2987.110 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.79 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) '
     'Gecko/20100101 '
     'Firefox/55.0'),  # firefox
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/61.0.3163.91 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/62.0.3202.89 '
     'Safari/537.36'),  # chrome
    ('Mozilla/5.0 (X11; Linux x86_64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/63.0.3239.108 '
     'Safari/537.36'),  # chrome
]

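NOTE: this middleware cannot handle the situation where COOKIES_ENABLED is True and the website binds its cookies to the User-Agent; in that case the spider may produce unpredictable results.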
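One way to avoid mixing a cookie session with a changing User-Agent is to pin each user agent to its own cookie session. The sketch below is only an illustration under assumptions: `session_for` and the placeholder `USER_AGENTS` values are hypothetical names, and it relies on Scrapy's `cookiejar` request meta key, which keeps separate cookie sessions within one spider. It also assumes the explicit `User-Agent` header is not overwritten by another middleware.

```python
# Placeholder values; replace with real user agent strings.
USER_AGENTS = ['agent1', 'agent2', 'agent3']


def session_for(slot):
    """Return Request keyword arguments that tie one cookie session to one user agent.

    `slot` is any non-negative integer; slots that map to the same user agent
    also map to the same cookiejar, so a session never changes its User-Agent.
    """
    index = slot % len(USER_AGENTS)
    return {
        'meta': {'cookiejar': index},                    # separate cookie session
        'headers': {'User-Agent': USER_AGENTS[index]},   # fixed agent for that session
    }
```

A request could then be built as `scrapy.Request(url, callback=self.parse_data, **session_for(slot))`. The `cookiejar` key is not sticky, so it has to be passed along on every follow-up request, and for a site that requires a login each cookie session would need its own login request.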

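Proxies: I would look for a company that offers a rotator so you don't have to mess with it yourself. However, you can also write a custom middleware, and I will show you how. What you want to do is edit the process_request method. There you change the proxy and the user agent at the same time.

User agents: you can use the Scrapy random user agent middleware, https://github.com/cleocn/scrapy-random-useragent, or you can use a middleware like the one below, which is how you change anything about the request object, including proxies or any other headers.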

# middlewares.py
import random

from scrapy import signals

# Either keep your user agents in a file or load them some other way; this assumes you can get them into a list.
user_agents = ['agent1', 'agent2', 'agent3', 'agent4']
proxies = ['ip1:port1', 'ip2:port2', 'ip3:port3', 'ip4:port4']

class MyMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create the middleware instance.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        request.headers['User-Agent'] = random.choice(user_agents) # !! These 2 lines
        request.meta['proxy'] = random.choice(proxies) # !! These 2 lines
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

# settings.py


DOWNLOADER_MIDDLEWARES = {
    'project.middlewares.MyMiddleware': 543,
}

Reference: https://docs.scrapy.org/en/latest/topics/request-response.html
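The middleware above picks a proxy and a user agent at random for every request. The question asks for something different: switch after about 5 JSON pages, and always use the same pair for the same page. A minimal sketch of that idea follows, assuming the spider attaches a hypothetical `page_index` key to `request.meta`. The names `FixedRotationMiddleware`, `pick_identity`, and `GROUP_SIZE`, as well as the placeholder lists, are illustrative and not part of Scrapy.

```python
# middlewares.py (sketch; all names and values below are assumptions)

USER_AGENTS = ['agent1', 'agent2', 'agent3']  # placeholder user agent strings
PROXIES = ['http://ip1:port1', 'http://ip2:port2', 'http://ip3:port3']  # placeholders
GROUP_SIZE = 5  # number of consecutive JSON pages that share one proxy/user agent


def pick_identity(page_index, group_size=GROUP_SIZE):
    """Return the (proxy, user_agent) pair for the page at position page_index.

    Pages 0-4 share the first pair, pages 5-9 the second, and so on,
    wrapping around when the lists are exhausted.
    """
    slot = page_index // group_size
    return PROXIES[slot % len(PROXIES)], USER_AGENTS[slot % len(USER_AGENTS)]


class FixedRotationMiddleware:
    """Downloader middleware that assigns a deterministic proxy and user agent."""

    def process_request(self, request, spider):
        page_index = request.meta.get('page_index')
        if page_index is None:
            # Requests without the key (e.g. the login) are left untouched.
            return None
        proxy, user_agent = pick_identity(page_index)
        request.meta['proxy'] = proxy
        request.headers['User-Agent'] = user_agent
        return None  # continue processing this request
```

In `after_login`, the index could come from `enumerate(...)` over the category IDs and be passed as `meta={'page_index': page_index}` on each `scrapy.Request`. The class would be registered in `DOWNLOADER_MIDDLEWARES` the same way as `MyMiddleware`. A page only keeps the same pair across runs while the category order on the site stays the same; deriving the slot from the category ID itself would avoid that assumption.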

User agents: I have used this tool, which keeps your user agent list updated with the newest and most commonly used user agents: https://pypi.org/project/shadow-useragent/


    from shadow_useragent import ShadowUserAgent
    shadow_useragent = ShadowUserAgent()

    print(shadow_useragent.firefox)
    # Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:67.0) Gecko/20100101 Firefox/67.0
    print(shadow_useragent.chrome)
    # Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36
    print(shadow_useragent.safari)
    # Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
    print(shadow_useragent.edge)
    # Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134
    print(shadow_useragent.ie)
    # Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko
    print(shadow_useragent.android)
    # Mozilla/5.0 (Linux; U; Android 4.3; en-us; SM-N900T Build/JSS15J) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
    print(shadow_useragent.ipad)
    # Mozilla/5.0 (iPad; CPU OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Mobile/15E148 Safari/604.1

    # and the best one, random via real-world browser usage statistics
    print(shadow_useragent.random)
    # Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0

    # if you want to exclude mobiles (some websites will display different pages)
    print(shadow_useragent.random_nomobile)
    # Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.90 Safari/537.36