Share USER_AGENT between scrapy_fake_useragent and cfscrape scrapy extension
I am trying to build a crawler for a Cloudflare-protected website using `cfscrape`, privoxy and tor, together with `scrapy_fake_useragent`.

I am using the `cfscrape` Python extension to bypass the Cloudflare protection in scrapy, and `scrapy_fake_useragent` to inject random but realistic USER_AGENT values into the headers.

As the `cfscrape` documentation states: you must use the same user-agent string for obtaining tokens and for making requests with those tokens, otherwise Cloudflare will flag you as a bot.
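In other words, the clearance cookies and the user-agent string they were issued for must always travel together. A minimal sketch of that rule (the helper name `cf_request_kwargs` is mine, not part of the cfscrape API):

```python
def cf_request_kwargs(tokens, agent):
    """Bundle the Cloudflare clearance cookies with the exact
    User-Agent string they were issued for, so both are always
    sent together on the follow-up request."""
    return {
        'cookies': tokens,                 # e.g. {'cf_clearance': '...'}
        'headers': {'User-Agent': agent},  # must match the token-fetching UA
    }
```

Passing these kwargs to `scrapy.Request` (or any HTTP client) keeps the cookie/user-agent pair from drifting apart.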
To collect the cookies needed by `cfscrape`, I need to redefine the `start_requests` method in my spider class, like this:
def start_requests(self):
    cf_requests = []
    for url in self.start_urls:
        # get_tokens returns the Cloudflare clearance cookies and the
        # user-agent string they were issued for
        token, agent = cfscrape.get_tokens(url)
        self.logger.info("agent = %s", agent)
        cf_requests.append(scrapy.Request(url=url,
                                          cookies=token,
                                          headers={'User-Agent': agent}))
    return cf_requests
My problem is that the user_agent collected in `start_requests` is not the same as the user_agent chosen randomly by `scrapy_fake_useragent`, as you can see:
2017-01-11 12:15:08 [airports] INFO: agent = Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0
2017-01-11 12:15:08 [scrapy.core.engine] INFO: Spider opened
2017-01-11 12:15:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-11 12:15:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-11 12:15:08 [scrapy_fake_useragent.middleware] DEBUG: Assign User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10 to Proxy http://127.0.0.1:8118
I registered my extensions in `settings.py` like this:
RANDOM_UA_PER_PROXY = True
HTTPS_PROXY = 'http://127.0.0.1:8118'
COOKIES_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'flight_project.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
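The `flight_project.middlewares.ProxyMiddleware` registered above is not shown in the question; a minimal sketch of such a middleware, assuming it only routes every request through the local Privoxy endpoint, could look like this (the class body is my guess, following Scrapy's downloader-middleware contract):

```python
class ProxyMiddleware:
    """Route every request through the local Privoxy/Tor proxy.

    Scrapy's built-in HttpProxyMiddleware (priority 110 above) then
    picks up request.meta['proxy'] and applies it to the download.
    """

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # HTTPS_PROXY is read from settings.py, falling back to the
        # Privoxy default used in the question.
        return cls(crawler.settings.get('HTTPS_PROXY', 'http://127.0.0.1:8118'))

    def process_request(self, request, spider):
        request.meta['proxy'] = self.proxy_url
        # Returning None lets the request continue down the middleware chain.
        return None
```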
I need the same user_agent in both places, so how can I pass the user agent chosen randomly by `scrapy_fake_useragent` into the `start_requests` method that feeds the `cfscrape` extension?
I finally found the answer with the help of the `scrapy_fake_useragent` developer. Disable the `'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400` line in `settings.py`, then write this source code:
class AirportsSpider(scrapy.Spider):
    name = "airports"
    start_urls = ['https://www.flightradar24.com/data/airports']
    allowed_domains = ['flightradar24.com']
    # fake_useragent's UserAgent provides real-world user-agent strings
    ua = UserAgent()
    ...

    def start_requests(self):
        cf_requests = []
        # pick one random user-agent, then use it both for fetching the
        # Cloudflare tokens and for the requests that carry them
        user_agent = self.ua.random
        self.logger.info("RANDOM user_agent = %s", user_agent)
        for url in self.start_urls:
            token, agent = cfscrape.get_tokens(url, user_agent)
            self.logger.info("token = %s", token)
            self.logger.info("agent = %s", agent)
            cf_requests.append(scrapy.Request(url=url,
                                              cookies=token,
                                              headers={'User-Agent': agent}))
        return cf_requests