Python - Unable to rotate userAgent dynamically in Scrapy
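I am overriding Scrapy's default HttpProxyMiddleware and UserAgentMiddleware with my own implementations, which rotate the user agent and the IP address by picking each value at random from a supplied list. The IP changes on every request, but the user agent does not, and I can't figure out why.
Here is my implementation of the classes:
RotateUserAgentMiddleware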
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # Add desired logging message here.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request)
            )
ProxyMiddleware
import random

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware

class ProxyMiddleware(HttpProxyMiddleware):
    def __init__(self, proxy_ip=''):
        self.proxy_ip = proxy_ip

    def process_request(self, request, spider):
        ip = random.choice(self.proxy_list)
        if ip:
            request.meta['proxy'] = ip
            print(request.meta)
        return request
The changes I made to DOWNLOADER_MIDDLEWARES in settings.py are:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'IpRotation.ProxyMiddleware.ProxyMiddleware': 800,
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware': 790
}
The IP and user-agent values printed to my console for each request:
2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '195.*.*.*:3120'}
2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '195.*.*.*:3120'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '200.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '213.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '200.*.*.*:80'}
2015-10-09 15:51:48 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '58.*.*.*:80'}
I have not changed USER_AGENT in settings.py, because the value has to be assigned randomly:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'IPProxy (+http://www.yourdomain.com)'
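The part of this project I don't understand is the values assigned in DOWNLOADER_MIDDLEWARES. None tells Scrapy to ignore the class, but what do the integers mean? Could someone please help?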
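Change the value of 'IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware' in DOWNLOADER_MIDDLEWARES to a number less than 400.
The integers set the order in which the middlewares run: process_request is called in ascending order of these values. In Scrapy 1.0.x the built-in UserAgentMiddleware sits at 400, and the key meant to disable it is misspelt in your settings ('scrapy.downloadermiddleware.useragent...' instead of 'scrapy.downloadermiddlewares.useragent...'), so it still runs and sets the header first. Your middleware at 790 then calls setdefault, which leaves the existing 'User-Agent' value untouched.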
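The effect of the priority can be sketched in plain Python without importing Scrapy. This is only a simulation under stated assumptions: the built-in middleware is taken to be registered at 400 (the Scrapy 1.0.x default) and to use setdefault, the user-agent strings in USER_AGENTS are made up, and resulting_user_agent is a hypothetical helper that imitates how entries set to None are dropped and the rest are run in ascending order.

```python
import random

DEFAULT_UA = "Scrapy/1.0.3 (+http://scrapy.org)"
# Hypothetical user-agent pool, for illustration only.
USER_AGENTS = ["ExampleBrowser/1.0", "ExampleBrowser/2.0"]

BUILTIN = "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware"
CUSTOM = "IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware"

# Assumed default priority of the built-in middleware in Scrapy 1.0.x.
BASE_MIDDLEWARES = {BUILTIN: 400}


def builtin_user_agent(headers):
    # Imitates the built-in behaviour: fill the header only if it is absent.
    headers.setdefault("User-Agent", DEFAULT_UA)


def rotating_user_agent(headers):
    # Same setdefault call as the middleware in the question.
    headers.setdefault("User-Agent", random.choice(USER_AGENTS))


HANDLERS = {BUILTIN: builtin_user_agent, CUSTOM: rotating_user_agent}


def resulting_user_agent(project_middlewares):
    """Merge settings, drop entries set to None, run the rest in ascending order."""
    merged = {**BASE_MIDDLEWARES, **project_middlewares}
    active = sorted((prio, path) for path, prio in merged.items() if prio is not None)
    headers = {}
    for _prio, path in active:
        HANDLERS[path](headers)
    return headers["User-Agent"]


# As posted: the misspelt key disables nothing, so the built-in runs at 400 first.
as_posted = {
    "scrapy.downloadermiddleware.useragent.UserAgentMiddleware": None,
    CUSTOM: 790,
}
# Option 1: run the custom middleware before the built-in one.
lower_priority = {CUSTOM: 390}
# Option 2: spell the built-in path correctly so that None really disables it.
fixed_path = {BUILTIN: None, CUSTOM: 790}

print(resulting_user_agent(as_posted))
print(resulting_user_agent(lower_priority))
print(resulting_user_agent(fixed_path))
```

With the settings as posted, the simulated request keeps the default Scrapy user agent. Either lowering the priority below 400 or correcting the module path lets the randomly chosen value through. Assigning the header directly (request.headers['User-Agent'] = ua) instead of using setdefault would be a third way to make the custom value win regardless of order.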