Manually set a proxy as dead using the scrapy-rotating-proxies package
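I use proxy rotation in my project to avoid being banned by the website. I have to scrape a list of URLs from http://website/0001 to http://website/9999, and when the site detects that I am scraping, it sends me to website/contact.html.
My proxy list is already in the settings: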
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
I created this spider:
next_page_url = response.url[17:]  # get the relative URL from website/page
if next_page_url == "contact.html":
    absolute_next_page = response.urljoin(last_page)
    yield Request(absolute_next_page)
    # should try the same page with a different proxy
else:
    next_page_url = int(next_page_url) + 1
    last_page = str(next_page_url).zfill(4)
    absolute_next_page = response.urljoin(last_page)
    yield Request(absolute_next_page)
But it fails with the error UnboundLocalError: local variable 'last_page' referenced before assignment
How can I mark the proxy as dead from within this spider? Or is there another way to do the same thing?
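What exactly are you asking?
You say you get this error:
UnboundLocalError: local variable 'last_page' referenced before assignment
This error means you are trying to use a variable that has not been assigned a value yet. In your code, last_page is only assigned in the else branch, so the contact.html branch reads it before it exists.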
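To see the error in isolation, here is a minimal sketch that reproduces it outside Scrapy. The function name next_page is made up for illustration; the branch structure mirrors the spider code in the question.

```python
def next_page(segment):
    if segment == "contact.html":
        # last_page is only assigned in the else branch below,
        # so reading it here raises UnboundLocalError
        return last_page
    else:
        last_page = str(int(segment) + 1).zfill(4)
        return last_page


print(next_page("0001"))  # 0002

try:
    next_page("contact.html")
except UnboundLocalError as exc:
    print("UnboundLocalError:", exc)
```

Each call to the callback starts with fresh local variables, so a value assigned while handling one response is not available when handling the next one.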
So to prevent this error, change your code like this:
import random
from scrapy import Request

def parse(self, response):
    page = response.url[17:]  # relative URL, e.g. "0001" or "contact.html"
    if page == "contact.html":
        # Assumes the site redirected us here: retry the page that was
        # originally requested (RedirectMiddleware keeps it in meta['redirect_urls']).
        original_url = response.meta.get('redirect_urls', [response.url])[0]
        req = Request(url=original_url, dont_filter=True)
        # choose a random proxy from the list in the settings
        req.meta['proxy'] = random.choice(self.settings.getlist('ROTATING_PROXY_LIST'))
        yield req
    else:
        last_page = str(int(page) + 1).zfill(4)
        yield Request(response.urljoin(last_page))
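As for marking a proxy as dead: as far as I know, scrapy-rotating-proxies lets you plug in your own ban detection through the ROTATING_PROXY_BAN_POLICY setting. The policy is a class with response_is_ban and exception_is_ban methods that return True (ban), False (not a ban) or None (unknown). Below is a minimal sketch, assuming the site answers a detected scraper with a redirect to contact.html. The class name ContactPageBanPolicy and the module path myproject.policy are made up for illustration.

```python
class ContactPageBanPolicy:
    """Hypothetical policy: being sent to contact.html counts as a ban."""

    def response_is_ban(self, request, response):
        # Case 1: the raw 301/302 response that points at the contact page
        location = response.headers.get('Location') or b''
        if isinstance(location, bytes):
            location = location.decode('latin-1')
        if response.status in (301, 302) and location.endswith('contact.html'):
            return True
        # Case 2: the redirect was already followed to the contact page
        return response.url.endswith('contact.html')

    def exception_is_ban(self, request, exception):
        # None means "unknown": do not change the proxy state on exceptions
        return None


# settings.py (the module path is an assumption):
# ROTATING_PROXY_BAN_POLICY = 'myproject.policy.ContactPageBanPolicy'
```

With a policy like this, the middleware should mark the proxy as dead and retry the request through another one, so the spider callback would not need to handle contact.html itself. Note that this sketch replaces the package's default checks; if you want to keep them, subclassing rotating_proxies.policy.BanDetectionPolicy and calling super() is probably the better route.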