Signal.NEWNYM not giving new ip address when used in scrapy middleware
I'm using a scrapy web crawler with privoxy and tor. Everything is configured correctly, and I can crawl through the tor network via privoxy.
I want the IP used to scrape each address to change with every request (or every x requests). I'm using controller.signal(Signal.NEWNYM)
and a proxy middleware to try to do this, following the answer here: , but I'm not getting any IP changes.
This is the middleware used to change the tor circuit and IP:
from stem import Signal
from stem.control import Controller


def _set_new_ip():
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='password')
        controller.signal(Signal.NEWNYM)


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        _set_new_ip()
        request.meta['proxy'] = 'http://127.0.0.1:8118'
        spider.log('Proxy : %s' % request.meta['proxy'])
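For completeness, the middleware is enabled in the project's settings.py roughly like this (the module path 'myproject.middlewares' and the priority 100 are placeholders; adjust them to your project):

```python
# settings.py (sketch) -- 'myproject.middlewares' is a hypothetical module path;
# point it at wherever ProxyMiddleware is actually defined.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 100,
}
```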
I know that changing the tor circuit doesn't necessarily mean a change of IP, but I tested controller.signal(Signal.NEWNYM)
in a separate script and found that changing the tor circuit does lead to periodic IP changes. This is the script I used to test:
import requests
from stem import Signal
from stem.control import Controller


def set_new_ip():
    """Change IP using TOR"""
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password='password')
        controller.signal(Signal.NEWNYM)


while True:
    set_new_ip()
    local_proxy = '127.0.0.1:8118'
    http_proxy = {
        'http': local_proxy,
        'https': local_proxy
    }
    current_ip = requests.get(
        url='http://icanhazip.com/',
        proxies=http_proxy,
        verify=False
    )
    print(current_ip.content)
From this script I get output like the following, showing periodic IP changes:
b'109.70.100.27\n'
b'109.70.100.27\n'
b'109.70.100.27\n'
b'109.70.100.27\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'198.98.58.135\n'
b'185.220.101.2\n'
b'185.220.101.2\n'
b'185.220.101.2\n'
b'185.220.101.2\n'
b'185.220.101.2\n'
b'185.220.101.2\n'
b'185.220.101.2\n'
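One likely reason the changes above are periodic rather than per-request is that Tor rate-limits NEWNYM (it silently ignores signals that arrive too soon after the last accepted one). A minimal sketch of a guard that only forwards the signal when enough time has passed; the send_newnym callable here stands in for the real controller.signal(Signal.NEWNYM) call, so this is an illustration rather than the middleware as posted:

```python
import time


class NewnymThrottle:
    """Forward a NEWNYM request only if at least `min_interval` seconds
    have passed since the last one we forwarded; otherwise skip it,
    since Tor would ignore it anyway."""

    def __init__(self, send_newnym, min_interval=10.0, clock=time.monotonic):
        self.send_newnym = send_newnym  # e.g. lambda: controller.signal(Signal.NEWNYM)
        self.min_interval = min_interval
        self.clock = clock
        self._last = None

    def request_new_identity(self):
        now = self.clock()
        if self._last is not None and now - self._last < self.min_interval:
            return False  # too soon; signal would be dropped by Tor
        self.send_newnym()
        self._last = now
        return True
```

With an injected fake clock this is easy to exercise: two calls 5 seconds apart forward only one signal, while a call 12 seconds later forwards another.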
But my spider doesn't get these periodic changes. In ip-log.csv I just get a huge list of the same IP address repeated over and over. What am I doing wrong?
This is the crawler code I'm using:
import csv

import scrapy


class Spider(scrapy.Spider):
    name = 'spider'
    dir = '/a/path'
    with open(dir + 'results.csv', 'w', newline=''):  # create/empty .csv file to record results from scraping
        pass
    with open(dir + 'ip-log.csv', 'w', newline=''):  # create/empty .csv file to record ip used for each request
        pass

    def start_requests(self):
        url = 'https://url.com'
        yield scrapy.Request(url, callback=self.parse)

    # collect listed urls
    def parse(self, response):
        path = '/response/xpath/@href'
        if response.xpath(path):
            for href in response.xpath(path).extract():
                yield scrapy.Request(url=response.urljoin(href), callback=self.save_result)
            url = response.request.url.split('&')
            for item in url:
                if item.startswith('index='):
                    page_index = item.split('=')[-1]
            next_page = ['index=' + str(int(page_index) + 24) if x.startswith('index=') else x for x in url]
            next_page = '&'.join(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse)
        # use icanhazip.com to get ip used for request
        yield scrapy.Request('https://icanhazip.com', callback=self.check_ip, dont_filter=True)

    # record ip
    def check_ip(self, response):
        ip = response.xpath('/html/body/p').extract()
        dir = '/a/path'
        with open(dir + '/ip-log.csv', 'a+', newline='') as f:  # write request ip in .csv file
            writer = csv.writer(f)
            writer.writerow([ip])
        yield scrapy.Request('https://icanhazip.com', callback=self.parse, dont_filter=True)

    # visit each url and save results
    def save_result(self, response):
        dir = '/a/path'
        path = '/desired/xpath'
        result = response.xpath(path).extract()
        with open(dir + '/results.csv', 'a+', newline='') as f:
            writer = csv.writer(f)
            writer.writerow([result])  # save results to results.csv
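One thing worth noting about check_ip: icanhazip.com serves its response as text/plain, not HTML, so an XPath like /html/body/p can come back empty depending on how the body is parsed. Extracting the IP from the raw body is more robust; a small Scrapy-independent sketch:

```python
def extract_ip(body: bytes) -> str:
    """icanhazip returns just the IP address followed by a newline."""
    return body.decode('utf-8').strip()
```

In the spider this would be used as extract_ip(response.body) instead of the XPath lookup.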
Apparently tor doesn't want to switch IPs when visiting icanhazip.com. I tried the same code on a different site ('http://whatsmyuseragent.org/') and the IP is now changing periodically. I tested this with the relevant middleware disabled (http://whatsmyuseragent.org/ then shows the same unhidden IP, with no periodic changes).