Trying to respond to Amazon's captcha with Scrapy: strange behavior in a spider generator
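For research purposes, I am building a crawler for Amazon, but it gets caught by their captcha.
So I wrote a captcha solver, but I cannot submit the captcha form.
The problem is that if I put a yield FormRequest inside a method, the method never seems to be called.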
class Havaianas2Spider(scrapy.Spider):
    name = 'coleta_dados_grafo'
    rank_path = sorted([x for x in os.listdir('scraps') if 'links_base' in x], reverse=True)[0]
    lista_links = pd.read_csv('scraps/' + rank_path)
    start_urls = lista_links['links'].values
    custom_settings = {'FEED_URI': "scraps/produtos_%(time)s.csv",
                       'FEED_FORMAT': 'csv'}
    final_path = '/?th=1&psc=1'

    def base_path_get(self, response):
        dp_idx = response.request.url.find('/dp/') + 4
        base_path = response.request.url[:dp_idx]
        return base_path

    def solve_captcha(self, response, origin_method):
        self.logger.info('SOLVING CAPTCHA!')
        captcha_url = response.xpath('//div[@class="a-row a-text-center"]/img/@src').extract_first()
        img = load_url(captcha_url)
        captcha_string = break_captcha(img)
        img.save('C:/Users/Bruno Aquino/Documents/ecom_scraper/amazon_scraper/amazon_scraper/captchas/{}.jpg'.format(
            captcha_string))
        yield FormRequest.from_response(response,
                                        formdata={'field-keywords': captcha_string},
                                        callback=origin_method)

    def verify_if_captcha(self, response):
        captcha_url = response.xpath('//div[@class="a-row a-text-center"]/img/@src').extract_first()
        if captcha_url:
            self.logger.info('PAGE {} GOT BY CAPTCHA!'.format(response.request.url))
            return True
        else:
            return False

    def parse(self, response):
        captcha = self.verify_if_captcha(response)
        if captcha:
            self.solve_captcha(response, self.parse)
        else:
            base_path = self.base_path_get(response)
            asin_colors = response.xpath('//div[@id="cerberus-data-metrics"]/@data-asin').extract() +\
                [x[4:14] for x in response.xpath('//li[contains(@id,"color_name_")]/@data-dp-url').extract() if '/dp/' in x]
            for asin in asin_colors:
                new_path = base_path + asin + self.final_path
                if asin:
                    yield scrapy.Request(
                        response.urljoin(new_path),
                        callback=self.parse_l2)
The Amazon captcha form is below:
<form method="get" action="/errors/validateCaptcha" name="">
    <input type=hidden name="amzn" value="Xnnhl7YtGcH60X2yPaN7eA==" />
    <input type=hidden name="amzn-r" value="/Capodarte-Chinelo-Preto-38/dp/B07N13Q5F2/?th=1&psc=1" />
    <div class="a-row a-spacing-large">
        <div class="a-box">
            <div class="a-box-inner">
                <h4>Type the characters you see in this image:</h4>
                <div class="a-row a-text-center">
                    <img src="https://images-na.ssl-images-amazon.com/captcha/yniigayf/Captcha_kbknwlcmvm.jpg">
                </div>
                <div class="a-row a-spacing-base">
                    <div class="a-row">
                        <div class="a-column a-span6">
                            <label for="captchacharacters">Type characters</label>
                        </div>
                        <div class="a-column a-span6 a-span-last a-text-right">
                            <a onclick="window.location.reload()">Try different image</a>
                        </div>
                    </div>
                    <input autocomplete="off" spellcheck="false" id="captchacharacters" name="field-keywords" class="a-span12" autocapitalize="off" autocorrect="off" type="text">
                </div>
            </div>
        </div>
    </div>
    <div class="a-section a-spacing-extra-large">
        <div class="a-row">
            <span class="a-button a-button-primary a-span12">
                <span class="a-button-inner">
                    <button type="submit" class="a-button-text">Continue shopping</button>
                </span>
            </span>
        </div>
    </div>
</form>
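I put two log statements in the code.
The first one is inside verify_if_captcha:
    self.logger.info('PAGE {} GOT BY CAPTCHA!'.format(response.request.url))
This one is printed.
The second one is inside solve_captcha:
    self.logger.info('SOLVING CAPTCHA!')
This one is never printed.
Can anyone help me?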
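Currently, your FormRequest object is never returned to Scrapy for processing. Because solve_captcha contains a yield, it is a generator function: calling it only creates a generator object, and its body (including the log call) does not run until something iterates over it.
Replace
    self.solve_captcha(response, self.parse)
with
    yield from self.solve_captcha(response, self.parse)
so that parse passes the yielded FormRequest on to Scrapy.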
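The difference between calling a generator method and delegating to it with yield from can be seen in a minimal sketch that uses plain Python and no Scrapy. The functions parse_broken and parse_fixed, the has_captcha flag, the log list, and the placeholder strings are hypothetical stand-ins for the spider's methods and requests, not part of the original code.

```python
def solve_captcha(log):
    # Stand-in for the spider's solve_captcha: the body runs only
    # when the generator is iterated.
    log.append("SOLVING CAPTCHA!")
    yield "FormRequest placeholder"


def parse_broken(has_captcha, log):
    if has_captcha:
        # Creates a generator object and discards it; nothing runs.
        solve_captcha(log)
    else:
        yield "product request"


def parse_fixed(has_captcha, log):
    if has_captcha:
        # Iterates solve_captcha and re-yields everything it yields.
        yield from solve_captcha(log)
    else:
        yield "product request"


broken_log, fixed_log = [], []
broken_out = list(parse_broken(True, broken_log))
fixed_out = list(parse_fixed(True, fixed_log))

print(broken_out, broken_log)
print(fixed_out, fixed_log)
```

In the broken variant both the output and the log stay empty, which matches the symptom that 'SOLVING CAPTCHA!' is never logged. In the fixed variant the log call runs and the placeholder request reaches the caller, which in a real spider would be Scrapy's engine.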