extract reCaptcha from web page to be completed externally via cURL and then return results to view page
I am creating a web scraper for personal use that scrapes car dealership sites based on my personal input, but I'm trying to gather data from a few sites that block me with a redirected captcha page. Requesting the current target site with cURL returns this HTML:
<html>
<head>
<title>You have been blocked</title>
<style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
</head>
<body style="margin:0">
<p id="cmsg">Please enable JS and disable any ad blocker</p>
<script>
var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
<script src="https://ct.captcha-delivery.com/c.js"></script>
</body>
</html>
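The `dd` object embedded in that blocked page is what identifies the DataDome challenge. As a minimal sketch (the regex and the assumption that all values are single-quoted strings are mine, not a documented DataDome format), its fields can be pulled out of the raw HTML like this:

```javascript
// Minimal sketch: extract the DataDome `dd` object from the blocked page's
// HTML so its fields (host, hash, client id) can be inspected externally.
// The parsing approach is a heuristic, not a documented DataDome format.
function extractDd(html) {
  const m = html.match(/var dd=\{([^}]*)\}/);
  if (!m) return null;
  const dd = {};
  // The object literal uses single quotes, so JSON.parse won't accept it;
  // split the key/value pairs by hand instead.
  for (const pair of m[1].split(",")) {
    const kv = pair.match(/'([^']*)':'([^']*)'/);
    if (kv) dd[kv[1]] = kv[2];
  }
  return dd;
}

const blocked = "<script>var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}</script>";
console.log(extractDd(blocked).host); // geo.captcha-delivery.com
```

This only recovers the challenge metadata; actually solving the captcha still has to happen in a real browser context, as discussed below.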
I am using this to scrape the page:
<?php
function web_scrape($url)
{
$ch = curl_init();
$imei = "013977000272744";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035; _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
'imei' => $imei,
));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close($ch); // close the handle before returning; placed after the return it would never run
return $server_output;
}
echo web_scrape($url);
?>
And to reiterate what I want to do: I want to collect the captcha from this page so that, when I want to view a page's details on the external site, I can fill in the captcha on my own site and then scrape the page I originally intended to.
Any response would be great!
Seeing as the community on Stack Overflow didn't get around to answering my question, I had to go through hours of trial and error to find a solution. Since I genuinely want to give back to the community, I figured I'd answer my own question:
The problem here was that the site I was scraping had scraper detection in place, not reCAPTCHA as I had previously thought. So after some tinkering with my crawler, I managed to give it the ability to bypass the site's detection. Seeing as no one else seems to have run into this problem, I won't bother sharing the code, but the key was to better present my scraper as an actual browser.
If anyone else finds themselves running into this kind of problem, feel free to contact me and I'll share some of my code.
Given the high demand for the code, here is my upgraded crawler that bypasses this particular problem. However, my attempts to fetch the captcha were unsuccessful, and I still haven't worked out how to retrieve it.
include "simple_html_dom.php";
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
// This function is where the magic comes from. It bypasses every piece of security carsales.com.au can throw at me
function get_web_page( $url ) {
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_SSL_VERIFYPEER => false // Disabled SSL Cert checks
);
$ch = curl_init( $url ); // initialise the cURL handle we will use to fetch the page
curl_setopt_array( $ch, $options ); // apply the options defined above to the handle
$content = curl_exec( $ch ); // execute the request; this variable holds the page's HTML
$err = curl_errno( $ch ); // cURL error number, 0 on success (transport-level errors only)
$errmsg = curl_error( $ch ); // human-readable cURL error message, empty string on success
$header = curl_getinfo( $ch ); // request/response metadata (HTTP status, timings, etc.) as an array
curl_close( $ch ); // close the handle and free its resources
$header['errno'] = $err; // append the error number to the metadata array
$header['errmsg'] = $errmsg; // append the error message to the metadata array
$header['content'] = $content; // append the page's HTML to the metadata array
return $header; // return the metadata, error info and content together in one array
}
// using the function we just made, we fetch the URL generated by the form to get a developer view of the scrape
$response_dev = get_web_page($url);
// print_r($response_dev);
$response = end($response_dev); // end() returns the last element of the array, which is 'content' (the HTML); the rest is diagnostic info for my eyes only in case the site runs into an issue
DataDome currently uses reCAPTCHA v2 and GeeTest captchas, so your script should do the following:
- Navigate to the redirect https://geo.captcha-delivery.com/captcha/?initialCid=….
- Detect which type of captcha is used.
- Obtain a token for this captcha using any captcha-solving service, such as Anti Captcha.
- Submit the token and check whether you are redirected to the target page.
- Sometimes the target page contains an iframe whose address is https://geo.captcha-delivery.com/captcha/?initialCid=…, in which case you need to repeat from step 2 inside that iframe.
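Step 2 can be sketched as a simple string check on the captcha page's HTML. The marker strings below ("g-recaptcha" for reCAPTCHA v2, "geetest" for GeeTest) are assumptions based on how these widgets are typically embedded, not a guaranteed contract:

```javascript
// Guess which captcha DataDome served by searching the captcha page's HTML
// for widget markers. The markers are heuristics, not a documented API.
function detectCaptchaType(html) {
  const lower = html.toLowerCase();
  if (lower.includes("g-recaptcha") || lower.includes("google.com/recaptcha")) {
    return "recaptcha-v2";
  }
  if (lower.includes("geetest")) {
    return "geetest";
  }
  return "unknown";
}

console.log(detectCaptchaType('<div class="g-recaptcha" data-sitekey="..."></div>')); // recaptcha-v2
```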
I'm not sure whether the steps above can be done in PHP, but you can do them with a browser-automation engine such as Puppeteer, a library for NodeJS. It launches a Chromium instance and emulates a real user's presence. NodeJS is a must-have for building professional scrapers, and it's worth investing some time in a YouTube course on it.
Here is a script that performs all the steps above: https://github.com/MoterHaker/bypass-captcha-examples/blob/main/geo.captcha-delivery.com.js
You'll need a proxy to bypass the GeeTest protection.