Extract reCAPTCHA from a web page to be completed externally via cURL, then return the results to view the page

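I am building a web scraper for personal use that scrapes car dealership websites based on my own input, but several of the sites I am trying to collect data from are blocked by a redirect to a captcha page. For the site I am currently scraping, cURL returns this HTML:
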
<html>
   <head>
      <title>You have been blocked</title>
      <style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
   </head>
   <body style="margin:0">
      <p id="cmsg">Please enable JS and disable any ad blocker</p>
      <script>
            var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
      </script>
      <script src="https://ct.captcha-delivery.com/c.js"></script>
   </body>
</html>
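
The block page above carries its parameters in the inline `dd` object. As a minimal sketch (not part of the original post), the object could be pulled out of the fetched HTML like this. The function name `extract_block_config` is hypothetical, and the quote swap assumes the values never contain quote characters.

```php
<?php
/**
 * Extract the `dd` object from a block page.
 * Returns an associative array, or null when no `var dd=...` is present.
 * (Hypothetical helper, not from the original post.)
 */
function extract_block_config(string $html): ?array
{
    // Capture the object literal assigned to `var dd`.
    if (!preg_match('/var\s+dd\s*=\s*(\{.*?\})/s', $html, $m)) {
        return null;
    }
    // The literal uses single quotes; swap them for double quotes so that
    // json_decode() accepts it. Assumes no quotes inside the values.
    $json   = str_replace("'", '"', $m[1]);
    $config = json_decode($json, true);

    return is_array($config) ? $config : null;
}

$html = <<<'HTML'
<script>
      var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
HTML;

$config = extract_block_config($html);
if ($config !== null) {
    echo "Blocked by: " . $config['host'] . PHP_EOL;
}
```

A `null` result means the response is probably the real page rather than the block page, so the scraper can branch on it.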

I am using this to fetch the page:

<?php

function web_scrape($url)
{
    $ch = curl_init();
    $imei = "013977000272744";

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035;  _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
    curl_setopt($ch, CURLOPT_POSTFIELDS, array(
        'imei' => $imei,
    ));

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

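    $server_output = curl_exec($ch);
    curl_close($ch); // close the handle before returning; code after a return never runs

    return $server_output;
}

// $url must be set to the page to scrape before this call.
echo web_scrape($url);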

?>

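To restate what I want to do: I want to pull the reCAPTCHA out of this page so that, when I want to view the page's details on my own external site, I can complete the reCAPTCHA there and then scrape the page I originally requested. Any response would be great!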

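Since the Stack Overflow community did not get around to answering my question, I went through hours of trial and error and found a solution. Because I genuinely want to contribute to the community, I am answering my own question: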

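The problem was that the site I was scraping had scraper detection in place, not reCAPTCHA as I had previously thought. After some tinkering with my scraper, I managed to get it past the site's detection. Since nobody else seemed to have this problem, I did not bother sharing the code at first, but in short I had to make my scraper present itself more convincingly as a real browser.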
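
The post does not say exactly what was changed, so the following is only a guess at what "presenting as a real browser" might involve: sending the request headers a desktop browser typically sends. The function name and every header value are illustrative assumptions, and nothing here is claimed to get past any particular detection system.

```php
<?php
/**
 * Build request headers resembling those of a desktop browser.
 * All values are illustrative assumptions, not taken from the original post.
 */
function browser_like_headers(): array
{
    return [
        'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
            . 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-US,en;q=0.9',
    ];
}

// Print the headers so they can be inspected before use.
foreach (browser_like_headers() as $header) {
    echo $header . PHP_EOL;
}
```

With cURL, the returned array can be passed as the value of `CURLOPT_HTTPHEADER`, and `CURLOPT_COOKIEJAR` / `CURLOPT_COOKIEFILE` can be pointed at the same file to keep cookies between requests.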

If anyone else finds themselves running into this kind of problem, feel free to contact me and I will share some of my code.

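Due to high demand for the code, here is my upgraded scraper that got around this particular problem. However, my attempts to fetch the captcha itself were unsuccessful, and I still have not worked out how to get it.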

  include "simple_html_dom.php";
  /**
   * Get a web file (HTML, XHTML, XML, image, etc.) from a URL.  Return an
   * array containing the HTTP server response header fields and content.
   */
  // This function is where the magic comes from. It bypasses every piece of security carsales.com.au can throw at me
  function get_web_page( $url ) {
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page as a string
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_USERAGENT      => "spider", // who am i
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect (seconds)
        CURLOPT_TIMEOUT        => 120,      // timeout on the whole transfer (seconds)
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        CURLOPT_SSL_VERIFYPEER => false     // disables SSL certificate checks (insecure)
    );

    $ch      = curl_init( $url );       // create a cURL handle for the URL
    curl_setopt_array( $ch, $options ); // apply all of the options above to the handle
    $content = curl_exec( $ch );        // run the request; holds the page HTML, or false on failure
    $err     = curl_errno( $ch );       // numeric cURL error code (0 means no error)
    $errmsg  = curl_error( $ch );       // human-readable error message (empty string when there is no error)
    $header  = curl_getinfo( $ch );     // transfer info (final URL, HTTP status code, timings, ...) as an array
    curl_close( $ch );                  // release the cURL handle

    $header['errno']   = $err;     // add the error code to the info array
    $header['errmsg']  = $errmsg;  // add the error message to the info array
    $header['content'] = $content; // add the page HTML as the last element of the info array
    return $header;                // return the transfer info, error details and content together
  }

  // Using the function we just made, fetch the URL generated by the form to get a developer view of the scrape.
  $response_dev = get_web_page($url);

  // print_r($response_dev);

  // end() returns the last element of the array, which is 'content' (the page HTML);
  // the remaining elements are diagnostics for my eyes only in case the site runs into an issue.
  $response = end($response_dev);
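
Because `get_web_page()` returns the `curl_getinfo()` array with `errno`, `errmsg` and `content` added, a caller can check the transfer before trusting the HTML. A minimal sketch, assuming that array shape; the helper name `response_body` and the sample arrays are hypothetical:

```php
<?php
/**
 * Return the page body from a get_web_page()-style array, or throw when
 * the transfer failed or the server answered with a non-2xx status.
 * (Hypothetical helper, not from the original post.)
 */
function response_body(array $response): string
{
    if (($response['errno'] ?? 0) !== 0) {
        throw new RuntimeException('cURL error: ' . ($response['errmsg'] ?? 'unknown'));
    }
    // 'http_code' is one of the keys returned by curl_getinfo().
    $status = $response['http_code'] ?? 0;
    if ($status < 200 || $status >= 300) {
        throw new RuntimeException("Unexpected HTTP status: $status");
    }
    return (string) ($response['content'] ?? '');
}

// Hand-made sample standing in for a real get_web_page() result.
$sample = ['http_code' => 200, 'errno' => 0, 'errmsg' => '', 'content' => '<html>ok</html>'];
echo response_body($sample) . PHP_EOL;
```

Reading `content` by key instead of relying on `end()` also keeps the code correct if more keys are appended to the array later.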

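DataDome currently uses reCAPTCHA v2 and GeeTest captchas, so your script should do the following:

  1. Navigate to the redirect https://geo.captcha-delivery.com/captcha/?initialCid=…
  2. Detect which type of captcha is in use.
  3. Obtain a token for that captcha using any captcha-solving service, such as Anti Captcha.
  4. Submit the token and check whether you are redirected to the target page.
  5. Sometimes the target page contains an iframe whose address is https://geo.captcha-delivery.com/captcha/?initialCid=…, so you need to repeat from step 2 inside that iframe.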
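
Step 2 above could be sketched in PHP as a simple search of the challenge page's HTML for vendor names. This is only an illustration: the function name `detect_captcha_type` is hypothetical, and the marker strings are assumptions about what such pages contain, not values confirmed by this thread.

```php
<?php
/**
 * Guess which captcha a challenge page uses by looking for vendor
 * markers in its HTML. The marker strings are illustrative assumptions.
 */
function detect_captcha_type(string $html): string
{
    $markers = [
        'recaptcha' => ['www.google.com/recaptcha', 'g-recaptcha'],
        'geetest'   => ['geetest'],
    ];
    foreach ($markers as $type => $needles) {
        foreach ($needles as $needle) {
            // Case-insensitive substring search.
            if (stripos($html, $needle) !== false) {
                return $type;
            }
        }
    }
    return 'unknown';
}

// Hypothetical page fragments, used only to exercise the function.
echo detect_captcha_type('<div class="g-recaptcha" data-sitekey="..."></div>') . PHP_EOL;
echo detect_captcha_type('<div id="GeeTest-box"></div>') . PHP_EOL;
echo detect_captcha_type('<p>No challenge here</p>') . PHP_EOL;
```

The result can then decide which kind of token to request in step 3.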

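I am not sure whether the steps above can be done in PHP, but you can do them with a browser automation engine such as Puppeteer, a library for NodeJS. It launches a Chromium instance and emulates a real user's presence. NodeJS is a must for building professional scrapers and is worth investing some time in YouTube courses. Here is a script that performs all of the steps above: https://github.com/MoterHaker/bypass-captcha-examples/blob/main/geo.captcha-delivery.com.js You will need a proxy to get past the GeeTest protection.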