Unable to scrape despite IP rotation

Question

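I need to scrape this page (classified ads): https://www.sahibinden.com/en/cars/used?date=1day&a5_min=2005&a5_max=2020

When I open it too many times, I get blocked, and changing the IP does not help. The odd thing is that the page works fine when I open it in the regular browser on my PC. It seems to be blocked only in WebKit.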

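  // Override method, body and headers on the first request only;
  // every later request passes through unchanged.
  await page.route("**/*", (route) => {
    if (!firstReq) {
      route.continue();
    } else {
      firstReq = false;
      route.continue({
        method: method,
        postData: data,
        headers: headers,
      });
    }
  });
  // page.goto() already resolves once the page has loaded. A following
  // page.waitForNavigation() would wait for a second navigation and can
  // time out, so it is omitted here.
  let pageRes = await page.goto(url);
  await page.unroute("**/*");
  return pageRes;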

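I know this is a website that tries to block bots, but what practices are there to avoid this? I have tried waiting, IP rotation, and user-agent rotation, and nothing seems to work.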

Answer 1

In §4.11 of their Terms of Use, they state that scraping their content is not allowed:

The use of the whole or any part of the "Portal" for [...] Automatic program on the site, robot, spider, web crawler , spider, data mining, data crawling etc. "screen scraping" software or systems, using automated tools or manual processes, [...] such uses will be prevented at the discretion of the OWNER. [...]

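So you can be sure that they are doing their best to prevent scraping.

There are some ways to work around these problems; I suggest reading Thomas Dondorf's great answer on the topic of headless browsers and reCaptcha blocking. I also strongly recommend considering his first option in the current case: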

Option 1: Stop crawling or try to use an official API. As the owner of the page does not want you to crawl that page, you could simply respect that decision and stop crawling. Maybe there is a documented API that you can use.

In general, a site's bot detection can treat headless and headful browsing very differently, whether or not you use the slowMo option of launch().
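To make the headless versus headful distinction concrete, here is a minimal sketch of how the two modes could be selected in Playwright. The helper names buildLaunchOptions and openListing, and the headed and slowMoMs parameters, are made up for illustration and are not part of Playwright or of the question's code. Only the headless and slowMo launch options come from the library itself. It shows how to switch modes for comparison and is not a way around the site's terms.

```javascript
// Hypothetical helper (not a Playwright API): builds the options object
// passed to browserType.launch(). `headless` and `slowMo` are real options.
function buildLaunchOptions({ headed = true, slowMoMs = 0 } = {}) {
  return {
    headless: !headed, // false opens a visible browser window
    slowMo: slowMoMs,  // delay, in milliseconds, added to each operation
  };
}

// Hypothetical helper: opens `url` in WebKit and returns the HTTP status of
// the main document, or null if no response object is available.
async function openListing(url, options = {}) {
  // Loaded lazily so buildLaunchOptions() works without Playwright installed.
  const { webkit } = require("playwright");
  const browser = await webkit.launch(buildLaunchOptions(options));
  try {
    const page = await browser.newPage();
    const response = await page.goto(url);
    return response ? response.status() : null;
  } finally {
    await browser.close();
  }
}
```

Calling openListing(url, { headed: false }) and openListing(url, { headed: true, slowMoMs: 250 }) against the same page and comparing the returned status codes would show whether the two modes are handled differently. The 250 ms value is an arbitrary example.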

node.js

playwright