Public LinkedIn page requires authentication in Puppeteer, but not when manually pasting the URL in Chromium/Chrome

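I am trying to open a public company page on LinkedIn with Puppeteer, but every time it gets redirected to an authentication form. This doesn't happen when I manually paste the URL in Chromium or Chrome.

Here is the code: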

const puppeteer = require("puppeteer");

(async () => {
    const url = "https://www.linkedin.com/company/google/";

    const browser = await puppeteer.launch({
        headless: false,
        args: [
            "--lang=en-GB",
            "--no-sandbox",
            "--disable-setuid-sandbox",
            "--disable-gpu",
            "--disable-dev-shm-usage",
        ],
        defaultViewport: null,
        pipe: true,
        slowMo: 30,
    });

    const page = await browser.newPage();

    await page.goto(url, {
        waitUntil: 'networkidle0',
    });

    await page.waitForSelector(".top-card-layout__entity-info-container", { timeout: 10000 });

    await page.close();
    await browser.close();
})();

The browser gets redirected to the authentication form.

This doesn't happen if I manually paste the URL https://www.linkedin.com/company/google/ in Chromium or Chrome.

What I have tried so far:

// [...]

const context = await browser.createIncognitoBrowserContext();
const page = await context.newPage();

// [...]
const puppeteer = require("puppeteer-extra");
puppeteer.use(require("puppeteer-extra-plugin-stealth")());

// [...]
const randomUserAgent = require("random-useragent");

// [...]

await page.setUserAgent(randomUserAgent.getRandom());

// [...]

Nothing worked. Is there anything else I can try?

Try a different user agent. Pick any one from: https://developers.whatismybrowser.com/useragents/explore/software_type_specific/web-browser/

More on setting user agents in Puppeteer: https://dev.to/sonyarianto/user-agent-string-difference-in-puppeteer-headless-and-headful-4aoh

Edit: before trying the above, maybe try the stealth add-on first: https://www.npmjs.com/package/puppeteer-extra-plugin-stealth

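Reason

This is due to Microsoft's extreme protection of profiles. If you are able to access public profiles in incognito mode, I think some shared cookie is responsible, but normally the AuthWall blocks you in this situation. For me, a login is always required, even in a non-incognito window.

Some background from data expert John Koala: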

When Microsoft bought LinkedIn they invested billions into the purchase. They also started to act, quite soon they battled scraping. Companies like the now famous, due to it’s court battle, “HiQ Labs” use the LinkedIn data to make a huge profit.

Now LinkedIn had the problem that public scraping is not a legal offense, they failed (like all other websites) t[o] prevent well developed public scraping.

So LinkedIn added and strengthened a feature called “Authwall”, that is a very sensitive scraping detection. It allows rarely any public views from non authorized accounts making scraping without account impossible.

Scraping with accounts is a legal offense and it’s a lot more difficult as accounts need to be maintained. This is when HiQ Labs and all other scraping companies went out of business. HiQ saw millions of profit going down the sink, they battled LinkedIn at court.

The only company left scraping them is “scraping.services“, it will stay interesting what is going to happen during the next years.

Source: John Koala, Why does LinkedIn no longer allow me to see public profiles without logging in? In: Quora

I am sure the fact that the whole ex-Puppeteer team now works at Microsoft will not make it any easier to trick the AuthWall (see: even puppeteer-extra-plugin-stealth is unable to access the page).
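
To confirm that the AuthWall is what the script is hitting, rather than waiting for the `waitForSelector` timeout, the URL the navigation ended on can be checked. The following is only a minimal sketch: `isAuthRedirect` and `AUTH_PATH_PREFIXES` are hypothetical names, and the path prefixes are assumptions based on commonly observed redirects, not a documented LinkedIn contract.

```javascript
// Sketch: classify the URL a navigation ended on.
// ASSUMPTION: these path prefixes are illustrative guesses and may change at any time.
const AUTH_PATH_PREFIXES = ["/authwall", "/login", "/uas/login", "/checkpoint"];

function isAuthRedirect(finalUrl) {
    let parsed;
    try {
        parsed = new URL(finalUrl);
    } catch (err) {
        return false; // not an absolute URL, nothing to classify
    }

    // Only classify URLs on linkedin.com or one of its subdomains.
    const isLinkedIn =
        parsed.hostname === "linkedin.com" ||
        parsed.hostname.endsWith(".linkedin.com");
    if (!isLinkedIn) {
        return false;
    }

    // Match the prefix exactly or as a leading path segment,
    // so that e.g. "/login-help" is not treated as "/login".
    return AUTH_PATH_PREFIXES.some(
        (prefix) =>
            parsed.pathname === prefix ||
            parsed.pathname.startsWith(prefix + "/")
    );
}

// Usage inside the script from the question, right after page.goto(url):
//   if (isAuthRedirect(page.url())) {
//       throw new Error("Redirected to an authentication page: " + page.url());
//   }
```

Failing fast like this does not get around the AuthWall; it only makes the cause of the failure visible in the logs.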


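Solution

The only way to access LinkedIn pages reliably is to log in through the form (or to use a Chrome profile that is already logged in and has a valid session cookie).

Update: Since scraping with an existing account itself violates the LinkedIn User Agreement, this is not recommended. My solution above only applies to one-time visits (which is not a valid scenario anyway). So the final answer is: accessing these profiles with Puppeteer is not possible.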