Apify: Preserve headers in RequestQueue

Question

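I'm trying to crawl our local Confluence installation with the PuppeteerCrawler. My strategy is to log in first, then extract the session cookies and use them in the headers of the start URL. The code is as follows: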

First, I log in 'by foot' to extract the relevant credentials:

const Apify = require("apify");

const browser = await Apify.launchPuppeteer({ slowMo: 500 });
const page = await browser.newPage();
await page.goto('https://mycompany/confluence/login.action');

await page.focus('input#os_username');
await page.keyboard.type('myusername');
await page.focus('input#os_password');
await page.keyboard.type('mypasswd');
await page.keyboard.press('Enter');
await page.waitForNavigation();

// Get cookies and close the login session
const cookies = await page.cookies();
await browser.close();
const cookie_jsession = cookies.find(cookie => cookie.name === "JSESSIONID");
const cookie_crowdtoken = cookies.find(cookie => cookie.name === "crowd.token_key");

Then I build the crawler structure with the prepared request headers:

const startURL = {
    url: 'https://mycompany/confluence/index.action',
    method: 'GET',
    headers:
    {
        Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
        Cookie: `${cookie_jsession.name}=${cookie_jsession.value}; ${cookie_crowdtoken.name}=${cookie_crowdtoken.value}`,
    }
}

const requestQueue = await Apify.openRequestQueue();
await requestQueue.addRequest(new Apify.Request(startURL));
const pseudoUrls = [ new Apify.PseudoUrl('https://mycompany/confluence/[.*]')];

const crawler = new Apify.PuppeteerCrawler({
    launchPuppeteerOptions: { headless: false, slowMo: 500 },
    requestQueue,
    handlePageFunction: async ({ request, page }) => {

        const title = await page.title();

        console.log(`Title of ${request.url}: ${title}`);
        console.log(await page.content());

        await Apify.utils.enqueueLinks({
            page,
            selector: 'a:not(.like-button)',
            pseudoUrls,
            requestQueue
        });

    },
    maxRequestsPerCrawl: 3,
    maxConcurrency: 10,
});

await crawler.run();

The by-foot login and the cookie extraction seem to be fine (the "curlified" request works perfectly), but Confluence doesn't accept the login via Puppeteer / headless Chromium. The headers seem to get lost somehow...

What am I doing wrong?

Answer 1

Without first going into the details of why the headers don't work, I would suggest defining a custom gotoFunction in the PuppeteerCrawler options, such as:

{
    // ...
    gotoFunction: async ({ request, page }) => {
        await page.setCookie(...cookies); // From page.cookies() earlier.
        return page.goto(request.url, { timeout: 60000 })
    }
}

This way, you don't need to do any parsing, and the cookies are automatically injected into the browser before each page load.
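If you would rather inject only the session cookies instead of everything page.cookies() returned, a small filter is enough. The following is a minimal sketch, not part of the Apify SDK: pickCookies is a hypothetical helper, and the sample cookie values are made up for illustration. The cookie names are the ones from the question.

```javascript
// Hypothetical helper (not an Apify or Puppeteer API):
// keep only the cookies whose names are listed in `names`.
function pickCookies(cookies, names) {
    const wanted = new Set(names);
    return cookies.filter((cookie) => wanted.has(cookie.name));
}

// Sample data shaped like the objects returned by page.cookies().
// The values are invented for this example.
const sampleCookies = [
    { name: 'JSESSIONID', value: 'abc123', domain: 'mycompany', path: '/confluence' },
    { name: 'crowd.token_key', value: 'tok456', domain: 'mycompany', path: '/' },
    { name: 'some.other.cookie', value: 'x', domain: 'mycompany', path: '/' },
];

const sessionCookies = pickCookies(sampleCookies, ['JSESSIONID', 'crowd.token_key']);
console.log(sessionCookies.map((cookie) => cookie.name));

// Inside gotoFunction the result would be used like this:
//     await page.setCookie(...sessionCookies);
```

Because the filtered objects keep their original fields (domain, path, and so on), they can be passed to page.setCookie unchanged.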

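As a note, modifying the default request headers when using a headless browser is not good practice, because it may get you blocked on websites that match the received headers against a list of known browser fingerprints.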

Update:

The section below is no longer relevant, because you can now override headers using the Request class as expected.

The headers problem is a complex one involving request interception in Puppeteer within the Apify SDK. Here's the related GitHub issue. Unfortunately, overriding headers via a Request object currently does not work in PuppeteerCrawler, which is why you were unsuccessful.
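For the header-based route that the update says now works, the Cookie header can be built from the cookie objects rather than assembled by hand. This is a minimal sketch under assumptions: toCookieHeader and buildRequestOptions are hypothetical helpers, the cookie values are made up, and the resulting object simply mirrors the shape the question passes to new Apify.Request().

```javascript
// Hypothetical helper: serialize cookie objects into a Cookie header value,
// i.e. "name1=value1; name2=value2".
function toCookieHeader(cookies) {
    return cookies.map((cookie) => `${cookie.name}=${cookie.value}`).join('; ');
}

// Hypothetical helper: build a plain options object in the same shape
// as the startURL object from the question.
function buildRequestOptions(url, cookies) {
    return {
        url,
        method: 'GET',
        headers: { Cookie: toCookieHeader(cookies) },
    };
}

// Cookie values are invented for this example.
const options = buildRequestOptions('https://mycompany/confluence/index.action', [
    { name: 'JSESSIONID', value: 'abc123' },
    { name: 'crowd.token_key', value: 'tok456' },
]);
console.log(options.headers.Cookie);

// As in the question, the object would then be enqueued with:
//     await requestQueue.addRequest(new Apify.Request(options));
```

Keep in mind the note above about fingerprinting: setting cookies through page.setCookie leaves the browser's default headers untouched, so it is usually the safer choice.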
