Puppeteer: how to wait only for the first response (HTML)

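I am using puppeteer-cluster to crawl web pages.

If I open many pages at once on each website (8-10 pages), the connection slows down and I get lots of timeout errors like this:

TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded

I only need the HTML code of each page. I don't need to wait for domcontentloaded and so on.

Is there a way to tell page.goto() to wait only for the first response from the web server? Or do I need to use a technology other than Puppeteer?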

domcontentloaded is the event for the initial HTML content.

The DOMContentLoaded event fires when the initial HTML document has been completely loaded and parsed, without waiting for stylesheets, images, and subframes to finish loading.

The following call resolves as soon as the initial HTML document has been loaded.

await page.goto(url, {waitUntil: 'domcontentloaded'})

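However, you can block images or stylesheets to save bandwidth and load faster when loading 10 pages at once.

Put the code below in the right place (before navigating with page.goto) and it will stop images, stylesheets, fonts, and scripts from loading.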

await page.setRequestInterception(true);
page.on('request', (request) => {
    if (['image', 'stylesheet', 'font', 'script'].indexOf(request.resourceType()) !== -1) {
        request.abort();
    } else {
        request.continue();
    }
});
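
The same interception logic can be packaged as a reusable handler, which is convenient when many pages are opened at once. This is only a sketch: makeRequestFilter is a hypothetical helper name, not part of Puppeteer, and it relies only on the request.resourceType(), request.abort(), and request.continue() calls used above. Keep in mind that blocking scripts means any content rendered by JavaScript will not appear in the HTML.

```javascript
// Hypothetical helper (not part of Puppeteer): builds a 'request' handler
// that aborts the listed resource types and lets everything else through.
function makeRequestFilter(blockedTypes) {
  const blocked = new Set(blockedTypes);
  return (request) => {
    if (blocked.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  };
}

// Usage with a real page (must run before page.goto):
//   await page.setRequestInterception(true);
//   page.on('request', makeRequestFilter(['image', 'stylesheet', 'font', 'script']));
```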

@user3817605, I have the perfect code for you. :)

/**
 * The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
 * event `domcontentloaded` at minimum. This function returns a promise that resolves as
 * soon as the specified page `event` happens.
 * 
 * @param {puppeteer.Page} page
 * @param {string} event Can be any event accepted by the method `page.on()`. E.g.: "requestfinished" or "framenavigated".
 * @param {number} [timeout] optional time to wait. If not specified, waits forever.
 */
function waitForEvent(page, event, timeout) {
  page.once(event, done);
  let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
  return new Promise(resolve => fulfill = resolve);
  function done() {
    clearTimeout(timeoutId);
    fulfill();
  }
}

You asked for a function that waits only for the first response, so you use this function like this:

page.goto(<URL>); // add .catch(() => {}) if you close the page early, to avoid errors being thrown to the console
await waitForEvent(page, 'response'); // after this line, the HTML response has already been received

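This is exactly what you asked for. But note that "response received" is not the same as "complete html response received". The former is the start of the response and the latter is its end. So you may want to use the event "requestfinished" instead of "response". In fact, you can use any event accepted by the Puppeteer Page. They are: close, console, dialog, domcontentloaded, error, frameattached, framedetached, framenavigated, load, metrics, pageerror, popup, request, requestfailed, requestfinished, response, workercreated, workerdestroyed.

Try these: requestfinished or framenavigated. Maybe they will suit you.

To help you decide which one suits you best, you can set up test code like this: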

const puppeteer = require('puppeteer');

/**
 * The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
 * event `domcontentloaded` at minimum. This function returns a promise that resolves as
 * soon as the specified page `event` happens.
 * 
 * @param {puppeteer.Page} page
 * @param {string} event Can be any event accepted by the method `page.on()`. E.g.: "requestfinished" or "framenavigated".
 * @param {number} [timeout] optional time to wait. If not specified, waits forever.
 */
function waitForEvent(page, event, timeout) {
  page.once(event, done);
  let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
  return new Promise(resolve => fulfill = resolve);
  function done() {
    clearTimeout(timeoutId);
    fulfill();
  }
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const cdp = await page.target().createCDPSession();
  await cdp.send('Network.enable');
  await cdp.send('Page.enable');
  const t0 = Date.now();
  page.on('request', req => console.log(`> ${Date.now() - t0} request start: ${req.url()}`));
  page.on('response', res => console.log(`< ${Date.now() - t0} response: ${res.url()}`));
  page.on('requestfinished', req => console.log(`. ${Date.now() - t0} request finished: ${req.url()}`));
  page.on('requestfailed', req => console.log(`E ${Date.now() - t0} request failed: ${req.url()}`));

  page.goto('https://www.google.com').catch(() => { });
  await waitForEvent(page, 'requestfinished');
  console.log(`\nThe page was released after ${Date.now() - t0}ms\n`);
  await page.close();
  await browser.close();
})();

/* The output should be something like this:

> 2 request start: https://www.google.com/
< 355 response: https://www.google.com/
> 387 request start: https://www.google.com/tia/tia.png
> 387 request start: https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
. 389 request finished: https://www.google.com/

The page was released after 389ms

*/
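
One caveat with waitForEvent as written: when the timeout elapses, the promise resolves exactly as if the event had fired, so the caller cannot tell the two cases apart, and the once listener stays registered. Below is a minimal sketch of a variant that resolves to true when the event fired and false on timeout. The name waitForEventOrTimeout is hypothetical, and the sketch assumes the page object exposes once and off, as both Node's EventEmitter and Puppeteer's Page do.

```javascript
// Hypothetical variant of waitForEvent: resolves to true if `event` fired,
// or to false if `timeout` milliseconds elapsed first.
// Without a numeric timeout it waits forever, like the original.
function waitForEventOrTimeout(page, event, timeout) {
  return new Promise((resolve) => {
    let timeoutId;
    const onEvent = () => {
      clearTimeout(timeoutId); // harmless when no timer was set
      resolve(true);
    };
    page.once(event, onEvent);
    if (typeof timeout === 'number' && timeout >= 0) {
      timeoutId = setTimeout(() => {
        page.off(event, onEvent); // do not leave a stale listener behind
        resolve(false);
      }, timeout);
    }
  });
}

// Usage with a real page:
//   page.goto(url).catch(() => {});
//   const arrived = await waitForEventOrTimeout(page, 'requestfinished', 5000);
//   if (!arrived) console.warn('no request finished within 5 s');
```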

I can see two other ways to achieve what you want: using page.waitForResponse and page.waitForFunction. Let's look at both.

With page.waitForResponse you can do something simple:

page.goto('https://www.google.com/').catch(() => {});
await page.waitForResponse('https://www.google.com/'); // don't forget to put the final slash

Simple, huh? If you don't like it, try page.waitForFunction and wait for the document to be created:

page.goto('https://www.google.com/').catch(() => {});
await page.waitForFunction(() => document); // you can use `window` too. It is almost the same

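This code waits until document exists. That happens when the first bit of HTML arrives and the browser starts building the DOM tree representation of the document.

But note that although both solutions are simple, neither of them waits until the whole HTML page/document has been downloaded. If you need that, you should modify the waitForEvent function from my other answer to accept the specific URL you want fully downloaded. Example: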

/**
 * The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
 * event `domcontentloaded` at minimum. This function returns a promise that resolves as
 * soon as the specified `requestUrl` resource has finished downloading, or `timeout` elapses.
 * 
 * @param {puppeteer.Page} page
 * @param {string} requestUrl The exact URL of the resource you want to wait for. Paths must end with a slash "/". Don't forget that.
 * @param {number} [timeout] optional time to wait. If not specified, waits forever.
 */
function waitForRequestToFinish(page, requestUrl, timeout) {
  page.on('requestfinished', onRequestFinished);
  let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
  return new Promise(resolve => fulfill = resolve);
  function done() {
    page.removeListener('requestfinished', onRequestFinished);
    clearTimeout(timeoutId);
    fulfill();
  }
  function onRequestFinished(req) {
    if (req.url() === requestUrl) done();
  }
}

Usage:

page.goto('https://www.amazon.com/').catch(() => {});
await waitForRequestToFinish(page, 'https://www.amazon.com/', 3000);
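
Because waitForRequestToFinish compares URLs with ===, a missing trailing slash on the origin is enough to make it wait until the timeout. One possible way to soften that, sketched here under the assumption that comparing normalized URLs is acceptable for your case, is to run both sides through the standard URL class first. The helper name sameUrl is hypothetical.

```javascript
// Hypothetical helper: compares two absolute URLs after normalization by the
// WHATWG URL parser, which lowercases the scheme and host and adds the root
// slash to a bare origin. Throws if either string is not a valid absolute URL.
function sameUrl(a, b) {
  return new URL(a).href === new URL(b).href;
}

// Inside onRequestFinished, the comparison could then become:
//   if (sameUrl(req.url(), requestUrl)) done();
```

Note that this only normalizes the root path: "https://www.amazon.com/a" and "https://www.amazon.com/a/" are still different URLs.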

A complete example with neat console.logs:

const puppeteer = require('puppeteer');

/**
 * The methods `page.waitForNavigation` and `frame.waitForNavigation` wait for the page
 * event `domcontentloaded` at minimum. This function returns a promise that resolves as
 * soon as the specified `requestUrl` resource has finished downloading, or `timeout` elapses.
 * 
 * @param {puppeteer.Page} page
 * @param {string} requestUrl The exact URL of the resource you want to wait for. Paths must end with a slash "/". Don't forget that.
 * @param {number} [timeout] optional time to wait. If not specified, waits forever.
 */
function waitForRequestToFinish(page, requestUrl, timeout) {
  page.on('requestfinished', onRequestFinished);
  let fulfill, timeoutId = (typeof timeout === 'number' && timeout >= 0) ? setTimeout(done, timeout) : -1;
  return new Promise(resolve => fulfill = resolve);
  function done() {
    page.removeListener('requestfinished', onRequestFinished);
    clearTimeout(timeoutId);
    fulfill();
  }
  function onRequestFinished(req) {
    if (req.url() === requestUrl) done();
  }
}

(async () => {
  const netMap = new Map();
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const cdp = await page.target().createCDPSession();
  await cdp.send('Network.enable');
  await cdp.send('Page.enable');
  const t0 = Date.now();
  cdp.on('Network.requestWillBeSent', ({ requestId, request: { url: requestUrl } }) => {
    netMap.set(requestId, requestUrl);
    console.log(`> ${Date.now() - t0}ms\t requestWillBeSent:\t${requestUrl}`);
  });
  cdp.on('Network.responseReceived', ({ requestId }) => console.log(`< ${Date.now() - t0}ms\t responseReceived:\t${netMap.get(requestId)}`));
  cdp.on('Network.dataReceived', ({ requestId, dataLength }) => console.log(`< ${Date.now() - t0}ms\t dataReceived:\t\t${netMap.get(requestId)} ${dataLength} bytes`));
  cdp.on('Network.loadingFinished', ({ requestId }) => console.log(`. ${Date.now() - t0}ms\t loadingFinished:\t${netMap.get(requestId)}`));
  cdp.on('Network.loadingFailed', ({ requestId }) => console.log(`E ${Date.now() - t0}ms\t loadingFailed:\t${netMap.get(requestId)}`));

  // The magic happens here
  page.goto('https://www.amazon.com').catch(() => { });
  await waitForRequestToFinish(page, 'https://www.amazon.com/', 3000);

  console.log(`\nThe page was released after ${Date.now() - t0}ms\n`);
  await page.close();
  await browser.close();
})();

/* OUTPUT EXAMPLE
[... lots of logs removed ...]
> 574ms  requestWillBeSent:     https://images-na.ssl-images-amazon.com/images/I/71vvXGmdKWL._AC_SY200_.jpg
< 574ms  dataReceived:          https://www.amazon.com/ 65536 bytes
< 624ms  responseReceived:      https://images-na.ssl-images-amazon.com/images/G/01/AmazonExports/Fuji/2019/February/Dashboard/computer120x._CB468850970_SY85_.jpg
> 628ms  requestWillBeSent:     https://images-na.ssl-images-amazon.com/images/I/81Hhc9zh37L._AC_SY200_.jpg
> 629ms  requestWillBeSent:     https://images-na.ssl-images-amazon.com/images/G/01/personalization/ybh/loading-4x-gray._CB317976265_.gif
< 631ms  dataReceived:          https://www.amazon.com/ 58150 bytes
. 631ms  loadingFinished:       https://www.amazon.com/

*/

This code logs lots of requests and responses, but it stops as soon as "https://www.amazon.com/" has been fully downloaded.
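
Since the goal in the question is the HTML itself, another option may be to read the body straight from the response object instead of from the page. The following is a sketch, not a tested recipe: fetchHtml and isMainDocumentResponse are hypothetical names, the 30000 ms default simply mirrors the timeout from the question, and redirect responses are skipped on the assumption that you want the final document (Puppeteer cannot read the body of a redirect response).

```javascript
// Hypothetical predicate factory: matches the response to the main frame's
// navigation request, ignoring 3xx redirects.
function isMainDocumentResponse(page) {
  return (response) => {
    const request = response.request();
    const status = response.status();
    return request.isNavigationRequest()
      && request.frame() === page.mainFrame()
      && (status < 300 || status >= 400);
  };
}

// Hypothetical helper: starts the navigation without awaiting it and returns
// the raw HTML sent by the server. Rejects if no matching response arrives
// within `timeout` milliseconds.
async function fetchHtml(page, url, timeout = 30000) {
  const responsePromise = page.waitForResponse(isMainDocumentResponse(page), { timeout });
  page.goto(url).catch(() => {}); // navigation errors are ignored on purpose
  const response = await responsePromise;
  return response.text(); // resolves once the response body has been received
}
```

The string returned this way is the HTML as sent by the server, not the DOM after scripts have run, which seems to match what the question asks for.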