How to dump more than <body> on chrome / chromium headless?

Chrome's documentation states:

The --dump-dom flag prints document.body.innerHTML to stdout:

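As the title says, how can I dump more of the DOM (ideally all of it) with headless Chromium? I can save the whole DOM manually through the developer tools, but I want a programmatic solution.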

Update 2019-04-23: Google has been very active on the headless front, and many updates have happened.

The answer below applies to v62. The current version is v73, and it is updated all the time: https://www.chromestatus.com/features/schedule

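I strongly recommend checking out puppeteer for any future development with headless Chrome. It is maintained by Google, and the npm package installs the required Chrome version for you. You just use the puppeteer API from the docs and do not have to worry about Chrome versions or about setting up the connection between headless Chrome and the dev tools API, which does 99% of the magic.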
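As a rough illustration of the puppeteer route, the sketch below loads a page and returns the serialized document, head included. The name `dumpFullHtml` and its injectable second parameter are hypothetical, chosen only for this example. `networkidle0` is just one possible wait condition, and the result reflects the DOM at the moment `page.content()` is called.

```javascript
// Sketch: return the full serialized HTML of a page using puppeteer.
// The second parameter defaults to the real puppeteer module; it is
// injectable only so the function can be exercised without a browser.
async function dumpFullHtml(url, puppeteer = require('puppeteer')) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until the network has been idle, so JS-set titles are in place.
    await page.goto(url, {waitUntil: 'networkidle0'});
    // page.content() serializes the whole document, not only <body>.
    return await page.content();
  } finally {
    // Always shut the browser down, even if navigation fails.
    await browser.close();
  }
}
```

With puppeteer installed, a call such as `dumpFullHtml('https://www.chromestatus.com/').then(console.log)` should print the whole document to stdout.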


Update 2017-10-29: Chrome already has a --dump-html flag, which returns the full HTML, not only the body.

v62 has it, and it is already on the stable channel.

The issue that fixed this: https://bugs.chromium.org/p/chromium/issues/detail?id=752747

Current Chrome status (version per channel): https://www.chromestatus.com/features/schedule

Old answer (legacy)

You can do it with the Chrome remote interface. I tried it and wasted a couple of hours trying to launch Chrome and get the full HTML, including the title, and I would say it is just not ready yet.

It works sometimes, but when I ran it in a production environment I got errors from time to time: all kinds of random errors, such as connection reset and no Chrome found to kill. These errors came up only occasionally, which makes them hard to debug.

I personally use --dump-dom to get the HTML when I need the body, and when I need the title I just use curl for now. Of course, Chrome can give you the title of an SPA, which curl alone cannot do if the title is set from JS. I will switch to Chrome once there is a stable solution.

I would love to have a --dump-html flag in Chrome that just returns all the HTML. If a Google engineer is reading this, please add such a flag to Chrome.

I've created an issue on the Chrome issue tracker. Please click the favorite "star" so that it gets noticed by Google developers:

https://bugs.chromium.org/p/chromium/issues/detail?id=752747

Here is a long list of all kinds of Chrome flags. I'm not sure whether it is complete, but I found nothing in it that dumps the title tag: https://peter.sh/experiments/chromium-command-line-switches/

This code is from Google's blog post; you can try your luck with it:

const CDP = require('chrome-remote-interface');

...

(async function() {

const chrome = await launchChrome();
const protocol = await CDP({port: chrome.port});

// Extract the DevTools protocol domains we need and enable them.
// See API docs: https://chromedevtools.github.io/devtools-protocol/
const {Page, Runtime} = protocol;
await Promise.all([Page.enable(), Runtime.enable()]);

Page.navigate({url: 'https://www.chromestatus.com/'});

// Wait for window.onload before doing stuff.
Page.loadEventFired(async () => {
  const js = "document.querySelector('title').textContent";
  // Evaluate the JS expression in the page.
  const result = await Runtime.evaluate({expression: js});

  console.log('Title of page: ' + result.result.value);

  protocol.close();
  chrome.kill(); // Kill Chrome.
});

})();

Source: https://developers.google.com/web/updates/2017/04/headless-chrome
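The snippet above only reads the title, but the same `Runtime` domain can evaluate any expression, so a minimal sketch for fetching the whole document could ask for `document.documentElement.outerHTML` instead. The helper name `getFullHtml` is hypothetical. It expects the `Runtime` object obtained from the protocol client as above. Note that `outerHTML` does not include the doctype.

```javascript
// Sketch: fetch the whole serialized document over the DevTools protocol.
// `Runtime` is the domain object from chrome-remote-interface, as in the
// snippet above; getFullHtml is a hypothetical helper name.
async function getFullHtml(Runtime) {
  const {result, exceptionDetails} = await Runtime.evaluate({
    // outerHTML of <html> covers <head> and <body>, but not the doctype.
    expression: 'document.documentElement.outerHTML',
    returnByValue: true,
  });
  if (exceptionDetails) {
    // The expression threw inside the page; surface it to the caller.
    throw new Error('Evaluation failed: ' + exceptionDetails.text);
  }
  return result.value;
}
```

Inside the `Page.loadEventFired` callback this would be used as `console.log(await getFullHtml(Runtime));` before closing the protocol connection.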

You are missing --headless; without it nothing is written to stdout.

chromium --incognito \
         --proxy-auto-detect \
         --temp-profile \
         --headless \
         --dump-dom https://127.0.0.1:8080/index.html

Pipe it all into | html2text to convert the HTML back into plain text.
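For a programmatic variant of the same idea, one option is to spawn the browser from Node and capture its stdout. This is only a sketch: the helper names `dumpDomArgs` and `dumpDom` are hypothetical, and the default binary name `chromium` is an assumption that varies by system (for example `chromium-browser` or `google-chrome`).

```javascript
const {execFile} = require('child_process');

// Build the argument list; the URL must come last.
function dumpDomArgs(url) {
  return ['--headless', '--dump-dom', url];
}

// Run the browser and resolve with whatever it printed to stdout.
// `binary` is an assumed executable name; adjust it for your system.
function dumpDom(url, binary = 'chromium') {
  return new Promise((resolve, reject) => {
    // Raise maxBuffer so larger pages are not cut off.
    const options = {maxBuffer: 64 * 1024 * 1024};
    execFile(binary, dumpDomArgs(url), options, (err, stdout) => {
      if (err) reject(err);
      else resolve(stdout);
    });
  });
}
```

A call such as `dumpDom('https://127.0.0.1:8080/index.html').then(console.log)` would then print the dumped markup, which can be post-processed in JS instead of piping it through html2text.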