Where to find entire HTML content in Chromium source code

Question

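I am currently trying to do the following: after a web page loads, check whether its URL matches some pattern (say www.wikipedia.com/*) and, if it does, parse the page's HTML content, much as one could with BeautifulSoup, and check whether the page contains a div with class foo and id boo. Any idea where I could write this code? That is, where can I get access to the URL, where do I need to listen to know that the page has finished loading so that I can then look at the URL and the HTML content, and where and how can I parse the HTML?

I tried looking at the code in src/chrome/browser/tab_contents, but could not find any reasonable place to do all of this.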

Answer 1

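Take a look at the following conceptual application layers, which represent how Chromium displays web pages:

(Image source: https://docs.google.com/drawings/d/1gdSTfvLxbJDbX8oiWo5LTwAmXmdMQvjoUhYEhfhj0-k/edit)

The different layers are described as: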

WebKit: Rendering engine shared between Safari, Chromium, and all other WebKit-based browsers. The Port is a part of WebKit that integrates with platform dependent system services such as resource loading and graphics.

Glue: Converts WebKit types to Chromium types. This is our "WebKit embedding layer." It is the basis of two browsers, Chromium, and test_shell (which allows us to test WebKit).

Renderer / Render host: This is Chromium's "multi-process embedding layer." It proxies notifications and commands across the process boundary.

WebContents: A reusable component that is the main class of the Content module. It's easily embeddable to allow multiprocess rendering of HTML into a view. See the content module pages for more information.

Browser: Represents the browser window, it contains multiple WebContentses.

Tab Helpers: Individual objects that can be attached to a WebContents (via the WebContentsUserData mixin). The Browser attaches an assortment of them to the WebContentses that it holds (one for favicons, one for infobars, etc).

Since your goal is to access and interpret a web page's HTML content by element and/or class, you could look at the renderer process, which uses Blink:

The renderers use the Blink open-source layout engine for interpreting and laying out HTML.

Blink has a WebDocument class that lets you access the HTML content and other properties of a web page:

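// Fetch the document of the main frame (WebDocument, WebElement and WebString
// are Blink public API types).
WebDocument document = GetMainFrame()->GetDocument();
// Look up a single element by its id attribute.
WebElement element = document.GetElementById(WebString::FromUTF8("example"));
// The URL of the page is available from the same object:
// document.Url();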
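
Once the URL is available as a string, the pattern check from the question (www.wikipedia.com/*) can be done with a plain wildcard match. The following is only a minimal sketch using the C++ standard library. MatchesWildcard is a hypothetical helper, not a Chromium API, and whether the string being matched still includes the scheme (for example "https://") is an assumption you would have to settle for your own patterns.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of Chromium): returns true when `text`
// matches `pattern`. A '*' in the pattern matches any run of characters,
// including an empty one; every other character must match exactly.
bool MatchesWildcard(std::string_view pattern, std::string_view text) {
  constexpr std::size_t kNone = std::string_view::npos;
  std::size_t p = 0;         // current position in pattern
  std::size_t t = 0;         // current position in text
  std::size_t star = kNone;  // position of the most recent '*' in pattern
  std::size_t retry = 0;     // text position that '*' is currently covering

  while (t < text.size()) {
    if (p < pattern.size() && pattern[p] == '*') {
      // Remember the star and first try to let it match nothing.
      star = p++;
      retry = t;
    } else if (p < pattern.size() && pattern[p] == text[t]) {
      // Literal characters agree, advance both.
      ++p;
      ++t;
    } else if (star != kNone) {
      // Mismatch: let the last '*' swallow one more character and retry.
      p = star + 1;
      t = ++retry;
    } else {
      return false;
    }
  }
  // Any trailing stars in the pattern can match the empty string.
  while (p < pattern.size() && pattern[p] == '*') ++p;
  return p == pattern.size();
}
```

For example, MatchesWildcard("www.wikipedia.com/*", "www.wikipedia.com/wiki/Chromium") is true, while the same pattern against "www.example.com/" is false.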

Answer 2

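The cleanest way is to go through the Chrome remote debugging protocol.

Use the DOM methods to get the root DOM node and then walk, search, or query the DOM.

This also makes testing simpler: you can implement the logic in your favorite scripting language using an existing client library (there are many), and once it works, implement it in C++.

If for some reason the processing has to happen inside Chromium, the next step would be to start a thread that connects to this and does the work.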
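
To make the DOM methods mentioned above a little more concrete, here is a rough sketch of the JSON commands a client might send over the debugging WebSocket. DOM.getDocument and DOM.querySelector are methods of the protocol's DOM domain; the helper functions themselves are hypothetical, and the node id passed to the second command is assumed to come from the reply to the first one. The sketch only builds the message text. Opening the connection and reading replies is left out, and the selector is not JSON-escaped.

```cpp
#include <string>

// Hypothetical helper: formats one protocol command as JSON text.
// `params_json` must already be a valid JSON object.
std::string MakeCommand(int id, const std::string& method,
                        const std::string& params_json) {
  return "{\"id\":" + std::to_string(id) + ",\"method\":\"" + method +
         "\",\"params\":" + params_json + "}";
}

// Step 1: ask for the root node of the page's DOM tree.
std::string GetDocumentCommand(int id) {
  return MakeCommand(id, "DOM.getDocument", "{}");
}

// Step 2: run a CSS selector below a node. `node_id` is assumed to be the
// root node id taken from the DOM.getDocument reply. The selector must not
// contain characters that need JSON escaping (no quotes or backslashes).
std::string QuerySelectorCommand(int id, int node_id,
                                 const std::string& selector) {
  return MakeCommand(id, "DOM.querySelector",
                     "{\"nodeId\":" + std::to_string(node_id) +
                         ",\"selector\":\"" + selector + "\"}");
}
```

For the check in the question, the selector would be "div#boo.foo", which matches a div with id boo and class foo.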

Answer 3

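You need to use a server-side library to parse the content of the requested HTML page. In Java, for example, there is the library "jsoup", and other server-side languages probably have an equivalent. The main problem you may run into is "access forbidden" because of security restrictions, but since you are not trying to access a REST service or anything like that, and only need to parse plain HTML to find a string pattern, this should be easy to do with "jsoup". There is a project in which something similar was written to access a website's pages and parse the response HTML string.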

Document doc = Jsoup.connect("http://jsoup.org").get();
Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"

See: https://jsoup.org/

google-chrome

chromium

webviewchromium