How to get just the HTML output structure of a site

Question

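I guess this shows my inexperience, but how can I get the rendered HTML of a site? For example, I am trying to retrieve the HTML structure of a Wix site (what the user actually sees on screen), but instead I get the large pile of scripts that make up the site. I am doing a small code test for scraping. Many thanks.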

Answer 1

OK, here we go. Sorry for the delay.

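I use Selenium to load the page, so I can be sure to capture all of the markup even if it is loaded via AJAX. Make sure you get the standalone library; that threw me for a loop.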

Once the HTML has been retrieved, I pass it to jsoup, which I use to walk the document and remove all of the text.

Example code follows:

// Selenium to grab the HTML.
// I chose this so that anything loaded via AJAX is captured too.
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

// jsoup for parsing the HTML
import org.jsoup.Jsoup;
import org.jsoup.parser.Parser;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;


public class Example  {
    public static void main(String[] args) {
        // Create a new instance of the Firefox driver.
        // Notice that the remainder of the code relies on the WebDriver interface,
        // not the implementation.
        WebDriver driver = new FirefoxDriver();

        // And now use this to visit Stack Overflow
        driver.get("http://stackoverflow.com/");

        // Get the page source
        String html = driver.getPageSource();


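        // Parse the markup with jsoup
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());

        // Walk every element and drop its direct text nodes,
        // leaving only the tag structure behind
        for (Element el : doc.select("*")) {
            if (!el.ownText().isEmpty()) {
                // textNodes() returns a copy, so removing while iterating is safe
                for (TextNode node : el.textNodes()) {
                    node.remove();
                }
            }
        }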

        System.out.println(doc);

        driver.quit();
    }
}

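I am not sure whether you also want to strip the attributes; at the moment they are kept. However, it is easy to modify the code so that some or all of the attributes are removed as well.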
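For instance, stripping every attribute could look roughly like the sketch below. It assumes jsoup is on the classpath; the class name StripAttributes, the helper stripAttributes, and the sample markup are made up for illustration and are not part of the code above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StripAttributes {

    // Hypothetical helper: removes every attribute from every element in the document.
    public static Document stripAttributes(Document doc) {
        for (Element el : doc.getAllElements()) {
            // asList() hands back a fresh list, so the element's attributes
            // can be removed while we iterate over it
            for (Attribute attr : el.attributes().asList()) {
                el.removeAttr(attr.getKey());
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Invented sample markup, just to show the effect
        Document doc = Jsoup.parse(
                "<div id=\"a\" class=\"b\"><p style=\"color:red\">hi</p></div>");
        System.out.println(stripAttributes(doc).body().html());
    }
}
```

To keep some attributes (say href), compare attr.getKey() against an allow-list before calling removeAttr. In the example above, the same loop could run on doc right before it is printed.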

Answer 2

If you only need the content of the page, you can append ?_escaped_fragment_ to each URL to get the static content.

_escaped_fragment_ is the standard approach used in AJAX crawling, for crawling pages that are dynamic in nature or are generated/rendered on the client side.

Wix-based sites support _escaped_fragment_.
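As a rough sketch of what that looks like in code, the helper below rewrites a URL into its _escaped_fragment_ form and fetches it. The class and method names are hypothetical, www.example.com is a placeholder, and the code assumes Java 11 or later (for java.net.http) and that the target site still answers such requests. A "#!" fragment, if present, is moved into the parameter value; URLEncoder escapes more characters than the scheme strictly requires, which a server that decodes the value should tolerate.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class EscapedFragment {

    // Hypothetical helper: builds the _escaped_fragment_ variant of a URL.
    public static String toEscapedFragmentUrl(String url) {
        String base = url;
        String fragment = "";
        int hash = url.indexOf('#');
        if (hash >= 0) {
            base = url.substring(0, hash);
            String rest = url.substring(hash + 1);
            // Only "#!" fragments carry state; a plain "#" fragment is dropped
            if (rest.startsWith("!")) {
                fragment = rest.substring(1);
            }
        }
        // URLEncoder writes spaces as '+'; switch them to %20 for use in a URL
        String encoded = URLEncoder.encode(fragment, StandardCharsets.UTF_8)
                .replace("+", "%20");
        String separator = base.contains("?") ? "&" : "?";
        return base + separator + "_escaped_fragment_=" + encoded;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder URL; substitute the site you want to scrape
        String target = toEscapedFragmentUrl("http://www.example.com/");

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(target)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // Prints whatever the server returns for the static snapshot request
        System.out.println(response.body());
    }
}
```

Unlike the Selenium approach, this needs no browser, but it only works for as long as the server chooses to serve a static snapshot for that parameter.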

Tags: html, curl, wget, web-scraping