Parse a single-page web application

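Is there a Java library that can parse single-page websites, such as those built with AngularJS?

According to jsoup's official documentation, it does not execute JavaScript.

The solution should not rely on any installed browser.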
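
To illustrate the jsoup point: a parser that does not run JavaScript only sees the markup the server sent, and for an AngularJS page that markup often still contains unrendered `{{ ... }}` bindings. The following is a minimal sketch using only the standard library; the class name `StaticHtmlCheck`, the method name, and the HTML snippets are made up for illustration, and the check is a heuristic (pages that use `ng-bind` or custom interpolation symbols would not be caught).

```java
import java.util.regex.Pattern;

// Hypothetical helper: flags HTML that still contains AngularJS-style
// {{ ... }} bindings, i.e. markup that no script has rendered yet.
final class StaticHtmlCheck {
    // Matches the shortest run of text enclosed in double curly braces.
    private static final Pattern BINDING =
            Pattern.compile("\\{\\{.*?\\}\\}", Pattern.DOTALL);

    static boolean hasUnrenderedBindings(String html) {
        return BINDING.matcher(html).find();
    }

    public static void main(String[] args) {
        // Made-up snippets: what a static fetch might return vs. rendered output.
        String raw = "<div ng-app><h1>{{ title }}</h1></div>";
        String rendered = "<div ng-app><h1>Hello</h1></div>";
        System.out.println(hasUnrenderedBindings(raw));      // true
        System.out.println(hasUnrenderedBindings(rendered)); // false
    }
}
```

If a check like this returns true for the fetched HTML, a JavaScript-capable client is probably needed to get the final content.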

Take a look at the link below; it may solve your problem.

try jsoup + manual parsing

As @JonasCz mentioned, try HtmlUnit.

The code might look like this:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

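public class Test {
    public static void main(String[] args) {
        // FIREFOX_24 exists only in older HtmlUnit releases.
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        try {
            HtmlPage page = webClient.getPage("https://docs.angularjs.org/api/ng/service/$http");
            System.out.println(page.asXml());
        } catch (Exception e) {
            // Report the failure instead of swallowing it and then using a null page.
            e.printStackTrace();
        } finally {
            // Release the client's resources whether or not the download succeeded.
            webClient.closeAllWindows();
        }
    }
}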

This is the correct code for downloading a page that uses AngularJS with HtmlUnit:
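// Additional imports beyond those in the first example:
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;

// URL is a placeholder for the address of the page to download.
final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);

// Re-synchronize AJAX calls where possible and silence CSS parser warnings.
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.setCssErrorHandler(new SilentCssErrorHandler());

webClient.getOptions().setCssEnabled(true);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setAppletEnabled(false);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setPopupBlockerEnabled(true);
webClient.getOptions().setTimeout(10000);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(true);
webClient.getOptions().setThrowExceptionOnScriptError(true);
webClient.getOptions().setPrintContentOnFailingStatusCode(true);

try {
    HtmlPage page = webClient.getPage(URL);
    // Wait up to 5 seconds for background JavaScript (the AngularJS rendering)
    // after the page has loaded; before getPage() there is nothing to wait for.
    webClient.waitForBackgroundJavaScript(5000);
    System.out.println(page.asText());
} catch (Exception e) {
    e.printStackTrace();
} finally {
    webClient.closeAllWindows();
}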
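
A fixed wait such as the 5000 ms above can be too short for a slow page or needlessly long for a fast one. One possible alternative, sketched here with only the standard library, is to poll a condition until it holds or a timeout expires. The class `Poll` and the method `waitUntil` are hypothetical names, not part of HtmlUnit; with HtmlUnit the condition might, for instance, check that `page.asText()` no longer contains `{{`.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HtmlUnit: polls a condition with a timeout.
final class Poll {
    // Re-checks `condition` every `intervalMs` milliseconds. Returns true as soon
    // as it holds, or false once `timeoutMs` has elapsed or the thread is interrupted.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long intervalMs) {
        final long start = System.nanoTime();
        final long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in condition that becomes true on the third check.
        int[] checks = {0};
        boolean ready = waitUntil(() -> ++checks[0] >= 3, 1000, 10);
        System.out.println(ready + " after " + checks[0] + " checks");
    }
}
```

The condition is always checked at least once, so an already-rendered page returns immediately without sleeping.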