How to add some wait time between the page request and DOM response in HtmlUnit?

Question

I'm trying to get all links related to a certain webpage (https://digital.utc.com/our-latest) using HtmlUnit, but apparently it does not retrieve all of the links on the page.

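I tried to add some wait time for HtmlUnit before the DOM is retrieved and assigned to the HtmlPage. I suspect that HtmlUnit retrieves the DOM and assigns it to the HtmlPage as soon as it connects to the web page using "WebClient.getPage()", without leaving any time for the page to load its data from the database. But I couldn't find any way to do this with HtmlUnit.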

    public void pageScrapping() throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        //Initializing the WebClient 
        WebClient webClient = new WebClient();
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setThrowExceptionOnFailingStatusCode(false);
        webClient.setCssEnabled(false);
        webClient.setJavaScriptEnabled(false);
        webClient.setTimeout(10000);

        HtmlPage page = webClient.getPage("https://digital.utc.com/our-latest");

        try 
        {
            Thread.sleep(3000);
        }

        catch (InterruptedException e) 
        {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        page = page.getPage();
        String htmlContent2 = page.asXml();
        File htmlFile2 = new File("Website2_XML.html");
        PrintWriter pw2 = new PrintWriter(htmlFile2);
        pw2.print(htmlContent2);
        pw2.close();

        System.out.println(page.getTitleText());

        DomNodeList<HtmlElement> links = (DomNodeList<HtmlElement>) page.getElementsByTagName("a");

        for (HtmlElement domElement : links) 
        {
            System.out.println(domElement.getAttribute("href"));
            System.out.println();
        }

    }

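What I expect is for HtmlUnit to return all of the links on the web page that have an 'href' attribute.
The actual result returned by HtmlUnit is missing some links, even though the browser inspector shows them correctly.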

** The missing links are found on the right side of the form or list of articles that is retrieved from the database.

Answer 1

The only links without an href that I see (using this code) are anchors with an onClick handler. Can you add more details about what you are missing?

    final String url = "https://digital.utc.com/our-latest";

    try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setJavaScriptEnabled(false);

        HtmlPage page = webClient.getPage(url);
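        // Waits up to 4 seconds for background scripts; this only has an effect when JavaScript is enabled.
        webClient.waitForBackgroundJavaScript(4_000);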

        System.out.println(page.asXml());

        DomNodeList<DomElement> links = page.getElementsByTagName("a");
        for (DomElement domElement : links)
        {
            // Print the whole anchor so that links without an href are visible too.
            System.out.println(domElement.asXml());
        }
    }

As always, make sure you are using the latest SNAPSHOT version.

Update: I made a small fix to the media query handling to avoid the NPE you hit when running my code. Please use the latest SNAPSHOT version.
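
A note on the waiting itself: waitForBackgroundJavaScript only has something to wait for when JavaScript is enabled on the WebClient. With setJavaScriptEnabled(false), content that the page loads through scripts will not appear, however long you wait. If a fixed wait turns out to be unreliable once JavaScript is on, one option is to poll for a condition instead. The following is a minimal sketch in plain Java; WaitUtil and waitUntil are hypothetical names and not part of HtmlUnit, and the commented HtmlUnit call at the end assumes a link count that you would have to choose yourself.

```java
import java.util.function.BooleanSupplier;

final class WaitUtil {

    /**
     * Polls the condition until it becomes true or the timeout elapses.
     * Returns true if the condition was met in time, false on timeout
     * or if the waiting thread was interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a page whose content becomes available after about 200 ms.
        final long readyAt = System.currentTimeMillis() + 200;
        boolean loaded = waitUntil(() -> System.currentTimeMillis() >= readyAt, 2_000, 50);
        System.out.println("loaded = " + loaded);

        // Hypothetical HtmlUnit usage (assumes JavaScript is enabled and that
        // 20 is the number of anchors you expect on the finished page):
        // waitUntil(() -> page.getElementsByTagName("a").size() >= 20, 10_000, 250);
    }
}
```

Polling like this returns as soon as the content is there and gives a clear true/false result on timeout, which a fixed Thread.sleep does not.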

java

htmlunit