Login and get Webpage using API HtmlUnit in Java

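I am trying to fetch a web page. I retrieve the form, the text input, the checkbox and the submit button so that I can fill them in from Java code.

First, I get these warnings (I think the ScriptEngine fails to load some scripts):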

oct 18, 2015 9:45:01 AM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
oct 18, 2015 9:45:01 AM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
oct 18, 2015 9:45:01 AM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.

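Anyway, after I fill in the inputs correctly from Java and call click() on the submit button, I do not get the page that should load after the submit. So, what am I missing?

This is the HTML code: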

<form name="form" method="post" action="Login.aspx?test=1" onsubmit="javascript:return doSomething_OnSubmit();" id="form">
//then there are some hidden inputs
//...
<input name="tax_code" type="text" maxlength="10" id="tax_code" style="color:Red;width:120px;" />
<input id="privacy" type="checkbox" name="privacy" onclick="activeConfirmButton()" />
//initially the confirm button is deactivated, after the checkbox is checked the confirm button is active with the onclick event added on it.
<input type="submit" name="Confirm" value="Confirm" onclick="javascript:Form_DoPostBack(new Form_DoPostBack())" id="Confirm" style="color:Blue;font-family:calibri;width:150px;Z-INDEX: 0" />

Here is the Java code:

try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) 
        {
            /* turn off htmlunit warnings */
            //java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

            //webClient.getOptions().setActiveXNative(true);
            //webClient.waitForBackgroundJavaScript(50000);

            // Get the first page
            final HtmlPage page1 = webClient.getPage("http://example.com/examples/Login.aspx?test=1");

            final HtmlForm form = page1.getFormByName("form");

            final HtmlTextInput taxCodeTextField = form.getInputByName("tax_code");
            final HtmlCheckBoxInput checkboxInput = form.getInputByName("privacy");
            final HtmlSubmitInput confirmButton = form.getInputByName("Confirm");

            //Setting textfield and checkbox
            taxCodeTextField.setValueAttribute("TAX_CODE");
            checkboxInput.setChecked(true);
            //onclick of the checkbox, to activate the confirm button
            checkboxInput.click();

            // onclick of the confirm button
            final HtmlPage page2 = confirmButton.click();

            WebResponse response = page2.getWebResponse();
            String content = response.getContentAsString();
            System.out.println("HTML SOURCE: "+content);

            }
            catch (Exception e) {
                // do not swallow failures silently; at least report them
                e.printStackTrace();
            }

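There are a few things to consider.

  • After the checkbox is clicked, the site redirects to the same page, so the HtmlUnit cache must be disabled.
  • The checkbox should be clicked only once: call .click() alone rather than .setChecked(true) followed by .click(), because a click toggles the state and the second call would uncheck it again.
  • The reload happens in the background, as the JavaScript setTimeout() in the checkbox's onclick handler shows, so the new page has to be fetched again after the click.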
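
To make the second point concrete, here is a tiny stand-in model of a checkbox. It is only a sketch: the CheckboxModel class is hypothetical and is not part of HtmlUnit. It just mirrors the fact that a click flips the checked state.

```java
// Hypothetical stand-in for a checkbox; NOT an HtmlUnit class.
class CheckboxModel {
    private boolean checked;

    // Sets the state directly, like setChecked(...) on a real checkbox input.
    void setChecked(boolean value) {
        checked = value;
    }

    // A click toggles the current state.
    void click() {
        checked = !checked;
    }

    boolean isChecked() {
        return checked;
    }

    // The sequence used in the question: ends up unchecked.
    static boolean setCheckedThenClick() {
        CheckboxModel box = new CheckboxModel();
        box.setChecked(true);
        box.click();
        return box.isChecked();
    }

    // A single click on an initially unchecked box: ends up checked.
    static boolean clickOnly() {
        CheckboxModel box = new CheckboxModel();
        box.click();
        return box.isChecked();
    }

    public static void main(String[] args) {
        System.out.println("setChecked(true) then click(): " + setCheckedThenClick());
        System.out.println("click() only: " + clickOnly());
    }
}
```

With the real page, the same reasoning means the confirm button is only activated when the checkbox ends up checked, which is why a single click() is used.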

The code below re-reads the page after the click and returns the result:

    try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {

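        // url, formName, taxCodeTextFieldName, checkboxInputName, confirmButtonName
        // and taxCode are placeholders to be defined by the caller

        // disable caching, because the checkbox click reloads the same URL
        webClient.getCache().setMaxSize(0);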

        // Get the first page
        final HtmlPage page1 = webClient.getPage(url);

        final HtmlForm form = page1.getFormByName(formName);

        final HtmlTextInput taxCodeTextField = form.getInputByName(taxCodeTextFieldName);
        HtmlCheckBoxInput checkboxInput = form.getInputByName(checkboxInputName);

        taxCodeTextField.type(taxCode);
        checkboxInput.click();

        // give the setTimeout() callback time to reload the page
        Thread.sleep(2000);

        // re-read the page that is now enclosed in the top-level window
        HtmlPage page2 = (HtmlPage) webClient.getTopLevelWindows().get(0).getEnclosedPage();

        HtmlSubmitInput confirmButton = page2.getFormByName(formName).getInputByName(confirmButtonName);

        final HtmlPage page3 = confirmButton.click();

        System.out.println(page3.asText());
    }
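
A fixed Thread.sleep(2000) can be either too short or needlessly long. One possible alternative, sketched here with plain JDK classes only, is to poll a condition until it holds or a timeout expires. The PollingWait class and its waitUntil method are hypothetical names introduced for this example, and the condition in main is just a timer standing in for "the page has reloaded".

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; not part of HtmlUnit.
class PollingWait {

    /**
     * Checks the condition every pollMillis until it is true or
     * timeoutMillis has elapsed. Returns true if the condition held in time,
     * false on timeout or if the waiting thread was interrupted.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        final long start = System.currentTimeMillis();
        // stand-in condition: becomes true after roughly 300 ms
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 300, 2000, 50);
        System.out.println("ready: " + ready);
    }
}
```

In the code above, the condition would be something observable on the re-read page, for example whether the confirm button has become active; what exactly to check depends on the site and is an assumption here. HtmlUnit's own webClient.waitForBackgroundJavaScript(timeoutMillis), which appears commented out in the question, may also be worth trying in place of the sleep.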