Extract data between tags in an HTML page using HtmlUnit

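I am trying to extract data from a web page using HtmlUnit. I have already done this by converting the HTML page to text and then pulling the data out with regular expressions. I have also managed to extract data from an HTML table by using the class attribute in the HTML.

Now I would like to redo the whole extraction with HtmlUnit alone, covering the same requirement I previously met with regular expressions. I cannot work out how to extract the data inside the tags as key-value pairs.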

Here is the sample HTML data:

<div class="top_red_bar">
    <div id="site-breadcrumbs">
        <a href="/admin/index.jsp" title="Home">Home</a>
        &#124;
        <a href="/admin/queues.jsp" title="Queues">Queues</a>
        &#124;
        <a href="/admin/topics.jsp" title="Topics">Topics</a>
        &#124;
        <a href="/admin/subscribers.jsp" title="Subscribers">Subscribers</a>
        &#124;
        <a href="/admin/connections.jsp" title="Connections">Connections</a>
        &#124;
        <a href="/admin/network.jsp" title="Network">Network</a>
        &#124;
         <a href="/admin/scheduled.jsp" title="Scheduled">Scheduled</a>
        &#124;
        <a href="/admin/send.jsp"
           title="Send">Send</a>
    </div>
    <div id="site-quicklinks"><P>
        <a href="http://activemq.apache.org/support.html"
           title="Get help and support using Apache ActiveMQ">Support</a></p>
    </div>
</div>

<table border="0">
<tbody>
    <tr>
        <td valign="top" width="100%" style="overflow:hidden;">
            <div class="body-content">


<h2>Welcome!</h2>

<p>
Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)
</p>

<p>
You can find more information about Apache ActiveMQ on the <a href="http://activemq.apache.org/">Apache ActiveMQ Site</a>
</p>

<h2>Broker</h2>


<table>
    <tr>
        <td>Name</td>
        <td><b>localhost</b></td>
    </tr>
    <tr>
        <td>Version</td>
        <td><b>5.13.3</b></td>
    </tr>
    <tr>
        <td>ID</td>
        <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>
    </tr>
    <tr>
        <td>Uptime</td>
        <td><b>17 days 13 hours</b></td>
    </tr>
    <tr>
        <td>Store percent used</td>
        <td><b>19</b></td>
    </tr>
    <tr>
        <td>Memory percent used</td>
        <td><b>0</b></td>
    </tr>
    <tr>
        <td>Temp percent used</td>
        <td><b>0</b></td>
    </tr>
</table>

I want to extract the data between the table tags. Expected output:

Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0

How can I achieve this? I would like to know which HtmlUnit methods to use for it.

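Here are the steps I followed (this is not the only solution):

  1. Parse the string with the parseHtml method, passing a dummy URL
  2. Get the second table via XPath
  3. Iterate with a double nested loop (a for loop over the rows and an iterator over the cells, so that the separator is appended correctly)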

Extracting the table data:

import java.net.URL;

import com.gargoylesoftware.htmlunit.StringWebResponse;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HTMLParser;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow.CellIterator;


public class ExtractTableData {

    public static void main(String[] args) throws Exception {

        String html = "<div class=\"top_red_bar\">\n" + "                        <div id=\"site-breadcrumbs\">\n"
                + "                            <a href=\"/admin/index.jsp\" title=\"Home\">Home</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/queues.jsp\" title=\"Queues\">Queues</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/topics.jsp\" title=\"Topics\">Topics</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/subscribers.jsp\" title=\"Subscribers\">Subscribers</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/connections.jsp\" title=\"Connections\">Connections</a>\n"
                + "                            &#124;\n"
                + "                            <a href=\"/admin/network.jsp\" title=\"Network\">Network</a>\n"
                + "                            &#124;\n"
                + "                             <a href=\"/admin/scheduled.jsp\" title=\"Scheduled\">Scheduled</a>\n"
                + "                            &#124;\n" + "                            <a href=\"/admin/send.jsp\"\n"
                + "                               title=\"Send\">Send</a>\n" + "                        </div>\n"
                + "                        <div id=\"site-quicklinks\"><P>\n"
                + "                            <a href=\"http://activemq.apache.org/support.html\"\n"
                + "                               title=\"Get help and support using Apache ActiveMQ\">Support</a></p>\n"
                + "                        </div>\n" + "                    </div>\n" + "\n"
                + "                    <table border=\"0\">\n" + "                        <tbody>\n"
                + "                            <tr>\n"
                + "                                <td valign=\"top\" width=\"100%\" style=\"overflow:hidden;\">\n"
                + "                                    <div class=\"body-content\">\n" + "\n" + "\n"
                + "<h2>Welcome!</h2>\n" + "\n" + "<p>\n"
                + "Welcome to the Apache ActiveMQ Console of <b>localhost</b> (ID:TOOLCONTROLPJX526-524666-65544585445-2:3)\n"
                + "</p>\n" + "\n" + "<p>\n"
                + "You can find more information about Apache ActiveMQ on the <a href=\"http://activemq.apache.org/\">Apache ActiveMQ Site</a>\n"
                + "</p>\n" + "\n" + "<h2>Broker</h2>\n" + "\n" + "\n" + "<table>\n" + "    <tr>\n"
                + "        <td>Name</td>\n" + "        <td><b>localhost</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>Version</td>\n" + "        <td><b>5.13.3</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>ID</td>\n" + "        <td><b>ID:TOOLCONTROLPJX526-524666-65544585445-2:3</b></td>\n"
                + "    </tr>\n" + "    <tr>\n" + "        <td>Uptime</td>\n"
                + "        <td><b>17 days 13 hours</b></td>\n" + "    </tr>\n" + "    <tr>\n"
                + "        <td>Store percent used</td>\n" + "        <td><b>19</b></td>\n" + "    </tr>\n"
                + "    <tr>\n" + "        <td>Memory percent used</td>\n" + "        <td><b>0</b></td>\n"
                + "    </tr>\n" + "    <tr>\n" + "        <td>Temp percent used</td>\n" + "        <td><b>0</b></td>\n"
                + "    </tr>\n" + "</table>";
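        // WebClient is AutoCloseable; try-with-resources releases it once we are done.
        try (WebClient webClient = new WebClient()) {
            // Parse the in-memory string. The URL is only a placeholder base URL;
            // the markup itself comes from the string above.
            HtmlPage page = HTMLParser.parseHtml(
                    new StringWebResponse(html, new URL("http://dummy.url.for.parsing.com/")),
                    webClient.getCurrentWindow());

            // "//table" matches in document order: index 0 is the outer layout table,
            // index 1 is the nested broker table.
            final HtmlTable table = (HtmlTable) page.getByXPath("//table").get(1);

            for (final HtmlTableRow row : table.getRows()) {

                CellIterator cellIterator = row.getCellIterator();

                // Print the first cell as-is, then prefix every further cell with ":"
                // so that no separator is left dangling at the end of the line.
                if (cellIterator.hasNext()) {
                    System.out.print(cellIterator.next().asText());
                    while (cellIterator.hasNext()) {
                        System.out.print(":" + cellIterator.next().asText());
                    }
                }
                System.out.println();
            }
        }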

    }

}

Output:

Name:localhost
Version:5.13.3
ID:ID:TOOLCONTROLPJX526-524666-65544585445-2:3
Uptime:17 days 13 hours
Store percent used:19
Memory percent used:0
Temp percent used:0
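
Since the question asks for key-value pairs, one possible follow-up is to collect the cell texts into a map instead of printing them. The sketch below is only an illustration and does not call HtmlUnit: the nested lists are hard-coded stand-ins for the strings that asText() returns for each cell in the loop above, and the class name RowsToMap and the helper toKeyValues are hypothetical names, not part of any library. It assumes the first cell of each row is the key and everything after it is the value.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowsToMap {

    // Hypothetical helper: the first cell of a row becomes the key,
    // the remaining cells are joined with a space to form the value.
    public static Map<String, String> toKeyValues(List<List<String>> rows) {
        Map<String, String> result = new LinkedHashMap<>(); // preserves row order
        for (List<String> cells : rows) {
            if (cells.isEmpty()) {
                continue; // a row without cells has no key
            }
            String key = cells.get(0).trim();
            String value = String.join(" ", cells.subList(1, cells.size())).trim();
            result.put(key, value);
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in data: one inner list per table row, one string per cell.
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "localhost"),
                Arrays.asList("Version", "5.13.3"),
                Arrays.asList("ID", "ID:TOOLCONTROLPJX526-524666-65544585445-2:3"),
                Arrays.asList("Uptime", "17 days 13 hours"));

        Map<String, String> broker = toKeyValues(rows);
        for (Map.Entry<String, String> entry : broker.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }
}
```

Keeping the cells separate matters here because the ID value itself contains colons, so splitting the printed lines on ":" afterwards would cut that value apart. In the HtmlUnit loop, the same map could be filled by adding each cell's text to a list per row before calling the helper.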