jsoup: get elements' inner text by selector

I'm looking for help with selector patterns in jsoup. Basically, I'm adapting someone else's code to my needs.

For example, for href it is done like this:

Elements links = doc.select("a[href]");
for (Element link : links) {
    // get the value from href attribute
    System.out.println("\nlink : " + link.attr("href"));
    System.out.println("text : " + link.text());
}

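I have been looking at the Selector documentation here, but I'm not sure which selector to use: http://jsoup.org/apidocs/org/jsoup/select/Selector.html

I want to extract header/value pairs such as "Running Map Tasks, 1" from the following HTML: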

<hr>
<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>
<table border="1" cellpadding="5" cellspacing="0">
<tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>
<tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>
<br>
<hr>

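How do I get the text inside all of these tags?

I should also look for a heading such as "Cluster Summary", so that I can use it, or something similar, for the rest of the page at that URL: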
<h2 id="running_jobs">Running Jobs</h2>
<table border="1" cellpadding="5" cellspacing="0">
<thead><tr><th><b>Jobid</b></th><th><b>Priority</b></th><th><b>User</b></th><th><b>Name</b></th><th><b>Start Time</b></th><th><b>Map % Complete</b></th><th><b>Current Map Slots</b></th><th><b>Failed MapAttempts</b></th><th><b>MapAttempt Time Avg/Max</b></th><th><b>Cumulative Map CPU</b></th><th><b>Current Map PMem</b></th><th><b>Reduce % Complete</b></th><th><b>Current Reduce Slots</b></th><th><b>FailedReduce Attempts</b></th><th><b>ReduceAttempt Time Avg/Max</b></th><th><b>Cumulative Reduce CPU</b></th><th><b>Current Reduce PMem</b></th></tr>
</thead><tbody><tr><td id="job_0"><a href="jobdetails.jsp?jobid=job_201502130313_1511&refresh=30">job_201502130313_1511</a></td><td id="priority_0">NORMAL</td><td id="user_0">vdeadmin</td><td id="name_0">streamjob1942665573586845283.jar</td><td>Fri Feb 13 17:00:17 PST 2015</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td><a href="jobtasks.jsp?jobid=job_201502130313_1511&type=map&pagenum=1&state=running">1</a></td><td>0</td><td>0sec/0sec</td><td>1hrs, 30mins, 4sec</td><td>703.48 MB</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td>0</td><td>0</td><td>0sec/0sec</td><td>0sec</td><td> 0 KB</td></tr>

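Update to the question: my URL returns a long HTML page, and I need to be able to search within specific groups. I mean the search should go block by block: I don't want to find every tr in the HTML, only those in one particular table, and so on. For example, below I am trying to show only the results under id="running_jobs", and then some other results. When doing so, I should not get results from other parts of the HTML: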
<h2 id="running_jobs">Running Jobs</h2>
<table border="1" cellpadding="5" cellspacing="0">
<thead><tr><th><b>Jobid</b></th><th><b>Priority</b></th><th><b>User</b></th><th><b>Name</b></th><th><b>Start Time</b></th><th><b>Map % Complete</b></th><th><b>Current Map Slots</b></th><th><b>Failed MapAttempts</b></th><th><b>MapAttempt Time Avg/Max</b></th><th><b>Cumulative Map CPU</b></th><th><b>Current Map PMem</b></th><th><b>Reduce % Complete</b></th><th><b>Current Reduce Slots</b></th><th><b>FailedReduce Attempts</b></th><th><b>ReduceAttempt Time Avg/Max</b></th><th><b>Cumulative Reduce CPU</b></th><th><b>Current Reduce PMem</b></th></tr>
</thead><tbody><tr><td id="job_0"><a href="jobdetails.jsp?jobid=job_201502130313_1511&refresh=30">job_201502130313_1511</a></td><td id="priority_0">NORMAL</td><td id="user_0">vdeadmin</td><td id="name_0">streamjob1942665573586845283.jar</td><td>Fri Feb 13 17:00:17 PST 2015</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td><a href="jobtasks.jsp?jobid=job_201502130313_1511&type=map&pagenum=1&state=running">1</a></td><td>0</td><td>0sec/0sec</td><td>1hrs, 30mins, 4sec</td><td>703.48 MB</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td>0</td><td>0</td><td>0sec/0sec</td><td>0sec</td><td> 0 KB</td></tr>
</tbody></table>

You just need to know CSS selectors and how to use them.
In your case, to get the text in all "tr th" elements, you should use the following code:

Elements trThs = doc.select("tr th");
for(Element trTh : trThs)
    System.out.println("text : " + trTh.text());
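
To restrict the search to one block, as the update asks, one option is to find the heading first and then read only the table that directly follows it. The following is a minimal sketch, not a definitive implementation. The class name `TableByHeading` and the method `rowsAfter` are hypothetical names introduced for illustration, and the sketch assumes each table comes immediately after its `h2`, as in the HTML in the question. It walks direct children only, so the small progress-bar tables nested inside some `td` cells are not picked up as extra rows.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of jsoup.
public class TableByHeading {

    /**
     * Returns one header-to-value map per data row of the table that is the
     * next sibling element of the heading matched by headingSelector.
     * Returns an empty list if no such table exists.
     */
    public static List<Map<String, String>> rowsAfter(Document doc, String headingSelector) {
        List<Map<String, String>> rows = new ArrayList<>();

        // "A + B" matches a B element that immediately follows an A sibling.
        Element table = doc.select(headingSelector + " + table").first();
        if (table == null) {
            return rows;
        }

        List<String> headers = new ArrayList<>();
        // Direct children of the table are thead/tbody/tfoot
        // (the parser adds a tbody when the markup omits it).
        for (Element section : table.children()) {
            for (Element tr : section.children()) {
                if (!tr.tagName().equals("tr")) {
                    continue;
                }
                Elements cells = tr.children(); // th/td of this row only, not of nested tables
                if (cells.isEmpty()) {
                    continue;
                }
                if (cells.first().tagName().equals("th")) {
                    // Header row: remember the column names.
                    headers.clear();
                    for (Element th : cells) {
                        headers.add(th.text());
                    }
                } else {
                    // Data row: pair each cell with the header in the same column.
                    Map<String, String> row = new LinkedHashMap<>();
                    int n = Math.min(headers.size(), cells.size());
                    for (int i = 0; i < n; i++) {
                        row.put(headers.get(i), cells.get(i).text());
                    }
                    rows.add(row);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the page in the question.
        String html = "<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>"
                + "<table><tr><th>Running Map Tasks</th><th>Nodes</th></tr>"
                + "<tr><td>1</td><td><a href=\"machines.jsp?type=active\">8</a></td></tr></table>"
                + "<h2 id=\"running_jobs\">Running Jobs</h2>"
                + "<table><thead><tr><th><b>Jobid</b></th><th><b>Priority</b></th></tr></thead>"
                + "<tbody><tr><td>job_201502130313_1511</td><td>NORMAL</td></tr></tbody></table>";
        Document doc = Jsoup.parse(html);

        // Select by heading text, or by id when the heading has one.
        for (Map<String, String> row : rowsAfter(doc, "h2:contains(Cluster Summary)")) {
            row.forEach((header, value) -> System.out.println(header + "," + value));
        }
        for (Map<String, String> row : rowsAfter(doc, "h2#running_jobs")) {
            row.forEach((header, value) -> System.out.println(header + "," + value));
        }
    }
}
```

The `:contains(...)` pseudo-selector matches on the heading text, which is useful for "Cluster Summary" because that `h2` has no id. For "Running Jobs", selecting by id (`h2#running_jobs`) is likely the more stable choice if the heading text ever changes.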