Extract values from HTML tags using Java with jsoup
I'm a beginner and I'm using the jsoup library (jsoup-1.14.3).
I have this HTML:
<html><head><title>Alfresco Content Repository</title><style>body { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; }
table { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; }
.listingTable { border: solid black 1px; }
.textCommand { font-family: verdana; font-size: 10pt; }
.textLocation { font-family: verdana; font-size: 11pt; font-weight: bold; color: #2a568f; }
.textData { font-family: verdana; font-size: 10pt; }
.tableHeading { font-family: verdana; font-size: 10pt; font-weight: bold; color: white; background-color: #2a568f; }
.rowOdd { background-color: #eeeeee; }
.rowEven { background-color: #dddddd; }
</style></head>
<body>
<table cellspacing='2' cellpadding='3' border='0' width='100%'>
<tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr>
<tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'>
<tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr>
<tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/ED">ED</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>
<tr class='rowEven'><td class='textData'><a href="/alfresco/webdav/rep/FLOW%20CHART">FLOW CHART</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 27 Jun 2013 13:30:18 GMT</td></tr>
<tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/file">file</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Wed, 10 Nov 2021 13:16:49 GMT</td></tr>
</table></body></html>
I am trying to get the href of each <a> tag.
For example, given:
<table cellspacing='2' cellpadding='3' border='0' width='100%'>
<tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr>
<tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'>
<tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr>
<tr class='rowOdd'><td class='textData'><a href="/alfresco/webdav/rep/ED">ED</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>
I want to extract "/alfresco/webdav/rep/ED", "ED" and "Thu, 05 Jan 2017 11:11:14 GMT".
First, you need to parse the HTML String into a Document:
final Document document = Jsoup.parse(html);
Then you need to select all the <tr> tags that contain an <a> tag:
final Elements trElements = document.select("tr:has(a)");
After that, you need to loop over each <tr> tag found:
for (final Element trElement : trElements) {
//Do stuff
}
For each <tr> tag, you retrieve the href value of its <a> tag. But first, you need to retrieve the <a> tag itself:
final Element aElement = trElement.select("a").first();
Then retrieve the value of the href attribute of the <a> tag:
final String href = aElement.attr("href");
For the name, retrieve the text content of the <a> tag:
final String name = aElement.text();
For the date, you need to retrieve the fourth <td> tag of the <tr> tag (index 3, since indexes start at 0):
final Element dateTdElement = trElement.select("td").get(3);
Then simply retrieve its text to get the date:
final String date = dateTdElement.text();
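The date retrieved this way is still a plain string. If a real date value is needed, the text in this listing (for example "Thu, 05 Jan 2017 11:11:14 GMT") appears to follow the RFC 1123 format, so a minimal sketch, assuming every row uses that format, could parse it with the JDK's DateTimeFormatter.RFC_1123_DATE_TIME. The class name ListingDateSketch and the method parseModified are illustrative only and are not part of jsoup.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class ListingDateSketch {

    // Hypothetical helper: parses the text of the fourth <td>,
    // assuming it is an RFC 1123 date such as "Thu, 05 Jan 2017 11:11:14 GMT".
    // Throws DateTimeParseException if the text is in another format.
    static ZonedDateTime parseModified(final String dateText) {
        return ZonedDateTime.parse(dateText.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    public static void main(final String[] args) {
        final ZonedDateTime modified = parseModified("Thu, 05 Jan 2017 11:11:14 GMT");
        System.out.println(modified.toInstant()); // 2017-01-05T11:11:14Z
    }
}
```

Inside the loop, this would be called as parseModified(date). Rows whose fourth cell is empty or formatted differently would need their own handling.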
NB: the select() method accepts a CSS query. CSS selectors work here, including pseudo-selectors such as ':has()' and others. See the jsoup documentation for details.
Putting it all together:
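// Imports needed: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
public static void main(final String[] args) {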
final String html = "<html><head><title>Alfresco Content Repository</title><style>body { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; }\n" +
"table { font-family: Arial, Helvetica; font-size: 12pt; background-color: white; }\n" +
".listingTable { border: solid black 1px; }\n" +
".textCommand { font-family: verdana; font-size: 10pt; }\n" +
".textLocation { font-family: verdana; font-size: 11pt; font-weight: bold; color: #2a568f; }\n" +
".textData { font-family: verdana; font-size: 10pt; }\n" +
".tableHeading { font-family: verdana; font-size: 10pt; font-weight: bold; color: white; background-color: #2a568f; }\n" +
".rowOdd { background-color: #eeeeee; }\n" +
".rowEven { background-color: #dddddd; }\n" +
"</style></head>\n" +
"<body>\n" +
"<table cellspacing='2' cellpadding='3' border='0' width='100%'>\n" +
"<tr><td colspan='4' class='textLocation'>Directory listing for /rep</td></tr>\n" +
"<tr><td height='10' colspan='4'></td></tr></table><table cellspacing='2' cellpadding='3' border='0' width='100%' class='listingTable'>\n" +
"<tr><td class='tableHeading' width='*'>Name</td><td class='tableHeading' width='10%'>Size</td><td class='tableHeading' width='20%'>Type</td><td class='tableHeading' width='25%'>Modified Date</td></tr>\n" +
"<tr class='rowOdd'><td class='textData'><a href=\"/alfresco/webdav/rep/ED\">ED</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 05 Jan 2017 11:11:14 GMT</td></tr>\n" +
"<tr class='rowEven'><td class='textData'><a href=\"/alfresco/webdav/rep/FLOW%20CHART\">FLOW CHART</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Thu, 27 Jun 2013 13:30:18 GMT</td></tr>\n" +
"<tr class='rowOdd'><td class='textData'><a href=\"/alfresco/webdav/rep/file\">file</a></td><td class='textData'> </td><td class='textData'> </td><td class='textData'>Wed, 10 Nov 2021 13:16:49 GMT</td></tr>\n" +
"\n" +
"\n" +
"</table></body></html>";
final Document document = Jsoup.parse(html);
final Elements trElements = document.select("tr:has(a)");
for (final Element trElement : trElements) {
final Element aElement = trElement.select("a").first();
final String href = aElement.attr("href");
System.out.println("Href : " + href);
final String name = aElement.text();
System.out.println("Name : " + name);
final Element dateTdElement = trElement.select("td").get(3);
final String date = dateTdElement.text();
System.out.println("Date : " + date);
}
}
It prints the following:
Href : /alfresco/webdav/rep/ED
Name : ED
Date : Thu, 05 Jan 2017 11:11:14 GMT
Href : /alfresco/webdav/rep/FLOW%20CHART
Name : FLOW CHART
Date : Thu, 27 Jun 2013 13:30:18 GMT
Href : /alfresco/webdav/rep/file
Name : file
Date : Wed, 10 Nov 2021 13:16:49 GMT
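Note that the second Href in this output is percent-encoded (FLOW%20CHART), and that all three values are server-relative paths. As a rough sketch using only the JDK's java.net.URI, the value could be decoded or resolved against the server address. The base address http://localhost:8080/ is purely a placeholder, since the question does not say where Alfresco is hosted, and the names HrefSketch, decodedPath and absolute are illustrative.

```java
import java.net.URI;

class HrefSketch {

    // Returns the path with percent-escapes decoded, e.g. "%20" becomes a space.
    static String decodedPath(final String href) {
        return URI.create(href).getPath();
    }

    // Resolves a server-relative href against a base address.
    static String absolute(final String base, final String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(final String[] args) {
        final String href = "/alfresco/webdav/rep/FLOW%20CHART";
        System.out.println(decodedPath(href));
        // Placeholder host, not taken from the question
        System.out.println(absolute("http://localhost:8080/", href));
    }
}
```

jsoup can also build absolute URLs itself: if a base URI is passed to Jsoup.parse(html, baseUri), then aElement.absUrl("href") returns the resolved address.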