How to get specific sub-elements of HTML data using Jsoup
I am trying to use Jsoup to get all the prices from an HTML file. The simplified HTML structure looks like this:
//some html
<div class="price-point-wrap use-roundtrippricing">
<div class="price-point-wrap-top use-roundtrippricing">
<div class="pp-from-total use-roundtrippricing">Roundtrip</div>
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$509
</div>
<div class="fare-select-button-div">
<input type="button" aria-describedby="sr_product_ECONOMY_123-745|1975-UA" value="Select" class="fare-select-button">
<span class="visuallyhidden">fare for Economy (lowest)</span>
</div>
</div>
//some html
<div class="price-point-wrap use-roundtrippricing">
<div class="price-point-wrap-top use-roundtrippricing">
<div class="pp-from-total use-roundtrippricing">Roundtrip</div>
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$1,046
</div>
<div class="fare-select-button-div">
<input type="button" aria-describedby="sr_product_MIN-BUSINESS-OR-FIRST_123-745|1975-UA" value="Select" class="fare-select-button">
<span class="visuallyhidden">fare for First (2-cabin, lowest)</span>
</div>
<div class="pp-remaining-seats">5 tickets left at this price</div>
</div>
//some html
This is what I have tried so far:
File input = new File("Flights.html");
Document document = Jsoup.parse(input, "UTF-8", "");
Elements prices = document.getElementsByClass("price-point");
for (Element e : prices) {
    System.out.println(e.toString());
}
This gives me the following result:
<div class="price-point price-point-revised use-roundtrippricing">
$509
</div>
<div class="price-point price-point-revised use-roundtrippricing">
$1,046
</div>
.....
But what I want is just the prices, like this:
509
1046
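I tried a regular expression that keeps only the digits when printing, e.toString().replaceAll("\\D+",""), and it seems to work, but that is not how I want to solve this. How can I get just the numbers using Jsoup?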
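Thanks to @Eritrean's comment: I need to use e.text() instead of e.toString(), which gives me
$509
$1,046
I still need a regular expression such as e.text().replaceAll("[$,]", "") to strip the dollar sign and the thousands separator.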
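For anyone who wants the prices as numbers rather than strings, the cleanup step can be sketched as below. This is only a minimal sketch, not part of Jsoup: the class name PriceParser and the method toDollars are hypothetical, the sample strings stand in for what e.text() returns for each price div, and it assumes whole-dollar prices with no cents (a value like "$509.50" would be misread).

```java
class PriceParser {
    // Hypothetical helper: strips every non-digit character (currency symbol,
    // thousands separator, surrounding whitespace) and parses the remaining
    // digits as a whole-dollar amount. Assumes the price has no decimal part.
    static int toDollars(String priceText) {
        String digits = priceText.replaceAll("\\D+", "");
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("No digits in: " + priceText);
        }
        return Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        // Stand-ins for the strings e.text() would return for each price div.
        String[] samples = {"$509", "$1,046"};
        for (String text : samples) {
            System.out.println(toDollars(text));
        }
    }
}
```

Inside the original loop this would be called as PriceParser.toDollars(e.text()), so the regular expression only ever sees the element's text and never its markup.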