Java Jsoup - Element isn't removed from Elements
I'll start from the beginning. The HTML follows this pattern:
<div id="post_message_(some numeric id)">
<div style="some style things">
<div class="smallfont" style="some style">useless text</div>
<table cellpading="6" cellspaceing=.......> a lot of text inside i dont need</table>
</div>
Text i need
</div>
The styled div and the table are optional; sometimes there is only
<div id="post">
Text i need
</div>
I want to parse that text into a string. This is the code I'm using:
Elements divsInside = element.getElementById("post_message_" + id).getElementsByTag("div");
for (Element div : divsInside) {
    if (div != null && div.attr("style").equals("margin:20px; margin-top:5px; ")) {
        System.out.println(div.html());
        div.remove();
        System.out.println("div removed");
    }
}
I added those print lines to check whether it finds them, and yes, it finds the correct ones, but later when I parse it to a string:
String message = Jsoup.parse(divsInside.html().replaceAll("(?i)<br[^>]*>", "br2n")).text()
.replaceAll("br2n", "\n");
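the string, for some reason, again contains everything that was removed.
I tried removing them with an iterator, and with an indexed for loop that removes elements by index, but the result is the same.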
So you want to get "Text i need". Use Element's ownText() method, which, per its documentation, "Gets the text owned by this element only; does not get the combined text of all children."
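import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities.EscapeMode;

private static void test(String htmlFile) {
    try {
        File input = new File(htmlFile);
        Document doc = Jsoup.parse(input, "ASCII", "");
        doc.outputSettings().charset("ASCII");
        doc.outputSettings().escapeMode(EscapeMode.base);
        // Look up the post container by id (post_message_1 is an example id)
        Element specificIdDiv = doc.getElementById("post_message_1");
        if (specificIdDiv != null) {
            // ownText() returns only the text held directly by this div,
            // skipping the text of nested divs and tables
            System.out.println("content: " + specificIdDiv.ownText());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}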
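As for why the removed divs come back in the question's output, a likely explanation is that getElementsByTag collects its matches into a list up front, and div.remove() detaches the div from the document tree without taking it out of that list, so divsInside.html() still serializes it. Here is a minimal sketch of that behaviour. The class name RemoveDemo, its method names, and the sample markup are made up for illustration; only the jsoup calls come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class RemoveDemo {
    // Hypothetical sample markup modelled on the pattern in the question.
    static final String HTML =
          "<div id=\"post_message_1\">"
        + "<div style=\"margin:20px; margin-top:5px; \">"
        + "<div class=\"smallfont\">useless text</div>"
        + "</div>"
        + "Text i need"
        + "</div>";

    // Removes the styled div from the DOM; the Elements list is left as it was.
    private static void removeStyledDivs(Elements divs) {
        for (Element div : divs) {
            if (div.attr("style").equals("margin:20px; margin-top:5px; ")) {
                div.remove(); // detaches from its parent, not from `divs`
            }
        }
    }

    // Mirrors the question: serializes the list that was collected before removal.
    static String viaStaleList() {
        Document doc = Jsoup.parse(HTML);
        Element post = doc.getElementById("post_message_1");
        Elements divs = post.getElementsByTag("div"); // post itself plus both nested divs
        removeStyledDivs(divs);
        return divs.html(); // still includes the detached divs' contents
    }

    // Reads from the element itself after removal, so the tree is current.
    static String viaElement() {
        Document doc = Jsoup.parse(HTML);
        Element post = doc.getElementById("post_message_1");
        removeStyledDivs(post.getElementsByTag("div"));
        return post.text();
    }

    public static void main(String[] args) {
        System.out.println("stale list: " + viaStaleList());
        System.out.println("element:    " + viaElement());
    }
}
```

If that explanation holds, reading from the element after removal (post.text() or post.html()) or skipping removal altogether with ownText() avoids the problem; the list returned by getElementsByTag is best treated as a snapshot.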