How to extract a text value using jsoup
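I am trying to extract the text that follows the element with CheckBoxIsChecked="t", using this selector:
p > w|Sdt[CheckBoxIsChecked$='t']
jsoup seems to ignore it, and I am not sure how to read the text that comes after the element.
I could do this in plain Java, but I am trying to keep it generic.
Is there something like:
p > w|Sdt[CheckBoxIsChecked$='t'] > first text after...
In this example, the desired value is:
I Need this value since CheckBoxIsChecked is true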
<p class="MsoNormal" style="margin-bottom:0in;margin-bottom:.0001pt;line-height:normal">
<w:Sdt CheckBox="t" CheckBoxIsChecked="t" >
<span style="font-family:&quot;MS Gothic&quot;">y</span>
</w:Sdt> I Need this value since CheckBoxIsChecked is true
<w:Sdt CheckBox="t" CheckBoxIsChecked="f" >
<span style="font-family:&quot;MS Gothic&quot;">n</span>
</w:Sdt> This is not needed since CheckBoxIsChecked is false
<w:Sdt CheckBox="t" CheckBoxIsChecked="f">
<span style="font-family:&quot;MS Gothic&quot;">n</span>
</w:Sdt> This is not needed since CheckBoxIsChecked is false<o:p/>
You can use the Element.ownText() method to extract the text next to a specific tag. Below is an example based on your sample:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Example {
    public static void main(String[] args) {
        // Sample markup from the question. The quotes around the font name are
        // written as &quot; so that the Java string literal compiles.
        String html = "<p class=\"MsoNormal\" style=\"margin-bottom:0in;margin-bottom:.0001pt;line-height:normal\">\n" +
                "<w:Sdt CheckBox=\"t\" CheckBoxIsChecked=\"t\" >\n" +
                "  <span style=\"font-family:&quot;MS Gothic&quot;\">y</span>\n" +
                "</w:Sdt> I Need this value since CheckBoxIsChecked is true \n" +
                "<w:Sdt CheckBox=\"t\" CheckBoxIsChecked=\"f\" >\n" +
                "  <span style=\"font-family:&quot;MS Gothic&quot;\">n</span>\n" +
                "</w:Sdt> This is not needed since CheckBoxIsChecked is false \n" +
                "<w:Sdt CheckBox=\"t\" CheckBoxIsChecked=\"f\">\n" +
                "  <span style=\"font-family:&quot;MS Gothic&quot;\">n</span>\n" +
                "</w:Sdt> This is not needed since CheckBoxIsChecked is false<o:p/>";

        Document doc = Jsoup.parse(html);

        // The default HTML parser lower-cases tag and attribute names, so the
        // selector uses lower case; "w|sdt" selects the namespaced tag <w:sdt>.
        doc.select("p > w|sdt[checkboxischecked=t]").forEach(it -> {
            // Text that belongs directly to the matched element
            String text = it.ownText();
            System.out.println(text);
        });
    }
}
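One thing to watch for: in the sample markup the wanted text comes after the closing </w:Sdt> tag, so in the parsed tree it may be a sibling text node of the matched element rather than part of the element's own text, in which case ownText() returns an empty string. The following is a minimal sketch, assuming that tree shape, which reads the node right after each matched element via Node.nextSibling(). The class name SiblingTextSketch, the method textAfterChecked, and the shortened HTML string are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class SiblingTextSketch {

    // Hypothetical helper: returns the text that directly follows each
    // <w:Sdt> element whose CheckBoxIsChecked attribute equals "t".
    public static List<String> textAfterChecked(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element sdt : doc.select("p > w|sdt[checkboxischecked=t]")) {
            // Assumption: the wanted text is the node immediately after the element
            Node next = sdt.nextSibling();
            if (next instanceof TextNode) {
                String text = ((TextNode) next).text().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened version of the markup from the question
        String html = "<p>"
                + "<w:Sdt CheckBox=\"t\" CheckBoxIsChecked=\"t\"><span>y</span></w:Sdt> wanted text "
                + "<w:Sdt CheckBox=\"t\" CheckBoxIsChecked=\"f\"><span>n</span></w:Sdt> unwanted text"
                + "</p>";
        textAfterChecked(html).forEach(System.out::println);
    }
}
```

If neither approach returns the text, printing doc.body().html() shows how jsoup actually nested the nodes, which tells you whether the text ended up inside the element or next to it.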