Extract text from html string using Jsoup with specific encoding

Question

Here is my HTML string:

String html = "<p><b>Annie's and Lärabar</b></p>";

After running this:

org.jsoup.nodes.Document doc = Jsoup.parse(html);
Element p= doc.select("p").first();
String s = p.text();
System.out.println(s);

The output is "Annie's and L?rabar".

The character "ä" has turned into a question mark.

My JVM's default encoding is "iso-8859-1", and it seems that Jsoup's default encoding is UTF-8. I want to force Jsoup.parse() to use "iso-8859-1" when it parses the html string.

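I read the API and googled for examples, but I could not find a single example showing that Jsoup.parse() can actually take a specific encoding when parsing a string.

Can anyone help? Thanks in advance!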

-Xin

Answer 1

You can set the charset on the document, as shown below:

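org.jsoup.nodes.Document doc = Jsoup.parse(html);
// Sets the charset the document uses when it is serialized (e.g. doc.html())
doc.charset(java.nio.charset.Charset.forName("ISO-8859-1"));
Element p = doc.select("p").first();
String s = p.text();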
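
A side note on where the "?" may come from: a Java String is already Unicode, so Jsoup.parse(String) has no byte encoding to pick, and the character is more likely lost when the text is encoded for output. Below is a minimal sketch of that effect using only the standard library (no Jsoup). The class name EncodingDemo and the method roundTrip are hypothetical, made up for illustration, and the "ä" is written as \u00e4 so the source file's own encoding does not matter.

```java
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: encodes the text to bytes in the given charset and
    // decodes it again, showing which characters survive that charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement bytes ("?" for US-ASCII).
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) throws Exception {
        String s = "Annie's and L\u00e4rabar"; // \u00e4 is the letter "ä"

        // US-ASCII cannot represent "ä", so it is replaced with "?"
        System.out.println(roundTrip(s, StandardCharsets.US_ASCII));

        // ISO-8859-1 can represent "ä", so the text survives unchanged.
        // Printing through a PrintStream with an explicit charset avoids
        // depending on the JVM default encoding.
        PrintStream out = new PrintStream(System.out, true, "ISO-8859-1");
        out.println(roundTrip(s, StandardCharsets.ISO_8859_1));
    }
}
```

Which charset name to pass to PrintStream depends on what your console actually expects; "ISO-8859-1" here is only an assumption taken from the question.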
