Jsoup fails to parse elements on very rare occasions
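I recently migrated the RSS parsing in my application from Rome to Jsoup. When parsing a feed straight from its source, Jsoup occasionally fails to parse < and > correctly, leaving &lt; and &gt; in the retrieved Document, which in turn causes problems when using Document::select.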
MCVE
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.io.IOException;
import java.util.Collection;

public class MCVE {

    public static void main(final String[] args) throws IOException {
        // Fetch the feed, parse it as XML, and print the text of every <pubDate> inside an <item>.
        Jsoup.connect("https://rss.packetstormsecurity.com/files/page18")
                .parser(Parser.xmlParser())
                .get()
                .select("item")
                .stream()
                .map(e -> e.select("pubDate"))
                .flatMap(Collection::stream)
                .map(Element::text)
                .forEach(System.out::println);
    }
}
The code above currently prints the following (the RSS feed is updated constantly, and the problem does not occur with a local file):
Wed, 22 Nov 2017 15:29:54 GMT
Wed, 22 Nov 2017 15:29:43 GMT
Wed, 22 Nov 2017 15:29:36 GMT
Wed, 22 Nov 2017 15:29:28 GMT
Wed, 22 Nov 2017 15:29:22 GMT
Wed, 22 Nov 2017 15:27:23 GMT
Tue, 21 Nov 2017 23:23:23 GMT
Tue, 21 Nov 2017 19:21:38 GMT
Tue, 21 Nov 2017 19:20:12 GMT
Tue, 21 Nov 2017 19:18:15 GMT
Tue, 21 Nov 2017 19:16:17 GMT
Tue, 21 Nov 2017 19:14:37 GMT
Tue, 21 Nov 2017 19:13:34 GMT
Tue, 21 Nov 2017 19:11:33 GMT
Tue, 21 Nov 2017 19:07:49 GMT
Tue, 21 Nov 2017 19:06:56 GMT
Tue, 21 Nov 2017 19:04:19 GMT
Tue, 21 Nov 2017 19:03:57 GMT
Tue, 21 Nov 2017 10:11:11 GMT
Tue, 21 Nov 2017 04:54:00 GMT
Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
Mon, 20 Nov 2017 22:22:00 GMT
Mon, 20 Nov 2017 16:16:00 GMT
Mon, 20 Nov 2017 16:15:00 GMT
Mon, 20 Nov 2017 16:14:00 GMT
This is a fragment of the Document that Jsoup returned to me:
<item>
  <title>Ubuntu Security Notice USN-3483-2</title>
  <link>
    https://packetstormsecurity.com/files/145055/USN-3483-2.txt
  </link>
  <guid isPermaLink="true">
    https://packetstormsecurity.com/files/145055/USN-3483-2.txt
  </guid>
  <comments>
    https://packetstormsecurity.com/files/145055/Ubuntu-Security-Notice-USN-3483-2.html
  </comments>
  <pubDate>
    Tue, 21 Nov 2017 04:04:00 GMT&lt;/pubDate&gt; <!-- the affected line -->
    <description>
      Ubuntu Security Notice 3483-2 - USN-3483-1 fixed a vulnerability in procmail. This update provides the corresponding update for Ubuntu 12.04 ESM. Jakub Wilk discovered that the formail tool incorrectly handled certain malformed mail messages. An attacker could use this flaw to cause formail to crash, resulting in a denial of service, or possibly execute arbitrary code. Various other issues were also addressed.
    </description>
    <category></category>
  </pubDate>
</item>
Here some characters have been parsed incorrectly, even though the XML served by the website is well formed.
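One way to spot affected entries programmatically is to check whether each pubDate text still parses as a single RFC 1123 date: the clean values printed earlier should pass, while the value that swallowed the closing tag and the description should not. This is only a sketch that uses java.time from the JDK. The names PubDateCheck and isCleanPubDate are hypothetical and are not part of Jsoup.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper (not part of Jsoup): flags pubDate values that picked up stray markup.
class PubDateCheck {

    // Returns true only if the whole trimmed text is one RFC 1123 date,
    // which is the format of the values printed by the MCVE.
    static boolean isCleanPubDate(final String text) {
        try {
            ZonedDateTime.parse(text.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            return true;
        } catch (final DateTimeParseException e) {
            // Trailing text such as "</pubDate> Ubuntu Security Notice ..." ends up here.
            return false;
        }
    }

    public static void main(final String[] args) {
        // A clean value taken from the output above.
        System.out.println(isCleanPubDate("Tue, 21 Nov 2017 04:54:00 GMT")); // prints true
        // The beginning of the mangled value from the output above.
        System.out.println(isCleanPubDate(
                "Tue, 21 Nov 2017 04:04:00 GMT</pubDate> Ubuntu Security Notice 3483-2")); // prints false
    }
}
```

In the first MCVE this predicate could be applied to the result of Element::text before printing, so that only the broken items are reported.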
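When the same URL is requested with a trailing slash (https://rss.packetstormsecurity.com/files/page18/), the problem no longer occurs on that page, but it does occur on a different page.
Because the feed is live, the page on which the problem occurs also changes over time. If the problem stops appearing on page 18, I will update the question with a new page. The problem also does not occur if the file is downloaded separately and then parsed with Jsoup::parse.
The Jsoup version is 1.11.2.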
Additional MCVE
This MCVE shows that the problem only occurs when the response is parsed with Jsoup; the XML that is actually downloaded is fine:
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.io.IOException;

public class MCVE {

    public static void main(final String[] args) throws IOException {
        final Connection.Response response =
                Jsoup.connect("https://rss.packetstormsecurity.com/files/page18").execute();
        // Well formed XML
        System.out.println(response.body());
        // Malformed XML
        System.out.println(response.parse());
    }
}
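To back the "well formed" observation with something stricter than reading the printed body, the string returned by response.body() could be run through the JDK's own XML parser. The following is a minimal sketch under that assumption. It uses javax.xml rather than Jsoup, and the names WellFormedCheck and isWellFormed are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Jsoup): checks well-formedness with the JDK XML parser.
class WellFormedCheck {

    // Returns true if the given text parses as well-formed XML.
    static boolean isWellFormed(final String xml) {
        try {
            final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // The input comes from the network, so enable the parser's secure-processing limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            final DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (final SAXException e) {
            // The parser rejected the input, for example because of a mismatched tag.
            return false;
        } catch (final ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(final String[] args) {
        // With the MCVE above, the argument would be response.body().
        System.out.println(isWellFormed(
                "<item><pubDate>Tue, 21 Nov 2017 04:04:00 GMT</pubDate></item>")); // prints true
        System.out.println(isWellFormed(
                "<item><pubDate>Tue, 21 Nov 2017 04:04:00 GMT</item>")); // prints false
    }
}
```

If this returns true for the body while the parsed Document still contains the escaped closing tag, that would support the suspicion that the fault lies in how the response is parsed rather than in the feed.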
This appears to be a bug in org.jsoup.helper.HttpConnection::get and org.jsoup.helper.HttpConnection.Response::parse. Here's my corresponding GitHub issue, and here's a repo that reproduces the bug.