Facing org.xml.sax.SAXParseException in XML parsing
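I wrote a scheduler in a Java Spring Boot application that runs every hour, and it worked fine for a month. Today it started throwing an exception while parsing. I suspect the XML: either the data I fetch is corrupted, or it has changed in some small way that I cannot figure out.
Please note: I cannot change the source data.
Here is my code: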
@Scheduled(fixedRate = 1*60*60*1000 , initialDelay = 10*1000)
public String updateNewsFeed() {
    try {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        String URL = "https://nation.com.pk/rss/coronavirus";
        // Fetches and parses the feed; this is the call that throws the SAXParseException
        Document doc = db.parse(URL);
        List<NewsFeed> newsFeedList = parseNewsItemsToList(doc);
        return "Works fine";
    } catch (Exception ex) {
        return ex.getMessage();
    }
}

public List<NewsFeed> parseNewsItemsToList(Document doc) throws Exception {
    doc.getDocumentElement().normalize();
    NodeList nodes = doc.getElementsByTagName("item");
    List<NewsFeed> newsFeedList = new ArrayList<>();
    for (int i = 0; i < nodes.getLength(); i++) {
        Element element = (Element) nodes.item(i);
        NodeList title = element.getElementsByTagName("title");
        NodeList link = element.getElementsByTagName("link");
        NodeList description = element.getElementsByTagName("description");
        NodeList pubDate = element.getElementsByTagName("pubDate");
        NodeList guid = element.getElementsByTagName("guid");
        // Load the article page to look up its image
        org.jsoup.nodes.Document htmlDoc = Jsoup.connect(link.item(0).getTextContent().trim()).get();
        /*Elements pngs = htmlDoc.select("picture");
        System.out.println("\nimg link:"+pngs.toString());*/
        // The backslash must be doubled inside a Java string literal
        String image = htmlDoc.select("picture").select("img[src~=(?i)\\.(png|jpe?g)]").attr("src").trim();
        newsFeedList.add(new NewsFeed(
            title.item(0).getTextContent().trim(),
            description.item(0).getTextContent().trim(),
            pubDate.item(0).getTextContent().trim(),
            guid.item(0).getTextContent().trim(),
            image,
            link.item(0).getTextContent().trim()
        ));
    }
    return newsFeedList;
}
Here is the error message:
[Fatal Error] coronavirus:195:32: The entity name must immediately follow the '&' in the entity reference.
org.xml.sax.SAXParseException; systemId: https://nation.com.pk/rss/coronavirus; lineNumber: 195; columnNumber: 32; The entity name must immediately follow the '&' in the entity reference.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:258)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
    at com.i2p.covid19.service.NewsFeedService.updateNewsFeed(NewsFeedService.java:87)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
    at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
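The problem is the ampersand character (&) in the XML:
<category>Lifestyle & Entertainment</category>
A bare & is illegal in an XML document outside a CDATA section. It must be written as &amp;, but the producer of the XML document did not escape the & character.
If you replace & with &amp;, it will work.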
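Since the source data cannot be changed, one way to apply that replacement is to fetch the feed as text, escape the bare ampersands, and only then hand it to the parser. The following is a minimal sketch of that idea, not a definitive fix: the class name FeedSanitizer and its methods are hypothetical, the entity-name pattern is a simplification of the XML name rules, and the regex would also rewrite an & inside a CDATA section, so it assumes the feed has none.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: escapes bare ampersands before DOM parsing.
public class FeedSanitizer {

    // Matches an '&' that does NOT start a named (&amp;), decimal (&#38;)
    // or hexadecimal (&#x26;) entity reference.
    private static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    public static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    // Parses XML text after repairing it, instead of letting the parser fetch the URL itself.
    public static Document parse(String rawXml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(escapeBareAmpersands(rawXml))));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the downloaded feed body (assumed sample, modeled on the failing line).
        String raw = "<rss><channel><item>"
                + "<category>Lifestyle & Entertainment</category>"
                + "</item></channel></rss>";
        Document doc = parse(raw);
        // prints "Lifestyle & Entertainment"
        System.out.println(doc.getElementsByTagName("category").item(0).getTextContent());
    }
}
```

In the scheduler, the feed body would first have to be downloaded as a String (for example with java.net.URL.openStream()) and passed to parse(...) in place of db.parse(URL). References that are already escaped, such as &amp; or &#38;, are left untouched.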
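Use the ROMETOOLS library (https://rometools.github.io/rome/)
If your goal is to process RSS feeds, I recommend the rome library, which handles special characters like & and is simple to use. See https://www.baeldung.com/rome-rss
The following snippet prints International News, taken from the RSS feed's <title> tag: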
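import java.net.URL;
import com.rometools.rome.feed.synd.SyndFeed;
import com.rometools.rome.io.SyndFeedInput;
import com.rometools.rome.io.XmlReader;

URL feedSource = new URL("https://nation.com.pk/rss/coronavirus");
SyndFeedInput input = new SyndFeedInput();
SyndFeed feed = input.build(new XmlReader(feedSource));
System.out.println(feed.getTitle());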