解析 xml 文件时出现异常(3 字节 UTF-8 序列的字节 2 无效)
Exception when parsing xml file (Invalid byte 2 of 3-byte UTF-8 sequence)
我正在尝试从包含无效 UTF-8 字节的外部源解析 xml 文件
使用以下 java 代码
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
try (InputStream in = getMyInputStream()) {
Document doc = documentBuilder.parse(new InputSource(in));
...
}
我收到以下异常
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
... 10 common frames omitted
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:702)
at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:409)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1904)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(XMLEntityScanner.java:508)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2649)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)
我意识到 XML 包含无效的 UTF-8 字符,但我希望 XML 解析器能够优雅地处理这个问题,而不是抛出异常
我通过将 java.io.Reader
传递给 DocumentBuilder
而不是 java.io.InputStream
来解决这个问题。所以现在 DocumentBuilder
作用于字符流而不是字节流,并且不会尝试验证字节,因此不会抛出异常。字节到字符的转换现在由 InputStreamReader
完成
所以我改变了
try (InputStream in = getMyInputStream()) {
Document doc = documentBuilder.parse(new InputSource(in));
...
}
到
try (Reader reader = new InputStreamReader(getMyInputStream(), StandardCharsets.UTF_8)) {
Document doc = documentBuilder.parse(new InputSource(reader));
...
}
我正在尝试从包含无效 UTF-8 字节的外部源解析 xml 文件
使用以下 java 代码
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
try (InputStream in = getMyInputStream()) {
Document doc = documentBuilder.parse(new InputSource(in));
...
}
我收到以下异常
Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
... 10 common frames omitted
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:702)
at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:409)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1904)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(XMLEntityScanner.java:508)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2649)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)
我意识到 XML 包含无效的 UTF-8 字符,但我希望 XML 解析器能够优雅地处理这个问题,而不是抛出异常
我通过将 java.io.Reader
传递给 DocumentBuilder
而不是 java.io.InputStream
来解决这个问题。所以现在 DocumentBuilder
作用于字符流而不是字节流,并且不会尝试验证字节,因此不会抛出异常。字节到字符的转换现在由 InputStreamReader
所以我改变了
try (InputStream in = getMyInputStream()) {
Document doc = documentBuilder.parse(new InputSource(in));
...
}
到
try (Reader reader = new InputStreamReader(getMyInputStream(), StandardCharsets.UTF_8)) {
Document doc = documentBuilder.parse(new InputSource(reader));
...
}