Exception when parsing XML file (Invalid byte 2 of 3-byte UTF-8 sequence)

I am trying to parse an XML file from an external source. The file contains invalid UTF-8 bytes.

I am using the following Java code:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringComments(true);
factory.setNamespaceAware(false);
DocumentBuilder documentBuilder = factory.newDocumentBuilder();
try (InputStream in = getMyInputStream()) {
    Document doc = documentBuilder.parse(new InputSource(in));
    ...
}

I get the following exception:

Caused by: org.xml.sax.SAXParseException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:262)
    at java.xml/com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
    ... 10 common frames omitted
Caused by: com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence.
    at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(UTF8Reader.java:702)
    at java.xml/com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(UTF8Reader.java:409)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(XMLEntityScanner.java:1904)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.peekChar(XMLEntityScanner.java:508)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2649)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:605)
    at java.xml/com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:534)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:888)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:824)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at java.xml/com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:246)

I realize that the XML contains invalid UTF-8 characters, but I would like the XML parser to handle this gracefully instead of throwing an exception.

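I solved this by passing a java.io.Reader to the DocumentBuilder instead of a java.io.InputStream. The DocumentBuilder now works on a character stream rather than a byte stream, so it no longer decodes (and validates) the bytes itself and does not throw the exception. The byte-to-character conversion is now done by the InputStreamReader, which substitutes the replacement character U+FFFD for malformed byte sequences instead of failing.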

So I changed

try (InputStream in = getMyInputStream()) {
   Document doc = documentBuilder.parse(new InputSource(in));
   ...
}

to

try (Reader reader = new InputStreamReader(getMyInputStream(), StandardCharsets.UTF_8)) {
   Document doc = documentBuilder.parse(new InputSource(reader));
   ...
}
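
For anyone who wants to reproduce the difference, here is a minimal, self-contained sketch. The class name LenientXmlParse, its helper methods, and the sample bytes (E2 28 A1 inside a tiny <root> element) are all made up for illustration; they stand in for getMyInputStream() and the real file. It assumes the JDK's built-in parser, as in the stack trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class LenientXmlParse {

    // Hypothetical sample data: 0xE2 starts a 3-byte UTF-8 sequence,
    // but 0x28 is not a valid continuation byte.
    static byte[] sampleWithInvalidUtf8() {
        byte[] head = "<root>a".getBytes(StandardCharsets.US_ASCII);
        byte[] bad = {(byte) 0xE2, (byte) 0x28, (byte) 0xA1};
        byte[] tail = "b</root>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(bad, 0, bad.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    // Same factory settings as in the question.
    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        factory.setIgnoringComments(true);
        factory.setNamespaceAware(false);
        return factory.newDocumentBuilder();
    }

    // Byte stream: the parser decodes the bytes itself and rejects malformed UTF-8.
    static Document parseFromStream(byte[] xml) throws Exception {
        try (InputStream in = new ByteArrayInputStream(xml)) {
            return newBuilder().parse(new InputSource(in));
        }
    }

    // Character stream: InputStreamReader decodes first and replaces
    // malformed sequences with U+FFFD, so the parser only sees characters.
    static Document parseFromReader(byte[] xml) throws Exception {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(xml), StandardCharsets.UTF_8)) {
            return newBuilder().parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = sampleWithInvalidUtf8();
        try {
            parseFromStream(xml);
            System.out.println("byte stream: parsed without error");
        } catch (SAXException e) {
            System.out.println("byte stream: " + e.getMessage());
        }
        Document doc = parseFromReader(xml);
        System.out.println("reader: " + doc.getDocumentElement().getTextContent());
    }
}
```

Two caveats with this approach: the invalid bytes are silently turned into U+FFFD, so that part of the data is lost, and because the parser is handed characters rather than bytes, the encoding named in the XML declaration is ignored. The charset given to the InputStreamReader has to be the right one.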