SAXParseException; systemId: cumulative size of entities exceeds bound

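Good morning,

I have to parse a huge XML file (2 GB) in Java. It has a lot of tags, but I only need to write the content of two of them, <title> and <text>, to a plain file each time, so I am using SAXParser.

So far I have managed to write 1M95 of text to the output file, and then this exception appears: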

org.xml.sax.SAXParseException; systemId: filePath; lineNumber: x; columnNumber: y; JAXP00010004 : La taille cumulée des entités est "50 000 001" et dépasse la limite de "50 000 000" définie par "FEATURE_SECURE_PROCESSING".
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
    at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
    at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1465)
    at com.sun.org.apache.xerces.internal.impl.XMLScanner.checkEntityLimit(XMLScanner.java:1544)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.handleCharacter(XMLDocumentFragmentScannerImpl.java:1940)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(XMLDocumentFragmentScannerImpl.java:1866)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3058)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:504)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:848)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:777)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:328)
    at Parsing.main(Class.java:38)

Translated into English, the exception message reads:

The cumulative size of the entities is "50 000 001" which exceeds the boundary of "50 000 000" defined by "FEATURE_SECURE_PROCESSING".

Here is the code I wrote:

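import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Parsing {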

public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException {

    try {
        File inputFile = new File(System.getProperty("user.dir") + "/input.xml");
        SAXParserFactory factory = SAXParserFactory.newInstance();
        SAXParser saxParser = factory.newSAXParser();
        UserHandler userhandler = new UserHandler();
        saxParser.parse(inputFile, userhandler);
    } catch (Exception e) {
        e.printStackTrace();
    }

}

public static void doThingOne(String text, String title) throws IOException {

    // Write the text and the title on a file
}


public static void doThingTwo(String text, String title) throws IOException {
    //Write the text and the title on another file

}

}

// Declared at top level so that the static main method can instantiate it.
class UserHandler extends DefaultHandler {

boolean bText = false;
boolean bTitle = false;
StringBuffer tagTextBuffer; 
StringBuffer tagTitleBuffer; 
String text = null;
String title = null;

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {

    if (qName.equals("title")) {
        tagTitleBuffer = new StringBuffer();
        bTitle = true;
    } else if (qName.equals("text")) {
        tagTextBuffer = new StringBuffer();
        bText = true;
    }
}

@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
    if (qName.equals("title")) {
        bTitle = false;
        // Read the title from the title buffer, not the text buffer.
        title = tagTitleBuffer.toString();

    } else if (qName.equals("text")) {
        text = tagTextBuffer.toString();
        bText = false;
        try {
            // Compare strings with equals(); == only compares references.
            if ("One".equals(title)) {
                Parsing.doThingOne(text, title);
            } else {
                Parsing.doThingTwo(text, title);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

@Override
public void characters(char[] ch, int start, int length) throws SAXException {
    // Append directly from the array; no intermediate String is needed.
    if (bTitle) {
        tagTitleBuffer.append(ch, start, length);
    } else if (bText) {
        tagTextBuffer.append(ch, start, length);
    }
}
}

Thank you for your time.

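  1. The limit exists to protect against "billion laughs" attacks. If you trust the source of the XML, you can switch off the SECURE_PROCESSING feature that imposes this limit.

  2. I generally recommend using Apache Xerces rather than the version bundled with the JDK.

  3. Your code for the characters() method is wrong: the content of a text or title element can be delivered split across multiple calls, so you need to accumulate a buffer in both cases.

  4. It would be good to know why the entity expansion limit is being reached. Does your document contain a large number of references to small entities, a few references to large entities, or something else? Do the entity references occur in the parts of the document you are interested in?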
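
Point 3 can be illustrated with a minimal sketch. It assumes a tiny in-memory document instead of the 2 GB file; the class name BufferedTextDemo, the handler PairHandler, and the "title|text" output format are hypothetical and exist only for this illustration. The key idea is that characters() always appends, and the buffer is read only in endElement():

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BufferedTextDemo {

    /** Collects "title|text" pairs; the pair format is only for illustration. */
    static class PairHandler extends DefaultHandler {
        final List<String> pairs = new ArrayList<>();
        private final StringBuilder buffer = new StringBuilder();
        private boolean capturing = false;
        private String title = null;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            if (qName.equals("title") || qName.equals("text")) {
                buffer.setLength(0); // start with an empty buffer for this element
                capturing = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may call this several times for one element, so always append.
            if (capturing) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                title = buffer.toString();
                capturing = false;
            } else if (qName.equals("text")) {
                pairs.add(title + "|" + buffer);
                capturing = false;
            }
        }
    }

    /** Parses an XML string and returns the collected pairs. */
    public static List<String> extract(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            PairHandler handler = new PairHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.pairs;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<pages><page><title>One</title><text>a &amp; b</text></page></pages>";
        System.out.println(extract(xml));
    }
}
```

The &amp; reference in the sample is there because text around an entity reference is a typical case where the content arrives in more than one characters() call.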

Turning off FEATURE_SECURE_PROCESSING has no effect (Java 8). To raise the limit, use:

System.setProperty("jdk.xml.totalEntitySizeLimit", String.valueOf(Integer.MAX_VALUE));

before the call to SAXParserFactory.newInstance().
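
To show where that line goes, here is a minimal sketch assuming a small in-memory document with one internal entity. The class name EntityLimitDemo and the helper countChars are hypothetical; only the property name and the ordering come from the answer above:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EntityLimitDemo {

    /** Parses the XML with the raised limit and returns the number of characters reported. */
    public static int countChars(String xml) {
        // Set the property before the factory is created, as described above.
        System.setProperty("jdk.xml.totalEntitySizeLimit", String.valueOf(Integer.MAX_VALUE));

        final int[] count = {0};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    count[0] += length; // expanded entity text is counted too
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE r [<!ENTITY e \"abc\">]><r>&e;&e;</r>";
        System.out.println(countChars(xml));
    }
}
```

A system property applies to the whole JVM, so this is only reasonable when every XML document the process parses comes from a trusted source.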