How to improve Xerces parser performance when large comments are present in an XML file?
I am using the following code to parse an XML file with Xerces 2.11:
@Test
public void testXercesPerformance() throws IOException, SAXException, ParserConfigurationException
{
    final SAXParserFactory spf = SAXParserFactory.newInstance();
    final SAXParser parser = spf.newSAXParser();
    final XMLReader xmlReader = parser.getXMLReader();
    final InputSource inputSource = new InputSource(new BufferedInputStream(new FileInputStream(new File("./some.xml")), 8192));
    xmlReader.parse(inputSource);
}
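However, performance is very poor when the XML file contains only a few XML elements at the beginning and a very large comment at the end (total file size is about 10 MB). During parsing, the parser continuously allocates new strings, adding up to 1.3 TB of allocated strings in total (not all at the same time). The parse itself takes 4 minutes to complete.
The file I use for testing starts with: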
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<version>1.0-SNAPSHOT</version>
<artifactId>helloworld-secure</artifactId>
<dependencies>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<version>7.4.5.v20110725</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-security</artifactId>
<version>7.4.5.v20110725</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
<version>2.5</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>appassembler-maven-plugin</artifactId>
<version>1.1.1</version>
<executions>
<execution>
<phase>package</phase>
<goals><goal>assemble</goal></goals>
<configuration>
<assembleDirectory>target</assembleDirectory>
<programs>
<program>
<mainClass>HelloWorld</mainClass>
<name>webapp</name>
</program>
</programs>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
<!--
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<version>1.0-SNAPSHOT</version>
<artifactId>helloworld-secure</artifactId>
<dependencies>
It then repeats the dependencies from the uncommented part hundreds of times, until the file reaches nearly 10 MB, and ends with:
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>appassembler-maven-plugin</artifactId>
<version>1.1.1</version>
<executions>
<execution>
<phase>package</phase>
<goals><goal>assemble</goal></goals>
<configuration>
<assembleDirectory>target</assembleDirectory>
<programs>
<program>
<mainClass>HelloWorld</mainClass>
<name>webapp</name>
</program>
</programs>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
-->
What is the reason for the poor performance, and how should I configure the parser to improve it?
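For anyone who wants to reproduce the situation described above, a file of that shape (a small document followed by one huge trailing comment) could be generated with a sketch like the one below. The class name BigCommentXml, the filler line, and the temp-file name are placeholders chosen for illustration; they are not the original test setup.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BigCommentXml {

    /** Writes a tiny document followed by a single comment of at least commentChars characters. */
    public static void write(Path target, long commentChars) throws IOException {
        // Filler must not contain "--", which is illegal inside an XML comment.
        String filler = "<dependency><groupId>org.example</groupId></dependency>\n";
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            out.write("<project><modelVersion>4.0.0</modelVersion></project>\n");
            out.write("<!--\n");
            for (long written = 0; written < commentChars; written += filler.length()) {
                out.write(filler);
            }
            out.write("-->\n");
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("big-comment", ".xml");
        write(file, 10L * 1024 * 1024); // roughly 10 MB of comment text
        System.out.println(file + " : " + Files.size(file) + " bytes");
    }
}
```

Pointing the test method from the question at the generated file should make it possible to compare parser versions on the same input.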
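You can use a StreamFilter or an EventFilter with XMLInputFactory. These two interfaces let you intercept parse events before they are handed to the actual reader. StreamCommentRemovalFilter (defined further down) is a class that stops any comment from being passed on. I took your sample and built a 20 MB file, and it parsed quickly on my computer with the filter both enabled and disabled. My computer happens to be fast, though, so results may differ on a slower machine.
Imports, for convenience: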
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
// Create our factory and make sure it is namespace aware.
XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
xmlInputFactory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
// Create the filtered reader that will not let any comments through.
XMLStreamReader reader = xmlInputFactory.createFilteredReader(
        xmlInputFactory.createXMLStreamReader(new StreamSource(new File("./some.xml"))),
        new StreamCommentRemovalFilter());
// Transform our XMLStreamReader into a Document using a Transformer.
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
DOMResult result = new DOMResult();
transformer.transform(new StAXSource(reader), result);
Document document = (Document) result.getNode();
// do something with your document
The StreamFilter implementation, which keeps any comment from actually being turned into Java objects during parsing:
public static class StreamCommentRemovalFilter implements StreamFilter {
    @Override
    public boolean accept(XMLStreamReader reader) {
        // Reject comment events; accept everything else.
        return reader.getEventType() != XMLEvent.COMMENT;
    }
}
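The EventFilter variant mentioned at the start of this answer is not shown above, so here is a minimal sketch of how it might look on an in-memory document. The class name EventCommentFilterDemo, the helper countComments, and the sample XML string are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class EventCommentFilterDemo {

    /** Counts the comment events that reach the caller, with or without the filter. */
    public static int countComments(String xml, boolean filtered) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        if (filtered) {
            // Hypothetical filter: accept every event except comments.
            EventFilter noComments = event -> event.getEventType() != XMLEvent.COMMENT;
            reader = factory.createFilteredReader(reader, noComments);
        }
        int comments = 0;
        while (reader.hasNext()) {
            if (reader.nextEvent().getEventType() == XMLEvent.COMMENT) {
                comments++;
            }
        }
        reader.close();
        return comments;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<project><!-- inner --><name>demo</name></project><!-- trailing -->";
        System.out.println("unfiltered comments: " + countComments(xml, false));
        System.out.println("filtered comments:   " + countComments(xml, true));
    }
}
```

Note that a filter of either kind only hides events from the consumer. The underlying reader still has to scan the comment text, so whether this helps with the slowdown depends on the StAX implementation that ends up on the classpath.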
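Your sample also includes namespaces and a schema, so I assume you want to do some validation. If so, you can still do that with the DOMSource class and the Document parsed by the code above.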
final SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
final Source schemaSource = new StreamSource(new URL("http://maven.apache.org/xsd/maven-4.0.0.xsd").openStream());
final Schema schema = schemaFactory.newSchema(schemaSource);
schema.newValidator().validate(new DOMSource(document.getFirstChild()));
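This issue was reported as XERCESJ-970 in mid-2013 and has been fixed in revision 1507079 of xerces-j trunk (well, more than 10 years ago).
The problem is that the buffer in XMLStringBuffer grows linearly, so it has to be reallocated very often.
My workaround was to rebuild Xerces 2.11 with the patch from r1507079 applied.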
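To get a feel for why linear growth hurts so much, the following sketch simulates how many characters get copied by reallocations while a buffer is filled chunk by chunk. It is not the Xerces code: the class name BufferGrowthDemo, the chunk size, and the two growth policies (add one chunk versus double the capacity) are assumptions picked only to illustrate the effect.

```java
public class BufferGrowthDemo {

    /**
     * Simulates appending totalChars characters in pieces of size chunk and returns
     * how many characters are copied in total by buffer reallocations.
     * No memory is actually allocated; only the bookkeeping is simulated.
     */
    public static long copiedChars(long totalChars, int chunk, boolean geometric) {
        long capacity = chunk; // assumed initial capacity: one chunk
        long length = 0;
        long copied = 0;
        while (length < totalChars) {
            if (length + chunk > capacity) {
                copied += length; // the existing content moves to the new array
                capacity = geometric ? capacity * 2 : capacity + chunk;
            }
            length += chunk;
        }
        return copied;
    }

    public static void main(String[] args) {
        long total = 10L * 1024 * 1024; // about the size of the comment in the question
        int chunk = 32;                 // illustrative value, not taken from Xerces
        System.out.println("linear growth copies:    " + copiedChars(total, chunk, false));
        System.out.println("geometric growth copies: " + copiedChars(total, chunk, true));
    }
}
```

With a fixed increment, nearly every append triggers a copy of everything written so far, so the total work grows quadratically with the comment size. With doubling, the total copied stays proportional to the final size.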