Java XMLReader with continuous validation: undefined behaviour
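I am parsing and validating (against an XSD) long XML files, which are always well-formed, and reporting all validation problems.
My parser reports errors and continues processing, with one strange exception: when a node (the parent) made up of several nodes (the children) fails validation on any child node, parsing continues correctly for all the children, but validation stops until the next parent node starts.
Consider this simple XSD: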
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="customerDataFile">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="customerList"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="customerList">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="customerData" minOccurs="1" maxOccurs="unbounded"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="customerData">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="NameField1"/>
<xsd:element ref="NameField2"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element type="name_field" name="NameField1"/>
<xsd:element type="name_field" name="NameField2"/>
<xsd:simpleType name="name_field">
<xsd:restriction base="xsd:string">
<xsd:maxLength value="45"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:schema>
And these 5 examples:
<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
<customerList>
<customerData>
<NameField1>Somecompany</NameField1>
<NameField2>Somefirstname</NameField2>
</customerData>
<customerData>
<NameField1>Somecompany</NameField1>
<NameField2>Somefirstname</NameField2>
</customerData>
</customerList>
</customerDataFile>
<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
<customerList>
<customerData>
<Unknown1>Somecompany</Unknown1>
<NameField2>Somefirstname</NameField2>
</customerData>
<customerData>
<Unknown1>Somecompany</Unknown1>
<NameField2>Somefirstname</NameField2>
</customerData>
</customerList>
</customerDataFile>
<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
<customerList>
<customerData>
<NameField1>Somecompany</NameField1>
<Unknown2>Somefirstname</Unknown2>
</customerData>
<customerData>
<NameField1>Somecompany</NameField1>
<Unknown2>Somefirstname</Unknown2>
</customerData>
</customerList>
</customerDataFile>
<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
<customerList>
<customerData>
<Unknown1>Somecompany</Unknown1>
<Unknown2>Somefirstname</Unknown2>
</customerData>
<customerData>
<Unknown1>Somecompany</Unknown1>
<Unknown2>Somefirstname</Unknown2>
</customerData>
</customerList>
</customerDataFile>
<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
<customerList>
<customerData>
<Unknown2>Somefirstname</Unknown2>
</customerData>
<customerData>
<Unknown1>Somecompany</Unknown1>
</customerData>
</customerList>
</customerDataFile>
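The output is as follows:
- No errors - correct
- 2 errors (one per customerData) - correct
- 2 errors (one per customerData) - correct
- 2 errors (only one per customerData) - incorrect
- 2 errors (even though a missing element is a serious problem) - incorrect
This is absurd; I cannot find any reference to anything similar (and it certainly looks like a major issue).
The relevant code is: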
public void process(String schemaLocation, String xmlLocation)
        throws SAXException, ParserConfigurationException, IOException {
    Source source = new StreamSource(new File(schemaLocation));
    SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = schemaFactory.newSchema(source);
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setSchema(schema);
    spf.setNamespaceAware(true);
    SAXParser saxParser = spf.newSAXParser();
    // The handlers are set on the underlying XMLReader; SAXParser itself has no such setters.
    XMLReader xmlReader = saxParser.getXMLReader();
    CustomerHandler handler = new CustomerHandler();
    CustomerErrorHandler errorHandler = new CustomerErrorHandler();
    InputStream inputStream = new FileInputStream(new File(xmlLocation));
    Reader reader = new InputStreamReader(inputStream, "UTF-8");
    InputSource is = new InputSource(reader);
    is.setEncoding("UTF-8");
    xmlReader.setContentHandler(handler);
    xmlReader.setErrorHandler(errorHandler);
    xmlReader.parse(is);
}
where CustomerErrorHandler is simply:
public class CustomerErrorHandler implements ErrorHandler {
    @Override
    public void error(SAXParseException arg0) throws SAXException {
        System.out.println(arg0.getMessage());
    }

    @Override
    public void fatalError(SAXParseException arg0) throws SAXException {
        System.out.println(arg0.getMessage());
    }

    @Override
    public void warning(SAXParseException arg0) throws SAXException {
        System.out.println(arg0.getMessage());
    }
}
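When comparing the five outputs, it may help to see where each report was raised, not just its message. The following is only a sketch of a handler that records the line and column carried by each SAXParseException; the class name LocatingErrorHandler and its reports() accessor are hypothetical and not part of the code above.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

/** Hypothetical handler: keeps every report together with its position in the document. */
final class LocatingErrorHandler implements ErrorHandler {
    private final List<String> reports = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("WARNING", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("ERROR", e);
    }

    @Override
    public void fatalError(SAXParseException e) {
        record("FATAL", e);
    }

    // Formats a report as "<level> <line>:<column> <message>".
    private void record(String level, SAXParseException e) {
        reports.add(level + " " + e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage());
    }

    /** Returns the reports in the order they were received. */
    List<String> reports() {
        return reports;
    }
}
```

With positions recorded, it is easier to tell which child element inside a given customerData produced the single error that does get reported.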
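Does anyone have any pointers on why this happens and what I am doing wrong and, most importantly, if this approach cannot work, how to correctly perform a full validation of an XML document?
This is not really an answer; it is more of a long comment:
Continue-on-error is an extension feature, not really a standard one. The exact implementation is certainly available in the Xerces code base, but it may not be easy to figure out. At the very least, what can be gathered from the tests above is that, on hitting a validation error in an element, Xerces ignores further validation errors until the end of that element (though I am sure it would still detect well-formedness errors, which you could try). The reasoning would be that there is no point in validating the element any further, since it is already invalid w.r.t. the schema. In effect it skips the rest of the element, moves on to the next element and resumes validating there. This may well be intended behaviour: because continue-on-error is not standard, I imagine the implementation works on a best-effort basis, so if something cannot be validated, it is ignored and validation is attempted on the next element.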
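One way to experiment with this outside the SAXParser setup is to run the document through javax.xml.validation.Validator with a handler that collects reports instead of printing them. The following is a minimal sketch, assuming the schema and the document are available as strings; the class name CollectingValidation and its validate method are made up for illustration. It is not guaranteed to report more errors than the code in the question, since the JDK's default validator is also Xerces-based.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: validates an XML string against an XSD string and collects the reports. */
final class CollectingValidation {

    static List<String> validate(String xsd, String xml) {
        List<String> problems = new ArrayList<>();
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    problems.add("WARNING line " + e.getLineNumber() + ": " + e.getMessage());
                }

                @Override
                public void error(SAXParseException e) {
                    // Recoverable validation error: record it and let validation continue.
                    problems.add("ERROR line " + e.getLineNumber() + ": " + e.getMessage());
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXParseException {
                    // Not well-formed: nothing sensible can follow, so stop here.
                    throw e;
                }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("Validation could not complete", e);
        }
        return problems;
    }
}
```

Running the five sample documents through validate and comparing the list sizes with the counts above would show whether the one-error-per-element behaviour comes from the validator itself or from the parser configuration.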