Java XMLReader with continuous validation: undefined behaviour

I am parsing and validating (against an XSD) long XML files, which are always well-formed, and reporting every validation problem.

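My parser reports errors and carries on, with one strange exception: when a node (the parent) made up of several nodes (the children) fails validation on any one of its child nodes, parsing continues correctly for all the children, but validation stops until the next parent node begins.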

Consider this simple XSD:

<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

 <xsd:element name="customerDataFile">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="customerList"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="customerList">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="customerData" minOccurs="1" maxOccurs="unbounded"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="customerData">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="NameField1"/>
    <xsd:element ref="NameField2"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element type="name_field" name="NameField1"/>
 <xsd:element type="name_field" name="NameField2"/>

 <xsd:simpleType name="name_field">
  <xsd:restriction base="xsd:string">
    <xsd:maxLength value="45"/>
  </xsd:restriction>
 </xsd:simpleType>

</xsd:schema>

And these 5 example documents:

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
 <customerList>
  <customerData>
   <NameField1>Somecompany</NameField1>
   <NameField2>Somefirstname</NameField2>
  </customerData>
  <customerData>
   <NameField1>Somecompany</NameField1>
   <NameField2>Somefirstname</NameField2>
  </customerData>
 </customerList>
</customerDataFile>

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
 <customerList>
  <customerData>
   <Unknown1>Somecompany</Unknown1>
   <NameField2>Somefirstname</NameField2>
  </customerData>
  <customerData>
   <Unknown1>Somecompany</Unknown1>
   <NameField2>Somefirstname</NameField2>
  </customerData>
 </customerList>
</customerDataFile>

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
 <customerList>
  <customerData>
   <NameField1>Somecompany</NameField1>
   <Unknown2>Somefirstname</Unknown2>
  </customerData>
  <customerData>
   <NameField1>Somecompany</NameField1>
   <Unknown2>Somefirstname</Unknown2>
  </customerData>
 </customerList>
</customerDataFile>

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
 <customerList>
  <customerData>
   <Unknown1>Somecompany</Unknown1>
   <Unknown2>Somefirstname</Unknown2>
  </customerData>
  <customerData>
   <Unknown1>Somecompany</Unknown1>
   <Unknown2>Somefirstname</Unknown2>
  </customerData>
 </customerList>
</customerDataFile>

<?xml version="1.0" encoding="UTF-8"?>
<customerDataFile xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="customerDataFile.xsd">
 <customerList>
  <customerData>
   <Unknown2>Somefirstname</Unknown2>
  </customerData>
  <customerData>
   <Unknown1>Somecompany</Unknown1>
  </customerData>
 </customerList>
</customerDataFile>

The output is as follows:

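  1. No errors - correct
  2. 2 errors (one per customerData) - correct
  3. 2 errors (one per customerData) - correct
  4. 2 errors (only one per customerData) - incorrect
  5. 2 errors (even though the missing elements are a serious problem) - incorrect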

This is absurd; I can't find any reference to anything similar (and it certainly looks like a major problem).

The relevant code is:

public void process(String schemaLocation, String xmlLocation)
        throws SAXException, ParserConfigurationException, IOException {

    // Compile the XSD into a reusable Schema object.
    Source source = new StreamSource(new File(schemaLocation));
    SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    Schema schema = schemaFactory.newSchema(source);

    // Build a namespace-aware SAX parser that validates against the schema.
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setSchema(schema);
    spf.setNamespaceAware(true);
    SAXParser saxParser = spf.newSAXParser();

    // Handlers are registered on the underlying XMLReader, not on SAXParser.
    XMLReader xmlReader = saxParser.getXMLReader();
    xmlReader.setContentHandler(new CustomerHandler());
    xmlReader.setErrorHandler(new CustomerErrorHandler());

    // try-with-resources closes the stream even if parsing throws.
    try (InputStream inputStream = new FileInputStream(xmlLocation)) {
        InputSource is = new InputSource(new InputStreamReader(inputStream, StandardCharsets.UTF_8));
        is.setEncoding("UTF-8");
        xmlReader.parse(is);
    }
}

where CustomerErrorHandler is simply:

import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Prints every problem and never rethrows, so the parser keeps going.
public class CustomerErrorHandler implements ErrorHandler {

    @Override
    public void error(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

    @Override
    public void warning(SAXParseException e) throws SAXException {
        System.out.println(e.getMessage());
    }

}

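Does anyone have any pointers as to why this happens and what I am doing wrong and, most importantly, if this approach cannot work, how to do a thorough validation of an XML document correctly?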

This is not really an answer, more of a long comment:

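Continue-on-error is an extended feature, not really a standard one. The exact implementation is certainly there in the Xerces code base, but it may not be easy to figure out.

At the very least, what can be gathered from your tests above is this: on encountering a validation error in an element, Xerces ignores further validation errors until the end of that element, since there is no point in validating it any more (it is already invalid with respect to the schema). I am sure it would still detect well-formedness errors; you can try that. In effect it skips the rest of the element, moves on to the next one, and starts validating again.

This may well be the intended behaviour: because continue-on-error is not standard, I would guess the implementation works on a best-effort basis, where anything that cannot be validated is ignored and validation resumes at the next element.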
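One way to check this guess is to record where each reported error occurs. The sketch below is only an illustration under stated assumptions: it uses `javax.xml.validation.Validator` rather than the `SAXParserFactory.setSchema` setup from the question, it relies on whatever JAXP implementation ships with the JDK, and the class name `ValidationSketch`, the cut-down schema, and the two sample documents are made up for the example. It does not change the skipping behaviour; it only makes visible which elements were reported and which were passed over.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of the question's code.
class ValidationSketch {

    // Cut-down version of the question's schema (assumption: one parent, two children).
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='customerData'><xsd:complexType><xsd:sequence>"
      + "<xsd:element name='NameField1' type='xsd:string'/>"
      + "<xsd:element name='NameField2' type='xsd:string'/>"
      + "</xsd:sequence></xsd:complexType></xsd:element>"
      + "</xsd:schema>";

    static final String VALID =
        "<customerData>\n<NameField1>a</NameField1>\n<NameField2>b</NameField2>\n</customerData>";

    // Both children are unknown, as in example 4 of the question.
    static final String INVALID =
        "<customerData>\n<Unknown1>a</Unknown1>\n<Unknown2>b</Unknown2>\n</customerData>";

    // Validates xml against xsd and returns every reported problem with its position.
    static List<String> validate(String xsd, String xml) {
        List<String> problems = new ArrayList<>();
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) { problems.add(describe("warning", e)); }
                @Override
                public void error(SAXParseException e) { problems.add(describe("error", e)); }
                @Override
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            // Schema compilation failures and fatal (well-formedness) errors end up here.
            throw new IllegalStateException(e);
        }
        return problems;
    }

    // Formats one problem as "severity line:column message".
    static String describe(String severity, SAXParseException e) {
        return severity + " " + e.getLineNumber() + ":" + e.getColumnNumber() + " " + e.getMessage();
    }

    public static void main(String[] args) {
        for (String problem : validate(XSD, INVALID)) {
            System.out.println(problem);
        }
    }
}
```

Running it on a document with two bad children in the same parent and comparing the reported line numbers against the source should show whether the second child is ever reported. If it is not, that supports the idea that validation is suspended until the enclosing element ends.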