Using JDOM to read and write internal DTDs
This is a follow-up to this question.
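I have a simple DAISY DTBook XML file. (The specific DTD isn't important to my question, but it is an actual standard used in older audio books.) It contains XML from both the DTBook and MathML namespaces.
Note that the DOCTYPE declaration follows a convention I copied from the specification for MathML in DAISY: it uses a combined DTD that both references the external DTD for the DTBook standard and adds some internal ENTITY definitions for the MathML standard.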
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dtbook PUBLIC "-//NISO//DTD dtbook 2005-2//EN"
"http://www.daisy.org/z3986/2005/dtbook-2005-2.dtd"
[
<!ENTITY % MATHML.prefixed "INCLUDE" >
<!ENTITY % MATHML.prefix "m">
<!ENTITY % MATHML.Common.attrib
"xlink:href CDATA #IMPLIED
xlink:type CDATA #IMPLIED
class CDATA #IMPLIED
style CDATA #IMPLIED
id ID #IMPLIED
xref IDREF #IMPLIED
other CDATA #IMPLIED
xmlns:dtbook CDATA #FIXED 'http://www.daisy.org/z3986/2005/dtbook/'
dtbook:smilref CDATA #IMPLIED"
>
<!ENTITY % mathML2 PUBLIC "-//W3C//DTD MathML 2.0//EN"
"http://www.w3.org/Math/DTD/mathml2/mathml2.dtd"
>
%mathML2;
<!ENTITY % externalFlow "| m:math">
<!ENTITY % externalNamespaces "xmlns:m CDATA #FIXED
'http://www.w3.org/1998/Math/MathML'">
]
>
<dtbook xmlns="http://www.daisy.org/z3986/2005/dtbook/" xmlns:m="http://www.w3.org/1998/Math/MathML"
version="2005-2" xml:lang="eng">
<head></head>
<book>
<frontmatter><doctitle></doctitle></frontmatter>
<bodymatter>
<level1>
<p>Test</p>
<m:math xmlns:dtbook="http://www.daisy.org/z3986/2005/dtbook/"
id="math0001" dtbook:smilref="nativemathml.smil#math0001" altimg="nativemathml0001.png"
alttext="sigma-summation UnderScript i equals zero OverScript infinity EndScripts x Subscript i">
<m:mrow>
<m:mstyle displaystyle='true'>
<m:munderover>
<m:mo>∑</m:mo>
<m:mrow>
<m:mi>i</m:mi>
<m:mo>=</m:mo>
<m:mn>0</m:mn>
</m:mrow>
<m:mi>∞</m:mi>
</m:munderover>
<m:mrow>
<m:msub>
<m:mi>x</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:mstyle>
</m:mrow>
</m:math>
</level1>
</bodymatter>
<rearmatter><level1><p></p></level1></rearmatter>
</book>
</dtbook>
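I read the document in and print it back out with the following Java code. I started with JDOM 1.1.3 (a constraint of the larger project this is for), but I also tried JDOM 2.0.6.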
@Test
public void buildDTD2()
        throws IOException, JDOMException
{
    final PathMatchingResourcePatternResolver pmrpr = new PathMatchingResourcePatternResolver();
    final File file = pmrpr.getResource("daisy/mathmldtdtemplate.xml").getFile();
    final String uri = file.toURI().toString();
    final InputStream stream = new BufferedInputStream(new FileInputStream(file));
    final SAXBuilder saxBuilder = new SAXBuilder();
    saxBuilder.setValidation(true);
    saxBuilder.setFeature("http://apache.org/xml/features/validation/schema", true);
    // The stream is already buffered, so it is passed to the InputSource directly.
    final InputSource source = new InputSource(stream);
    // The system ID lets the parser resolve relative references against the file location.
    source.setSystemId(uri);
    final Document doc = saxBuilder.build(source);
    String xml2 = new XMLOutputter().outputString(doc);
    System.out.println(xml2);
    System.out.println("Internal Subset: " + doc.getDocType().getInternalSubset());
}
When I print getInternalSubset() with System.out.println on the last line, nothing is printed. When I print the whole document, I get this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE dtbook PUBLIC "-//NISO//DTD dtbook 2005-2//EN" "http://www.daisy.org/z3986/2005/dtbook-2005-2.dtd">
<dtbook xmlns="http://www.daisy.org/z3986/2005/dtbook/" xmlns:m="http://www.w3.org/1998/Math/MathML" version="2005-2" xml:lang="eng">
<head />
<book>
<frontmatter><doctitle /></frontmatter>
<bodymatter>
<level1>
<p>Test</p>
<m:math xmlns:dtbook="http://www.daisy.org/z3986/2005/dtbook/" id="math0001" dtbook:smilref="nativemathml.smil#math0001" altimg="nativemathml0001.png" alttext="sigma-summation UnderScript i equals zero OverScript infinity EndScripts x Subscript i" overflow="scroll">
<m:mrow>
<m:mstyle displaystyle="true">
<m:munderover>
<m:mo>∑</m:mo>
<m:mrow>
<m:mi>i</m:mi>
<m:mo>=</m:mo>
<m:mn>0</m:mn>
</m:mrow>
<m:mi>∞</m:mi>
</m:munderover>
<m:mrow>
<m:msub>
<m:mi>x</m:mi>
<m:mi>i</m:mi>
</m:msub>
</m:mrow>
</m:mstyle>
</m:mrow>
</m:math>
</level1>
</bodymatter>
<rearmatter><level1><p /></level1></rearmatter>
</book>
</dtbook>
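The entity definitions are gone! Am I missing some option that would let me keep them? How can I preserve them? While processing these files we may need to read them in and write them out many times without losing this DTD.
After further research, I found a solution on the jdom-interest list.
Add the statement saxBuilder.setExpandEntities(false); which, according to Laurent Bihanic, forces the registration of a DeclHandler.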
@Test
public void buildDTD2()
        throws IOException, JDOMException
{
    final PathMatchingResourcePatternResolver pmrpr = new PathMatchingResourcePatternResolver();
    final File file = pmrpr.getResource("daisy/mathmldtdtemplate.xml").getFile();
    final String uri = file.toURI().toString();
    final InputStream stream = new BufferedInputStream(new FileInputStream(file));
    final SAXBuilder saxBuilder = new SAXBuilder();
    saxBuilder.setValidation(true);
    saxBuilder.setFeature("http://apache.org/xml/features/validation/schema", true);
    // Added line: keeps the internal subset available on the DocType.
    saxBuilder.setExpandEntities(false);
    // The stream is already buffered, so it is passed to the InputSource directly.
    final InputSource source = new InputSource(stream);
    source.setSystemId(uri);
    final Document doc = saxBuilder.build(source);
    String xml2 = new XMLOutputter().outputString(doc);
    System.out.println(xml2);
    System.out.println("Internal Subset: " + doc.getDocType().getInternalSubset());
}
This works; the internal subset is now read in and printed after "Internal Subset:".
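For anyone wondering what "registering a DeclHandler" means at the SAX level, here is a minimal sketch using only the JDK's built-in SAX parser, not JDOM. It is not a description of JDOM's internals; it only illustrates that a SAX parser reports DTD entity declarations to a handler set through the standard declaration-handler property. The class name DeclHandlerSketch, the method collectEntities, and the tiny inline document are hypothetical names made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical illustration class; not part of JDOM or of the original code.
public class DeclHandlerSketch {

    /** Records each internal entity declaration the parser reports. */
    static class EntityCollector extends DefaultHandler2 {
        final List<String> declarations = new ArrayList<>();

        @Override
        public void internalEntityDecl(String name, String value) {
            declarations.add(name + "=" + value);
        }
    }

    /** Parses the XML text and returns the internal entity declarations seen. */
    static List<String> collectEntities(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EntityCollector collector = new EntityCollector();
        reader.setContentHandler(collector);
        // Standard SAX2 property; without it, declaration events are not delivered.
        reader.setProperty("http://xml.org/sax/properties/declaration-handler", collector);
        reader.parse(new InputSource(new StringReader(xml)));
        return collector.declarations;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc [\n"
                + "<!ENTITY % prefix \"m\">\n"
                + "<!ENTITY greeting \"hello\">\n"
                + "]>\n"
                + "<doc>&greeting;</doc>";
        for (String declaration : collectEntities(xml)) {
            System.out.println(declaration);
        }
    }
}
```

If the handler is never registered, the parser still uses the declarations to expand entities, but nothing downstream sees them, which is consistent with the empty internal subset observed above.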