XMLStreamReader: how to work with nested elements of the same type
I am using XMLStreamReader to parse the following XML:
<root>
  <element>
    <attribute>level0</attribute>
    <element>
      <attribute>level1</attribute>
      <element>
        <attribute>level2</attribute>
      </element>
    </element>
  </element>
</root>
I construct my XMLStreamReader like this:
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
new ByteArrayInputStream(document.getBytes()));
Unfortunately, when I reach the first closing element tag with reader.next(), I get the following exception:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[7,14]
Message: XML document structures must start and end within the same entity.
Is there a way to override the default behavior of XMLStreamReader to work around this?
Edit
Here is the code I am using:
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
        throws IOException, InterruptedException {
    String document = value.toString();
    System.out.println("'" + document + "'");
    try {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
                new ByteArrayInputStream(document.getBytes()));
        String propertyName = "";
        String propertyValue = "";
        String currentElement = "";
        while (reader.hasNext()) {
            int code = reader.next();
            switch (code) {
                case START_ELEMENT:
                    currentElement = reader.getLocalName();
                    break;
                case CHARACTERS:
                    if (currentElement.equalsIgnoreCase("element")) {
                        propertyName += reader.getText();
                    } else if (currentElement.equalsIgnoreCase("attribute")) {
                        propertyValue += reader.getText();
                    }
                    break;
            }
        }
        reader.close();
        context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
    } catch (Exception e) {
        e.printStackTrace();
    }
}
There is nothing wrong with the sample XML document or with the StAX parser, which can be checked with the following code:
@Test
public void testSO_31815379() throws XMLStreamException, UnsupportedEncodingException {
    final String xml =
            "<root>\n" +
            "  <element>\n" +
            "    <attribute>level0</attribute>\n" +
            "    <element>\n" +
            "      <attribute>level1</attribute>\n" +
            "      <element>\n" +
            "        <attribute>level2</attribute>\n" +
            "      </element>\n" +
            "    </element>\n" +
            "  </element>\n" +
            "</root>";
    final XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    LOG.info("Using XMLStreamReader implementation: %s", reader.getClass().getName());
    reader.require(XMLStreamConstants.START_DOCUMENT, null, null);
    int event;
    while ((event = reader.next()) != XMLStreamConstants.END_DOCUMENT) {
        LOG.info(StaxUtils.eventDescription(reader));
    }
    reader.require(XMLStreamConstants.END_DOCUMENT, null, null);
    reader.close();
}
Output (StaxUtils.eventDescription is a custom helper method):
Using XMLStreamReader implementation: com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl
START_ELEMENT<{}root>
CHARACTERS=<whitespace>
START_ELEMENT<{}element>
CHARACTERS=<whitespace>
START_ELEMENT<{}attribute>
CHARACTERS='level0'
END_ELEMENT<attribute>
CHARACTERS=<whitespace>
START_ELEMENT<{}element>
CHARACTERS=<whitespace>
START_ELEMENT<{}attribute>
CHARACTERS='level1'
END_ELEMENT<attribute>
CHARACTERS=<whitespace>
START_ELEMENT<{}element>
CHARACTERS=<whitespace>
START_ELEMENT<{}attribute>
CHARACTERS='level2'
END_ELEMENT<attribute>
CHARACTERS=<whitespace>
END_ELEMENT<element>
CHARACTERS=<whitespace>
END_ELEMENT<element>
CHARACTERS=<whitespace>
END_ELEMENT<element>
CHARACTERS=<whitespace>
END_ELEMENT<root>
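Since the parser reports every nested <element> as an ordinary START_ELEMENT/END_ELEMENT pair, telling the levels apart is up to the calling code. Below is a minimal sketch of one way to do that, assuming you want each <attribute> value together with the nesting depth of its enclosing <element>. The class name NestedElementDemo and the method readLevels are made up for illustration and are not part of the question's code; the sketch keeps a stack of open element names instead of a single currentElement variable.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
class NestedElementDemo {

    /**
     * Returns one "depth:text" entry per <attribute>, where depth is the
     * zero-based nesting level of the enclosing <element>.
     */
    static List<String> readLevels(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Deque<String> openElements = new ArrayDeque<>(); // names of currently open tags
        List<String> result = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        int elementDepth = 0; // number of currently open <element> tags
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        openElements.push(reader.getLocalName());
                        if ("element".equals(reader.getLocalName())) {
                            elementDepth++;
                        }
                        text.setLength(0);
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        // Text may arrive in several chunks, so append rather than assign.
                        if ("attribute".equals(openElements.peek())) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        String closed = openElements.pop();
                        if ("attribute".equals(closed)) {
                            result.add((elementDepth - 1) + ":" + text.toString().trim());
                        } else if ("element".equals(closed)) {
                            elementDepth--;
                        }
                        break;
                    default:
                        break;
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><element><attribute>level0</attribute>"
                + "<element><attribute>level1</attribute>"
                + "<element><attribute>level2</attribute></element>"
                + "</element></element></root>";
        System.out.println(readLevels(xml)); // prints [0:level0, 1:level1, 2:level2]
    }
}
```

Because whitespace between tags also arrives as CHARACTERS events, the sketch only collects text while the innermost open tag is <attribute>. Note that this only works when the reader is given a complete, well-formed document; if the string handed to the parser has been cut off before the closing tags, any StAX parser will still fail.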