getting SAXParseException not well-formed (invalid token), unable to resolve issue
I need to parse a very large XML file in Scrapy. It looks something like this:
<Result>
<Node>
<browseNodeId>306533011</browseNodeId>
<browseNodeAttributes count="1">
<attribute name="item_type_keyword">temperature-controllers</attribute>
</browseNodeAttributes>
<browseNodeName>Temperature Controllers</browseNodeName>
<browseNodeStoreContextName>Temperature Controllers</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,306533011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Controllers</browsePathByName>
<hasChildren>false</hasChildren>
<childNodes count="0"/>
<productTypeDefinitions>TEMPERATURE_CONTROLLER</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
<Node>
<browseNodeId>9931457011</browseNodeId>
<browseNodeAttributes count="1">
<attribute name="item_type_keyword">industrial-and-scientific-temperature-indicators</attribute>
</browseNodeAttributes>
<browseNodeName>Temperature Indicators</browseNodeName>
<browseNodeStoreContextName>Temperature Indicators</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,9931457011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Indicators</browsePathByName>
<hasChildren>false</hasChildren>
<childNodes count="0"/>
<productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
<Node>
<browseNodeId>5006547011</browseNodeId>
<browseNodeAttributes count="1">
<attribute name="item_type_keyword">industrial-temperature-sensors</attribute>
</browseNodeAttributes>
<browseNodeName>Temperature Probes & Sensors</browseNodeName>
<browseNodeStoreContextName>Temperature Probes & Sensors</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,5006547011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Probes & Sensors</browsePathByName>
<hasChildren>false</hasChildren>
<childNodes count="0"/>
<productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
<Node>
<browseNodeId>9931455011</browseNodeId>
<browseNodeAttributes count="1">
<attribute name="item_type_keyword">thermal-imagers</attribute>
</browseNodeAttributes>
<browseNodeName>Thermal Imagers</browseNodeName>
<browseNodeStoreContextName>Thermal Imagers</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,9931455011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermal Imagers</browsePathByName>
<hasChildren>false</hasChildren>
<childNodes count="0"/>
<productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
<Node>
<browseNodeId>393280011</browseNodeId>
<browseNodeAttributes count="0"/>
<browseNodeName>Thermometers</browseNodeName>
<browseNodeStoreContextName>Thermometers</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,393280011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers</browsePathByName>
<hasChildren>true</hasChildren>
<childNodes count="4">
<id>393282011</id>
<id>393284011</id>
<id>393283011</id>
<id>9931459011</id>
</childNodes>
<productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
<Node>
<browseNodeId>393282011</browseNodeId>
<browseNodeAttributes count="1">
<attribute name="item_type_keyword">industrial-and-scientific-dial-thermometers</attribute>
</browseNodeAttributes>
<browseNodeName>Dial Thermometers</browseNodeName>
<browseNodeStoreContextName>Dial Thermometers</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,393280011,393282011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Dial Thermometers</browsePathByName>
<hasChildren>false</hasChildren>
<childNodes count="0"/>
<productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
<Node>
<browseNodeId>393284011</browseNodeId>
<browseNodeAttributes count="1">
<attribute name="item_type_keyword">science-lab-digital-thermometers</attribute>
</browseNodeAttributes>
<browseNodeName>Digital Thermometers</browseNodeName>
<browseNodeStoreContextName>Lab Digital Thermometers</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,393280011,393284011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Digital Thermometers</browsePathByName>
<hasChildren>false</hasChildren>
<childNodes count="0"/>
<productTypeDefinitions>LAB_SUPPLY</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
<Node>
<browseNodeId>393283011</browseNodeId>
<browseNodeAttributes count="1">
<attribute name="item_type_keyword">industrial-and-scientific-glass-thermometers</attribute>
</browseNodeAttributes>
<browseNodeName>Glass Thermometers</browseNodeName>
<browseNodeStoreContextName>Glass Thermometers</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,393280011,393283011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Glass Thermometers</browsePathByName>
<hasChildren>false</hasChildren>
<childNodes count="0"/>
<productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
<Node>
<browseNodeId>9931459011</browseNodeId>
<browseNodeAttributes count="1">
<attribute name="item_type_keyword">infrared-thermometers</attribute>
</browseNodeAttributes>
<browseNodeName>Infrared Thermometers</browseNodeName>
<browseNodeStoreContextName>Infrared Thermometers</browseNodeStoreContextName>
<browsePathById>16310091,16310161,256409011,5006566011,393280011,9931459011</browsePathById>
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Infrared Thermometers</browsePathByName>
<hasChildren>false</hasChildren>
<childNodes count="0"/>
<productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions>
<refinementsInformation count="0"/>
</Node>
</Result>
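It fails with the error xml.sax._exceptions.SAXParseException: nodes.xml:11:38: not well-formed (invalid token).
Because the XML file is very large, replacing every ampersand by hand is not an option.
I have not implemented this in Scrapy yet, but here is a simple class for reference. How can I solve this without replacing every ampersand?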
import xml.sax


class ABContentHandler(xml.sax.ContentHandler):
    """Prints every SAX event it receives."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)

    def startElement(self, name, attrs):
        print("startElement '" + name + "'")
        if name == "address":
            print("\tattribute type='" + attrs.getValue("type") + "'")

    def endElement(self, name):
        print("endElement '" + name + "'")

    def characters(self, content):
        print("characters '" + content + "'")


def main(sourceFileName):
    source = open(sourceFileName)
    xml.sax.parse(source, ABContentHandler())


if __name__ == "__main__":
    main("nodes.xml")
Output
startElement 'Result'
characters '
'
characters ' '
startElement 'Node'
characters '
'
characters ' '
startElement 'browseNodeId'
characters '306533011'
endElement 'browseNodeId'
characters '
'
characters ' '
startElement 'browseNodeAttributes'
characters '
'
characters ' '
startElement 'attribute'
characters 'temperature-controllers'
endElement 'attribute'
characters '
'
characters ' '
endElement 'browseNodeAttributes'
characters '
'
characters ' '
startElement 'browseNodeName'
characters 'Temperature Controllers'
endElement 'browseNodeName'
characters '
'
characters ' '
startElement 'browseNodeStoreContextName'
characters 'Temperature Controllers'
endElement 'browseNodeStoreContextName'
characters '
'
characters ' '
Traceback (most recent call last):
File "/home/gtac/sax/parser.py", line 26, in <module>
main("nodes.xml")
File "/home/gtac/sax/parser.py", line 23, in main
xml.sax.parse(source, ABContentHandler())
File "/usr/lib/python2.7/xml/sax/__init__.py", line 33, in parse
parser.parse(source)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/usr/lib/python2.7/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/usr/lib/python2.7/xml/sax/expatreader.py", line 214, in feed
self._err_handler.fatalError(exc)
File "/usr/lib/python2.7/xml/sax/handler.py", line 38, in fatalError
raise exception
xml.sax._exceptions.SAXParseException: nodes.xml:11:38: not well-formed (invalid token)
startElement 'browsePathById'
characters '16310091,16310161,256409011,5006566011,306533011'
endElement 'browsePathById'
characters '
'
characters ' '
startElement 'browsePathByName'
characters 'Industrial '
Process finished with exit code 1
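The error message shows the line and column of the problem. It is the & in
<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Controllers</browsePathByName>
A bare & on its own is not valid XML, because & starts an entity reference.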
The W3C Recommendation, in section 2.4 Character Data and Markup, says:
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
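As a small illustration of that rule, this is roughly what a well-behaved producer would do before writing text into an XML document. The sketch uses escape and unescape from Python's standard xml.sax.saxutils module; the sample string is taken from the question, and the variable names are only for illustration.

```python
from xml.sax.saxutils import escape, unescape

# Text as a human would write it, containing a literal ampersand.
raw = "Test, Measure & Inspect"

# escape() replaces &, < and > with their entity references,
# which is the form that must appear inside element content.
escaped = escape(raw)

# unescape() reverses the substitution, which is what a parser
# hands back to the application when it reads the escaped text.
restored = unescape(escaped)

print(escaped)
print(restored == raw)
```

If the producer had written the escaped form, the SAX parser in the question would have delivered the original text, ampersand included, to characters().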
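The correct fix is to tell the author of the XML that their output is invalid and that they must fix it.
Otherwise you have to pre-process the text first and replace every stand-alone & with &amp;.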
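The pre-processing does not have to be done by hand or on the whole file at once. Below is a minimal sketch, assuming the file can be read line by line and contains no CDATA sections or comments with literal ampersands (the regular expression would wrongly escape those). The names BARE_AMP, escape_bare_ampersands, TextCollector and parse_lenient are made up for this example; make_parser(), feed() and close() are the standard incremental parsing calls in xml.sax.

```python
import re
import xml.sax

# An "&" that does NOT begin a named entity (&amp;), a decimal character
# reference (&#38;) or a hexadecimal character reference (&#x26;).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace stand-alone ampersands with &amp; and leave references alone."""
    return BARE_AMP.sub("&amp;", text)


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that just records character data."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_lenient(lines, handler):
    """Feed repaired lines to the SAX parser one at a time."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for line in lines:
        parser.feed(escape_bare_ampersands(line))
    parser.close()


if __name__ == "__main__":
    sample = [
        "<Result>\n",
        "<browsePathByName>Industrial & Scientific</browsePathByName>\n",
        "</Result>\n",
    ]
    collector = TextCollector()
    parse_lenient(sample, collector)
    print("".join(collector.chunks))
```

For the real file, an open file object can be passed as lines, for example parse_lenient(open("nodes.xml"), ABContentHandler()), so only one line is held in memory at a time. This is a workaround for invalid input, not a replacement for having the producer fix the XML.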