Can't serialize minidom tree with io module
I have to deal with legacy code that uses xml.dom.minidom (and I can't migrate to lxml).
I want to parse this minimal sample:
<body>
<p>English</p>
<p>Français</p>
</body>
The following function works perfectly:
import codecs
import xml.dom.minidom

def transform1(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ...
    with codecs.open(dst_path, mode="w", encoding="utf-8") as fd:
        tree.writexml(fd, encoding="utf-8")
However, if I use io instead, everything goes wrong:
Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1747, in writexml
    writer.write('<?xml version="1.0" encoding="%s"?>%s' % (encoding, newl))
TypeError: must be unicode, not str
If I open the file in binary mode (mode="wb"), I get a different exception:
Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  ...
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 298, in _write_data
    writer.write(data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 4: ordinal not in range(128)
It seems the minidom authors were not aware of Unicode.
Why does it work with codecs?
Is there a way to fix this?
The writexml method seems to always dump str. Reading the docs tells me that its encoding parameter only adds the encoding attribute to the XML header:
Changed in version 2.3: For the Document node, an additional keyword
argument encoding can be used to specify the encoding field of the XML
header.
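As far as I can tell, the first traceback comes down to io text streams accepting only text and never byte strings, whereas the codecs writer on Python 2 is more lenient about what it is handed. A minimal sketch of that strictness, using an in-memory io.StringIO as a stand-in for the file (my assumption for illustration, not the code from the question):

```python
import io

# In-memory text stream standing in for a file opened with io.open(..., "w").
buf = io.StringIO()

# Text (unicode) is accepted.
buf.write(u"<p>Fran\u00e7ais</p>")

# A byte string is rejected with TypeError, on Python 2 and Python 3 alike.
try:
    buf.write(b"<p>English</p>")
    rejected = False
except TypeError:
    rejected = True
```

After this runs, rejected is True and the buffer holds only the text that was written first.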
You can try:
fd.write(tree.toxml(encoding="utf-8").decode("utf-8"))
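The above saves the XML as UTF-8 and specifies the encoding in the XML header.
If you don't specify the encoding, it is still saved as UTF-8, but the encoding attribute is not included in the header: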
fd.write(tree.toxml())
If you specify the encoding but leave out decode(), it raises an exception, because toxml() then returns a str, which is quite weird:
TypeError: write() argument 1 must be unicode, not str
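Another option, if you'd rather skip the decode() round trip, might be to keep the bytes that toxml(encoding=...) returns and write them to a file opened in binary mode. This is only a sketch: transform_bytes is a hypothetical name, and the temporary files below are just there to exercise it with the sample from the question.

```python
import io
import os
import tempfile
import xml.dom.minidom

def transform_bytes(src_path, dst_path):
    """Hypothetical helper: serialize to bytes, then write in binary mode."""
    tree = xml.dom.minidom.parse(src_path)
    # toxml(encoding=...) returns an encoded byte string whose XML header
    # already carries the encoding declaration.
    data = tree.toxml(encoding="utf-8")
    with io.open(dst_path, mode="wb") as fd:
        fd.write(data)

# Usage: write the sample from the question to a temporary directory,
# transform it, and read the result back as raw bytes.
sample = u"<body><p>English</p><p>Fran\u00e7ais</p></body>".encode("utf-8")
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.xml")
dst = os.path.join(workdir, "result.xml")
with io.open(src, "wb") as fd:
    fd.write(sample)
transform_bytes(src, dst)
with io.open(dst, "rb") as fd:
    result = fd.read()
```

Since nothing passes through a text stream here, no implicit ASCII conversion should get a chance to fail on the ç.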