Can't serialize minidom tree with io module

I have to work with legacy code that uses xml.dom.minidom (and I can't migrate to lxml).

I want to parse this minimal sample:

<body>
    <p>English</p>
    <p>Français</p>
</body>

The following function works perfectly:

import codecs
import xml.dom.minidom


def transform1(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ...
    with codecs.open(dst_path, mode="w", encoding="utf-8") as fd:
        tree.writexml(fd, encoding="utf-8")

However, if I switch to io to open the output file (transform2 in the traceback below), everything breaks:

Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 1747, in writexml
    writer.write('<?xml version="1.0" encoding="%s"?>%s' % (encoding, newl))
TypeError: must be unicode, not str

If I open the file in binary mode (mode="wb"), I get a different exception:

Traceback (most recent call last):
  File "/path/to/minidom_demo.py", line 23, in <module>
    transform2("sample.xml", "result.xml")
  File "/path/to/minidom_demo.py", line 18, in transform2
    tree.writexml(fd, encoding="utf-8")
  ...
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/dom/minidom.py", line 298, in _write_data
    writer.write(data)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe7' in position 4: ordinal not in range(128)

The minidom authors don't seem to be aware of Unicode.

Why does it work with codecs?

Is there a way to fix this?

The writexml method seems to always dump str. Reading the docs tells me that its encoding argument only adds the encoding attribute to the XML header:

Changed in version 2.3: For the Document node, an additional keyword argument encoding can be used to specify the encoding field of the XML header.
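To see what writexml actually hands to the stream, a throwaway writer can record the type of every chunk. This is only a sketch: RecordingWriter and chunk_types_written are names made up for the illustration, not part of minidom. Judging by the two tracebacks in the question, Python 2.7 should report both str (the generated markup) and unicode (the text data), which would explain why neither io mode accepts everything. On Python 3 every chunk is str.

```python
import xml.dom.minidom


class RecordingWriter(object):
    """Hypothetical stand-in for a file: remembers the type of each chunk."""

    def __init__(self):
        self.chunk_types = set()

    def write(self, data):
        # Store only the type name; the data itself is discarded.
        self.chunk_types.add(type(data).__name__)


def chunk_types_written(xml_bytes):
    """Return the set of type names that writexml passed to write()."""
    tree = xml.dom.minidom.parseString(xml_bytes)
    writer = RecordingWriter()
    tree.writexml(writer, encoding="utf-8")
    return writer.chunk_types


# Same content as the sample from the question, as UTF-8 bytes.
sample = u"<body><p>English</p><p>Fran\u00e7ais</p></body>".encode("utf-8")
print(sorted(chunk_types_written(sample)))
```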

You can try:

fd.write(tree.toxml(encoding="utf-8").decode("utf-8"))

The above saves the XML as UTF-8 and specifies the encoding in the XML header.
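Put together with a text-mode io.open call, the whole function might look like the sketch below. The name transform2 comes from the traceback in the question; its body is my assumption of what the original looked like, with only the write call changed.

```python
import io
import xml.dom.minidom


def transform2(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ...
    # toxml(encoding=...) returns an encoded byte string, so decode it
    # back to text before handing it to the text-mode io stream.
    xml_text = tree.toxml(encoding="utf-8").decode("utf-8")
    with io.open(dst_path, mode="w", encoding="utf-8") as fd:
        fd.write(xml_text)
```

Unlike writexml, this builds the whole document in memory before writing it, which should not matter for small files.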

If you don't specify an encoding, it will still be saved as UTF-8, but the encoding attribute won't be included in the header:

fd.write(tree.toxml())
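A quick way to compare the two headers without touching any file; xml_header is a small helper invented for this illustration:

```python
import xml.dom.minidom

tree = xml.dom.minidom.parseString(b"<body><p>English</p></body>")

# With an encoding, toxml() returns encoded bytes, hence the decode().
with_encoding = tree.toxml(encoding="utf-8").decode("utf-8")
# Without one, it returns text and leaves the encoding out of the header.
without_encoding = tree.toxml()


def xml_header(text):
    """Return everything up to and including the first '?>'."""
    return text[: text.index("?>") + 2]


print(xml_header(with_encoding))     # <?xml version="1.0" encoding="utf-8"?>
print(xml_header(without_encoding))  # <?xml version="1.0" ?>
```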

If you specify the encoding but don't call decode(), it raises an exception because toxml() returns a str, which is weird:

TypeError: write() argument 1 must be unicode, not str
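As for why codecs works: as far as I can tell, on Python 2 the stream returned by codecs.open accepts both str and unicode in write(), while an io text stream accepts only unicode. If you would rather keep calling writexml, you could wrap the io stream in a thin adapter that does the same conversion. This is a sketch of an alternative, not something minidom provides: UnicodeWriter and transform3 are made-up names, and it assumes that every byte string minidom writes is plain ASCII markup.

```python
import io
import xml.dom.minidom


class UnicodeWriter(object):
    """Hypothetical adapter: decodes byte strings before passing them on."""

    def __init__(self, stream, encoding="ascii"):
        self.stream = stream
        self.encoding = encoding

    def write(self, data):
        # On Python 2, bytes is an alias of str, so str chunks get decoded.
        # On Python 3, minidom writes text only and this branch is skipped.
        if isinstance(data, bytes):
            data = data.decode(self.encoding)
        self.stream.write(data)


def transform3(src_path, dst_path):
    tree = xml.dom.minidom.parse(src_path)
    # ...
    with io.open(dst_path, mode="w", encoding="utf-8") as fd:
        tree.writexml(UnicodeWriter(fd), encoding="utf-8")
```

If the tree contains non-ASCII byte strings (for example nodes created by hand with str values), the ASCII decode will fail, and passing a different encoding to UnicodeWriter would be needed.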