How to prevent lxml from adding a default doctype

lxml seems to add a default doctype when one is missing from the HTML document.

See this demo code:

import lxml.etree
import lxml.html


def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )

    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
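        # 'unicode' returns str; 'utf8' returns bytes, and the str-in-bytes
        # checks below would raise TypeError on Python 3
        encoding='unicode'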
    )


with_doctype = """
<!DOCTYPE html>
<html>
<head>
  <title>With Doctype</title>
</head>
</html>
"""

# This passes!
assert "DOCTYPE" in beautify(with_doctype)

no_doctype = """<html>
<head>
  <title>No Doctype</title>
</head>
</html>"""

# This fails!
assert "DOCTYPE" not in beautify(no_doctype)

# because the returned html contains this line
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source before

How can I tell lxml not to do this?

This issue was originally raised here: https://github.com/mitmproxy/mitmproxy/issues/845

Quoting a comment on reddit may be helpful:

lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here.

I don't know if you can tell lxml to pass that option though.. libxml has python bindings that you could perhaps use directly but they seem really hairy.

EDIT: did some more digging and that option does appear in the lxml source here. That option does exactly what you want but I'm not sure how to activate it yet, if it's even possible.

There is currently no way to do this in lxml, but I have created a Pull Request on lxml that adds a default_doctype boolean to the HTMLParser.
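Until such an option is available, one possible workaround is to check the source text for a doctype yourself and only pass docinfo.doctype through when the source actually declared one. The following is a minimal sketch under that assumption, not part of lxml; the names DOCTYPE_RE and beautify_keep_source_doctype are made up for illustration, and the regular expression is a simplification that does not handle comments placed before the doctype.

```python
import re

import lxml.etree
import lxml.html

# Hypothetical helper: matches a doctype declaration at the start of the
# source, allowing only leading whitespace before it (a simplification).
DOCTYPE_RE = re.compile(r"\s*<!DOCTYPE\b", re.IGNORECASE)


def beautify_keep_source_doctype(html):
    parser = lxml.etree.HTMLParser(remove_blank_text=True)
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

    # Only trust docinfo.doctype if the source itself declared a doctype;
    # otherwise it may be the default one injected by the parser.
    doctype = docinfo.doctype if DOCTYPE_RE.match(html) else None

    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=doctype,
        encoding="unicode",  # return str rather than bytes
    )


with_doctype = """
<!DOCTYPE html>
<html>
<head>
  <title>With Doctype</title>
</head>
</html>
"""

no_doctype = """<html>
<head>
  <title>No Doctype</title>
</head>
</html>"""

assert "DOCTYPE" in beautify_keep_source_doctype(with_doctype)
assert "DOCTYPE" not in beautify_keep_source_doctype(no_doctype)
```

This relies on tostring() emitting only the doctype it is given when it serializes an element rather than a whole tree.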

Once the code is merged, the parser will need to be created like this:

parser = lxml.etree.HTMLParser(
    strip_cdata=True,
    remove_blank_text=True,
    default_doctype=False,
)

Everything else stays the same.
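For completeness, here is a sketch of the demo with that option switched off. It assumes an lxml release in which HTMLParser accepts default_doctype (that is, one where the pull request has been merged). The function name beautify_no_default_doctype is only illustrative, and the `or None` guard is an added precaution so that an empty doctype string is never handed to tostring().

```python
import lxml.etree
import lxml.html


def beautify_no_default_doctype(html):
    # Assumption: this lxml version supports the default_doctype option.
    parser = lxml.etree.HTMLParser(
        remove_blank_text=True,
        default_doctype=False,  # do not inject a doctype the source lacks
    )

    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo

    return lxml.etree.tostring(
        d,
        pretty_print=True,
        # docinfo.doctype is an empty string when the source had no doctype;
        # pass None in that case so nothing is written for it.
        doctype=docinfo.doctype or None,
        encoding="unicode",  # return str rather than bytes
    )


with_doctype = """
<!DOCTYPE html>
<html>
<head>
  <title>With Doctype</title>
</head>
</html>
"""

no_doctype = """<html>
<head>
  <title>No Doctype</title>
</head>
</html>"""

# The doctype from the source is preserved ...
assert "DOCTYPE" in beautify_no_default_doctype(with_doctype)
# ... and none is invented when the source has none.
assert "DOCTYPE" not in beautify_no_default_doctype(no_doctype)
```

With the option off, both assertions from the question should hold, since the parser no longer creates the HTML 4.0 Transitional doctype on its own.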