How to prevent lxml from adding a default doctype
lxml seems to add a default doctype when one is missing from the HTML document.
See this demo code:
import lxml.etree
import lxml.html

def beautify(html):
    parser = lxml.etree.HTMLParser(
        strip_cdata=True,
        remove_blank_text=True
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        doctype=docinfo.doctype,
        encoding='utf8'
    )

with_doctype = """
<!DOCTYPE html>
<html>
<head>
<title>With Doctype</title>
</head>
</html>
"""
# This passes!
# tostring() returns bytes when an encoding is given, so compare against a bytes literal
assert b"DOCTYPE" in beautify(with_doctype)

no_doctype = """<html>
<head>
<title>No Doctype</title>
</head>
</html>"""
# This fails!
assert b"DOCTYPE" not in beautify(no_doctype)
# because the returned html contains this line
# <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# which was not present in the source before
How do I tell lxml not to do this?
This issue was originally raised here: https://github.com/mitmproxy/mitmproxy/issues/845
Quoting a comment on reddit might help:
lxml is based on libxml2, which does this by default unless you pass the option HTML_PARSE_NODEFDTD, I believe. Code here.
I don't know if you can tell lxml to pass that option though.. libxml has python bindings that you could perhaps use directly but they seem really hairy.
EDIT: did some more digging and that option does appear in the lxml source here. That option does exactly what you want but I'm not sure how to activate it yet, if it's even possible.
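There is currently no way to do this in lxml, but I have created a Pull Request on lxml which adds a default_doctype boolean to the HTMLParser.
Once the code gets merged, the parser needs to be created like this: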
parser = lxml.etree.HTMLParser(
    strip_cdata=True,
    remove_blank_text=True,
    default_doctype=False,
)
Everything else can stay the same.
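For reference, here is a minimal sketch of the whole round trip with the flag turned off. It assumes the installed lxml accepts the default_doctype keyword on HTMLParser (a build without that change would reject the unknown argument). The `or None` fallback is my own addition and not part of the pull request: it just avoids handing an empty doctype string to tostring(). strip_cdata is left at its default here.

```python
import lxml.etree
import lxml.html


def beautify(html):
    # Assumption: this lxml build supports default_doctype on HTMLParser.
    parser = lxml.etree.HTMLParser(
        remove_blank_text=True,
        default_doctype=False,  # do not invent a doctype the source lacks
    )
    d = lxml.html.fromstring(html, parser=parser)
    docinfo = d.getroottree().docinfo
    return lxml.etree.tostring(
        d,
        pretty_print=True,
        # docinfo.doctype is an empty string when the document has none;
        # pass None in that case so no doctype line is written at all.
        doctype=docinfo.doctype or None,
        encoding="utf-8",
    )


with_doctype = """
<!DOCTYPE html>
<html>
<head>
<title>With Doctype</title>
</head>
</html>
"""

no_doctype = """<html>
<head>
<title>No Doctype</title>
</head>
</html>"""

# tostring() returns bytes here, hence the bytes literals.
assert b"DOCTYPE" in beautify(with_doctype)
assert b"DOCTYPE" not in beautify(no_doctype)
```

With the flag off, a doctype shows up in the output only if the source had one.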