BeautifulSoup - proper way of dealing with self-closing tags
I have an HTML file that contains some self-closing tags, but BeautifulSoup doesn't like them.
from bs4 import BeautifulSoup
html = '<head><meta content="text/html" http-equiv="Content-Type"><meta charset="utf-8"></head>'
doc = BeautifulSoup(html, 'html.parser')
print(doc.prettify())
prints
<head>
<meta content="text/html" http-equiv="Content-Type">
<meta charset="utf-8"/>
</meta>
</head>
Do I have to manually check whether each tag is self-closing and modify it accordingly, or is there a better way of handling this?
As you may already know, you can specify which parser BeautifulSoup uses internally. And, as noted in the BeautifulSoup docs:
There are also differences between HTML parsers. If you give Beautiful
Soup a perfectly-formed HTML document, these differences won’t matter.
One parser will be faster than another, but they’ll all give you a
data structure that looks exactly like the original HTML document.
But if the document is not perfectly-formed, different parsers will
give different results.
In this particular case, both lxml and html5lib produce two separate meta tags:
In [4]: doc = BeautifulSoup(html, 'lxml')
In [5]: print(doc.prettify())
<html>
<head>
<meta content="text/html" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
</head>
</html>
In [6]: doc = BeautifulSoup(html, 'html5lib')
In [7]: print(doc.prettify())
<html>
<head>
<meta content="text/html" http-equiv="Content-Type"/>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>
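To check this on your own document without reading the prettified output by eye, something like the following sketch may help. It parses the same markup with each parser that happens to be installed and reports how many meta tags were found and whether any of them ended up nested inside another one. The helper name summarize_meta_tags and the HTML constant are made up for illustration; lxml and html5lib are optional third-party packages, so the sketch skips any parser BeautifulSoup cannot find.

```python
from bs4 import BeautifulSoup, FeatureNotFound

HTML = '<head><meta content="text/html" http-equiv="Content-Type"><meta charset="utf-8"></head>'


def summarize_meta_tags(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: (meta_count, any_meta_nested)} for each installed parser.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    summary = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The requested parser library is not installed; skip it.
            continue
        metas = soup.find_all("meta")
        # A meta tag is "nested" if one of its ancestors is also a meta tag.
        nested = any(m.find_parent("meta") is not None for m in metas)
        summary[name] = (len(metas), nested)
    return summary


for parser, (count, nested) in summarize_meta_tags(HTML).items():
    print(f"{parser}: {count} meta tags, nested={nested}")
```

If a parser reports nested=True, it treated the first meta tag as a container, which is the behaviour shown in the question. Whether html.parser does this may depend on the BeautifulSoup version you have installed.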