BeautifulSoup fails to parse HTML with `html5lib`
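BeautifulSoup fails to parse an HTML page when I use the `html5lib` option, but works fine with the `html.parser` option. According to the docs, `html5lib` should be more lenient than `html.parser`, so why do I get garbled text when I parse this page with it?
Below is a small runnable example. (After changing `html5lib` to `html.parser`, the Chinese output is displayed correctly.)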
#_*_coding:utf-8_*_
import requests
from bs4 import BeautifulSoup
ss = requests.Session()
res = ss.get("http://tech.qq.com/a/20151225/050487.htm")
html = res.content.decode("GBK").encode("utf-8")
soup = BeautifulSoup(html, 'html5lib')
print(str(soup)[0:800])  # shows whether the HTML was parsed correctly or not
Don't re-encode your content. Leave the decoding to BeautifulSoup:
soup = BeautifulSoup(res.content, 'html5lib')
If you do re-encode, you need to replace the meta header present in the source, because it still declares the original encoding:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
or decode manually and pass in Unicode:
soup = BeautifulSoup(res.content.decode('gbk'), 'html5lib')
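If re-encoding really is required, the meta replacement mentioned above could look roughly like the sketch below. It assumes, as the answer implies, that the garbling comes from the parser trusting the declared gb2312 charset while the bytes are actually UTF-8. The names `META_CHARSET` and `reencode_to_utf8` are hypothetical helpers, not part of BeautifulSoup or html5lib, and the sample bytes stand in for `res.content`.

# -*- coding: utf-8 -*-
```python
import re

# Hypothetical helper pattern: captures everything up to the charset value
# inside a <meta ... charset=...> declaration.
META_CHARSET = re.compile(r'(<meta[^>]*?charset\s*=\s*["\']?)[\w-]+', re.IGNORECASE)

def reencode_to_utf8(raw, source_encoding="gbk"):
    """Decode raw bytes, rewrite the declared charset, return UTF-8 bytes."""
    text = raw.decode(source_encoding)
    # Keep the tag as it is and swap only the charset value.
    text = META_CHARSET.sub(lambda m: m.group(1) + "utf-8", text)
    return text.encode("utf-8")

# Stand-in for res.content: a GBK-encoded page that declares gb2312.
sample = (u'<html><head><meta http-equiv="Content-Type" '
          u'content="text/html; charset=gb2312"></head>'
          u'<body>\u4e2d\u6587</body></html>').encode("gbk")

fixed = reencode_to_utf8(sample)
print(fixed.decode("utf-8"))
```

The resulting `fixed` bytes declare the same encoding they are stored in, so they should be safe to hand to `BeautifulSoup(fixed, 'html5lib')`. Passing the raw response bytes or a decoded Unicode string, as shown above, remains the simpler option.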