How to decode and encode a web page with Python?

Question

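I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as utf-8, gb2312, and gbk. I used urllib2 to fetch sohu's home page, which is encoded in gbk, yet in my code I decode it like this: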

self.html_doc = self.html_doc.decode('gb2312','ignore')

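But how can I find out which encoding a page uses before I hand it to BeautifulSoup to decode it to unicode? On most Chinese websites, the Content-Type field in the HTTP headers does not state the encoding.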

Answer 1

With BeautifulSoup you can parse the HTML and then read the original_encoding attribute:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)

>>> soup.original_encoding
u'gbk'

This agrees with the encoding declared in the <meta> tag in the HTML's <head>:

<meta http-equiv="content-type" content="text/html; charset=GBK" />

>>> soup.meta['content']
u'text/html; charset=GBK'
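
If you want the declared charset before any parser touches the page, the same lookup can be sketched with the standard library alone. This is a rough Python 3 sketch, not what BeautifulSoup does internally: sniff_charset and META_CHARSET are hypothetical names, the sample bytes are made up, and the regex only covers the common <meta> forms.

```python
import re
from email.message import Message

# Hypothetical pattern: matches charset=... inside a <meta> tag (common forms only).
META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?\s*([\w-]+)", re.IGNORECASE)


def sniff_charset(raw, content_type=None):
    """Return the declared charset in lower case, or None if nothing is declared."""
    if content_type:
        # Let the stdlib parse the Content-Type header value.
        msg = Message()
        msg["Content-Type"] = content_type
        declared = msg.get_content_charset()
        if declared:
            return declared
    # Fall back to the <meta> declaration near the top of the document.
    match = META_CHARSET.search(raw[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


# Made-up sample standing in for the bytes returned by read().
sample = (
    '<html><head><meta http-equiv="content-type" '
    'content="text/html; charset=GBK" /><title>搜狐</title></head></html>'
).encode("gbk")

# The header carries no charset here, so the <meta> tag decides.
encoding = sniff_charset(sample, "text/html")
print(encoding)
print(sample.decode(encoding))
```

The header is checked first because an explicit HTTP charset normally takes precedence over the <meta> tag; when neither is present the function returns None and you are back to guessing.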

Now you can decode the HTML:

decoded_html = html.decode(soup.original_encoding)

But there is not much point in doing so, because the HTML is already available as unicode through the soup:

>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐

You can also try to detect the encoding with the chardet module (although it is a bit slow):

>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
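
Note that chardet reports GB2312 here while the page declares GBK. GBK is a superset of GB2312, so decoding with gb2312 breaks as soon as the page contains a character that only GBK has, and adding 'ignore' just drops that character silently. A small Python 3 sketch of the difference follows; widen_gb is a hypothetical helper and the sample string is made up.

```python
def widen_gb(encoding):
    """Hypothetical helper: treat a reported GB2312 as GBK, its superset."""
    if encoding and encoding.lower() == "gb2312":
        return "gbk"
    return encoding


# "镕" exists in GBK but not in GB2312.
raw = "朱镕基".encode("gbk")

try:
    raw.decode("gb2312")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'ignore' hides the error but loses the character.
lossy = raw.decode("gb2312", "ignore")

# Widening the detected name decodes the bytes without loss.
clean = raw.decode(widen_gb("GB2312"))

print(strict_ok, "镕" in lossy, clean)
```

Browsers make the same choice: the WHATWG Encoding Standard treats the label gb2312 as GBK.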

Answer 2

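I know this is an old question, but I spent a while today puzzling over a particularly troublesome website, so I thought I would share the solution that worked for me, which I got from here: http://shunchiubc.blogspot.com/2016/08/python-to-scrape-chinese-websites.html

Requests has a feature that works out the site's actual encoding from the page content, so you don't have to wrestle with encoding and decoding it yourself (before I found this, I got all sorts of errors trying to encode/decode strings/bytes and never ended up with readable output). The feature is called apparent_encoding. Here is how I used it: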

from bs4 import BeautifulSoup
import requests

url = 'http://url_youre_using_here.html'
readOut = requests.get(url)
readOut.encoding = readOut.apparent_encoding #sets the encoding properly before you hand it off to BeautifulSoup
soup = BeautifulSoup(readOut.text, "lxml")
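
Why the reassignment matters: per the Requests documentation, when a text response has no charset in its headers, the encoding falls back to ISO-8859-1, so .text comes out garbled for a GBK page. The effect can be reproduced offline with plain Python 3; the bytes below are a made-up stand-in for the response body, not real output from Requests.

```python
# Made-up stand-in for the raw bytes of a GBK-encoded page.
raw = "搜狐-中国最大的门户网站".encode("gbk")

# A Latin-1 fallback never raises; it just yields mojibake.
garbled = raw.decode("iso-8859-1")

# With the right encoding set, the same bytes are readable.
readable = raw.decode("gbk")

# Latin-1 maps every byte to one character, so the damage is reversible.
recovered = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))
print(readable)
print(recovered == readable)
```

So nothing is lost by the wrong default; the encoding only has to be set before the text is read.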

Answer 3

Another solution.

from simplified_scrapy.request import req
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = req.get('http://www.sohu.com') # This will automatically help you find the correct encoding
doc = SimplifiedDoc(html)
print (doc.title.text)


Tags: python, encoding, web