urllib.request.urlopen returns bytes, but I cannot decode it

Question

I am trying to parse a web page using urlopen() from urllib.request, like this:

from urllib.request import Request, urlopen
req = Request(url)
html = urlopen(req).read()

However, the last line returns the result as bytes.

So I tried to decode it, like this:

html = urlopen(req).read().decode("utf-8")

However, this raises an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte.

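After some research, I found one related answer, which parses the charset to decide how to decode. However, the page does not return a charset, and when I inspected it in the Chrome Web Inspector, the following line appeared in its header: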

<meta charset="utf-8">

So why can't I decode it as utf-8? And how can I parse the web page successfully?

The site URL is http://www.vogue.com/fashion-shows/fall-2016-menswear/fendi/slideshow/collection#2, and I want to save the images to my disk.

Note that I am using Python 3.5.1. I also noticed that everything I wrote above works fine in my other scraping programs.

Answer 1

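The content is gzip-compressed. The byte 0x8b at position 1 in the error message is the second byte of the gzip magic number (0x1f 0x8b), so the body is compressed data rather than UTF-8 text. You need to decompress it first: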

import gzip
from urllib.request import Request, urlopen

req = Request(url)
html = gzip.decompress(urlopen(req).read()).decode('utf-8')
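Since the question mentions that the same code works on other sites, the body is presumably not always compressed. A minimal sketch of a helper that decompresses only when needed is shown below. The name decode_body and its parameters are hypothetical (they are not part of urllib), and it assumes gzip is the only compression in play:

```python
import gzip
from typing import Optional

# Every gzip stream starts with these two bytes; 0x8b is the byte
# that the UnicodeDecodeError in the question complains about.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw: bytes,
                content_encoding: Optional[str] = None,
                charset: str = "utf-8") -> str:
    """Hypothetical helper: gunzip `raw` if it looks compressed, then decode it."""
    encoding = (content_encoding or "").strip().lower()
    # Trust the Content-Encoding header if present, otherwise fall back
    # to sniffing the gzip magic number at the start of the body.
    if encoding == "gzip" or raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Possible usage with urlopen (not executed here; `url` is assumed to be defined):
#
#     from urllib.request import Request, urlopen
#     with urlopen(Request(url)) as resp:
#         html = decode_body(resp.read(),
#                            resp.headers.get("Content-Encoding"),
#                            resp.headers.get_content_charset() or "utf-8")
```

With this, uncompressed responses pass straight through to decode(), while gzip responses are decompressed first. Falling back to "utf-8" when the server sends no charset is an assumption that happens to match the page's meta tag here.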

If you use requests, it decompresses the response for you automatically:

import requests
html = requests.get(url).text  # => str, not bytes
