I am trying to get HTML data from a site using urllib, but for some sites I end up with unknown characters in Python

Question

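Hi everyone, I am trying to get HTML data from a website using urllib.urlopen(url).read(), but for some sites all I get back is data like this: * 6\xbdW\xb6\xd6\xff\xca\x9d\x9bO|\xc0\x96a\xc7\xc8\xf7\xa7\x10-\x8aM{\xf8\x* I don't know what it is or why I am getting it. I tried googling it, and some people say it is an encoding/decoding problem. I tried that as well, but as you can see I had no luck, so please guide me out of this darkness. Here is my code: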

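import urllib  # Python 2: urlopen lives directly in the urllib module

url = "http://mangafox.me/manga/online_the_comic/c001/1.html"  # fails for this site and a few others
page = urllib.urlopen(url).read()  # raw response body, not yet decompressed or decoded
print page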

Does anyone know what is going on with the output this code prints?

Answer 1

This page is served in gzip format, so you have to decompress the response before you can use the data:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x8b in position 1: ordinal not in range(128)

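The byte 0x8b at position 1 of the response indicates gzip format (every gzip stream starts with the two bytes 0x1f 0x8b).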
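
To make the magic-number check concrete, here is a minimal sketch in Python 3 (the question itself uses Python 2). The helper name looks_like_gzip and the sample payload are made up for illustration; only the standard-library gzip module is assumed.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def looks_like_gzip(data):
    """Return True if the raw bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Illustrative payload, compressed locally instead of fetched from a site.
sample = gzip.compress(b"<html>hello</html>")

print(looks_like_gzip(sample))                 # compressed bytes
print(looks_like_gzip(b"<html>hello</html>"))  # plain bytes
```

Running the same check on the bytes returned by read() is a quick way to confirm whether the "unknown characters" are compressed data rather than a character-encoding problem.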

You should take a look at this question:

twitter trends api UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte
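
Putting the answer into practice, a fix along these lines might look like the following Python 3 sketch. The function names decode_body and fetch_html are hypothetical, and the UTF-8 default is an assumption about the page's charset. On Python 2, gzip.GzipFile(fileobj=StringIO.StringIO(data)).read() plays the role of gzip.decompress.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding=None):
    """Decompress raw if the server labelled it as gzip, else return it unchanged."""
    if content_encoding and "gzip" in content_encoding.lower():
        return gzip.decompress(raw)
    return raw


def fetch_html(url, charset="utf-8"):
    """Fetch url and return its body as text (charset is an assumed default)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # The Content-Encoding header tells us whether the body is compressed.
        encoding = response.headers.get("Content-Encoding")
    body = decode_body(raw, encoding)
    return body.decode(charset, errors="replace")
```

Calling fetch_html(url) with the URL from the question would then be expected to return readable HTML, provided the server labels the response with Content-Encoding: gzip.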

python

urllib

chars