Decoding an HTML file downloaded with urllib
I am trying to download an HTML file like this:
import urllib
req = urllib.urlopen("http://www.stream-urls.de/webradio")
html = req.read()
print html
html = html.decode('utf-16')
print html
Since the output after req.read() looks like Unicode, I tried to convert the response, but I get this error:
Traceback (most recent call last):
  File "e:\Documents\Python\main.py", line 8, in <module>
    html = html.decode('utf-16')
  File "E:\Software\Python2.7\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 38-39: illegal UTF-16 surrogate
What do I need to do to get the correct encoding?
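The response body is most likely gzip-compressed rather than UTF-16 text, which is why decoding it fails. With requests you get the correctly decompressed HTML: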
import requests
r = requests.get("http://www.stream-urls.de/webradio")
print r.text
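To check whether a raw body is compressed before trying to decode it, one option is to look at its first two bytes. The following is a minimal sketch in Python 3 (the code in this thread is Python 2), assuming the server sends gzip data. `looks_like_gzip` is a hypothetical helper, and the sample body is compressed locally as a stand-in for req.read() so the example runs without network access.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def looks_like_gzip(data):
    """Return True if the byte string starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Stand-in for req.read(): sample HTML compressed locally (assumption,
# not the real response from the site in the question).
body = gzip.compress("<html><body>Webradio</body></html>".encode("utf-8"))

print(looks_like_gzip(body))       # compressed sample
print(looks_like_gzip(b"<html>"))  # plain, uncompressed bytes
```

If the check succeeds, the bytes need to be decompressed first; only the decompressed result is text that can be decoded.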
EDIT: how to decompress the data with gzip and StringIO without saving it to a file:
import urllib
import gzip
import StringIO
req = urllib.urlopen("http://www.stream-urls.de/webradio")
# create file-like object in memory
buf = StringIO.StringIO(req.read())
# create gzip object using file-like object instead of real file on disk
f = gzip.GzipFile(fileobj=buf)
# get data from file
html = f.read()
print html
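For readers on Python 3, the same in-memory approach can be sketched with io.BytesIO in place of StringIO. This is an illustration under assumptions rather than a drop-in replacement: `gunzip_in_memory` is a hypothetical helper, the UTF-8 default is a guess about the page's encoding, and the sample bytes are compressed locally instead of being downloaded.

```python
import gzip
import io


def gunzip_in_memory(data, encoding="utf-8"):
    """Decompress gzip bytes through an in-memory file object and decode to text.

    The encoding default is an assumption; use the charset the server reports.
    """
    buf = io.BytesIO(data)  # file-like object in memory, no file on disk
    with gzip.GzipFile(fileobj=buf) as f:
        return f.read().decode(encoding)


# Stand-in for the downloaded body (compressed locally for the example)
sample = gzip.compress("<html><body>Webradio</body></html>".encode("utf-8"))
html = gunzip_in_memory(sample)
print(html)
```

In Python 3 the download itself would go through urllib.request.urlopen, whose read() returns bytes that can be passed to the helper directly.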