Python3 - urllib.request.urlopen and readlines to utf-8?

Question

Consider this example:

import urllib.request # Python3 URL loading

filelist_url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
filelist_fobj = urllib.request.urlopen(filelist_url)
#filelist_fobj_fulltext = filelist_fobj.read().decode('utf-8')
#print(filelist_fobj_fulltext) # ok, works
lines = filelist_fobj.readlines()
print(type(lines[0]))

This code prints the type of the first entry in the list that readlines() returns for the file object of the urlopen()'d URL:

<class 'bytes'>

...and in fact, all entries in the returned list are of the same type.

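I'm aware that I could do .read().decode('utf-8') as in the commented lines, and then split the result on \n. However, I'd like to know: is there any other way to use urlopen with .readlines() and get a list of ("utf-8") strings?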

Answer 1

urllib.request.urlopen returns an http.client.HTTPResponse object, which implements the io.BufferedIOBase interface, and that interface returns bytes.

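The io module provides TextIOWrapper, which can wrap a BufferedIOBase object (or other similar objects) to add an encoding. The wrapper's readlines method returns str objects, decoded according to the encoding you specify when you create the TextIOWrapper, so if you get the encoding right, everything will work. (On Unix-like systems, utf-8 is the default encoding, but apparently that's not the case on Windows. So if you want portability, you need to provide an encoding. I'll come back to this in a minute.)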

So the following works fine:

>>> from urllib.request import urlopen
>>> from io import TextIOWrapper
>>> url="https://www.w3.org/TR/PNG/iso_8859-1.txt"
>>> with urlopen(url) as response:
...   lines = TextIOWrapper(response, encoding='utf-8').readlines()
... 
>>> for line in lines[:5]: print(type(line), line.strip())
... 
<class 'str'> The following are the graphical (non-control) characters defined by
<class 'str'> ISO 8859-1 (1987).  Descriptions in words aren't all that helpful,
<class 'str'> but they're the best we can do in text.  A graphics file illustrating
<class 'str'> the character set should be available from the same archive as this
<class 'str'> file.

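It's worth noting that both the HTTPResponse object and the TextIOWrapper which wraps it implement the iterator protocol, so you can use a loop like for line in TextIOWrapper(response, ...): instead of saving the whole web page with readlines(). The iterator protocol can be a big win because it lets you start processing the web page before it has all been downloaded.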
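The loop form can be sketched without a network connection. In the example below, an io.BytesIO object stands in for the HTTPResponse returned by urlopen (an assumption made purely for illustration), and first_matching_line is a hypothetical helper name, not part of any library.

```python
import io

# Stand-in for the HTTPResponse returned by urlopen(); any binary
# file-like object with a compatible interface is wrapped the same way.
fake_response = io.BytesIO(
    "first line\nsecond: caf\u00e9\nthird line\n".encode("utf-8")
)

def first_matching_line(binary_stream, needle, encoding="utf-8"):
    """Return the first decoded line containing `needle`, or None."""
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding)
    for line in text_stream:  # lines are decoded one at a time, on demand
        if needle in line:
            return line.rstrip("\n")  # stop early; later lines are never decoded
    return None

found = first_matching_line(fake_response, "second")
print(type(found), found)
```

With a real response, returning early like this means the rest of the page may never need to be read.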

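Because I'm working on a Linux system, I could have left out the encoding='utf-8' argument to TextIOWrapper, but either way the code assumes that I know the file is UTF-8 encoded. That's a pretty safe assumption, but it's not universally valid. According to the W3Techs survey (updated daily, at least as I write this answer), 97.6% of websites use UTF-8 encoding, which means that roughly one in forty doesn't. (If you restrict the survey to what W3Techs considers the top 1,000 sites, the percentage increases to 98.7%. But that's still not universal.)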
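To see what a given platform picks when the encoding argument is omitted, one can inspect the wrapper's encoding attribute. This is only a small sketch: the empty io.BytesIO is a stand-in for a real response, and the first printed value depends on the machine it runs on.

```python
import io

# No encoding given: TextIOWrapper falls back to a platform-dependent default.
implicit = io.TextIOWrapper(io.BytesIO(b""))

# Encoding given explicitly: the same on every platform.
explicit = io.TextIOWrapper(io.BytesIO(b""), encoding="utf-8")

print(implicit.encoding, explicit.encoding)
```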

Now, the conventional wisdom you'll find in many SO answers is that you should dig the encoding out of the HTTP headers, which you can do fairly easily:

>>> # Tempting though this is, DO NOT DO IT. See below.
>>> with urlopen(url) as response:
...   lines = TextIOWrapper(response,
...                         encoding=response.headers.get_content_charset()
...                        ).readlines()
...

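Unfortunately, this only works if the website declares the content encoding in the HTTP headers, and many sites prefer to put the encoding in a meta tag. So when I tried the above with a randomly-selected Windows-1252-encoded site (taken from the W3Techs survey), it failed with an encoding error: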

>>> with urlopen(win1252_url) as response:
...   lines = TextIOWrapper(response, 
...                         encoding=response.headers.get_content_charset()
...                        ).readlines()
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 346: invalid continuation byte

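Note that although the page is encoded in Windows-1252, that information is not provided in the HTTP headers, so TextIOWrapper chose the default encoding, which on my system is UTF-8. If I supply the correct encoding, I can read the page without problems, which lets me see the encoding declaration in the page itself.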

>>> with urlopen(win1252_url) as response:
...   lines = TextIOWrapper(response,
...                         encoding='Windows-1252'
...                        ).readlines()
... 
>>> print(lines[3].strip())
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
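The failure mode above can be reproduced offline. The sketch below is an illustration, not the original session: it builds a header object by hand with email.message.Message (the base class of the headers object on urlopen responses), charset_from_content_type is a hypothetical helper name, and the sample byte string is made up.

```python
from email.message import Message

def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    headers = Message()  # hand-built stand-in for response.headers
    headers["Content-Type"] = content_type
    return headers.get_content_charset()

declared = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")
print(declared, missing)

# Made-up sample: Windows-1252 bytes containing 0xf3 (o with an acute accent).
sample = "informaci\u00f3n".encode("windows-1252")
try:
    sample.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(decoded_as_utf8, sample.decode("windows-1252"))
```

When get_content_charset() returns None, TextIOWrapper receives encoding=None and falls back to its platform default, which is why the traceback mentions the 'utf-8' codec.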

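Obviously, if the encoding is declared in the content itself, you cannot set the encoding before reading the content. So what to do in these cases?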

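The most general solution, and the simplest to code, appears to be the well-known BeautifulSoup package, which is able to use a variety of techniques to detect the character encoding. Unfortunately, that requires parsing the entire page, which is a much more time-consuming task than just reading lines.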

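Another option is to read the first kilobyte or so of the web page as bytes, and then try to find a meta tag. The content provider is supposed to put the meta tag near the beginning of the web page, and it certainly must come before the first non-ASCII character. If you don't find a meta tag and there is no character encoding declared in the HTTP headers, then you could try using a heuristic encoding detector on the bytes of the file you've already read.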
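A minimal sketch of that approach follows. The names META_CHARSET, sniff_encoding and read_lines are hypothetical, the regular expression is a rough approximation rather than a real HTML parser, and the final fallback is a fixed default ("utf-8") where a heuristic detector could be plugged in. An io.BytesIO object stands in for the response so the example runs offline.

```python
import io
import re

# Rough pattern for charset=... inside a <meta> tag; covers both
# <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

def sniff_encoding(head, header_charset=None, default="utf-8"):
    """Prefer the meta tag, then the HTTP header charset, then a default."""
    match = META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii")
    return header_charset or default

def read_lines(binary_stream, header_charset=None, sniff_bytes=1024):
    """Read the stream, sniff the encoding from its head, return str lines."""
    head = binary_stream.read(sniff_bytes)
    encoding = sniff_encoding(head, header_charset)
    body = head + binary_stream.read()  # the head has already been consumed
    return io.TextIOWrapper(io.BytesIO(body), encoding=encoding).readlines()

# Made-up Windows-1252 page standing in for a downloaded response.
page = (
    "<html><head>\n"
    '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">\n'
    "</head><body>informaci\u00f3n</body></html>\n"
).encode("windows-1252")

lines = read_lines(io.BytesIO(page))
print(lines[2].strip())
```

With a real response, header_charset would come from response.headers.get_content_charset(). Note that this sketch buffers the whole body, so it gives up the streaming benefit of iterating over the response directly.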

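The one thing you shouldn't do is rely on the character encoding declared in the HTTP headers, despite the many recommendations to do so which you can find here and elsewhere on the web. As we've already seen, the headers often don't contain this information, but even when they do, it is often wrong, because it is easier for a web designer to declare the encoding in the page itself than to reconfigure the server to send the correct headers. So you can't really rely on the HTTP header, and you should only use it if you have no other information to go on.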

