urllib read() changing attributes

Question

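I have a basic script that requests websites to fetch their HTML source. While crawling several sites, I noticed that various attributes in the source are represented incorrectly.

Example: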

from urllib import request

opener = request.build_opener()
with opener.open("https://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2") as response:
    html = response.read()
print(html)

I compared the result (the html variable) with the source shown by Chrome and Firefox.

I saw differences like these:

Browser                        Urllib

href='rfc2616.html'            href=\'rfc2616.html\'
rev='Section'                  rev=\'Section\'
rel='xref'                     rel=\'xref\'
id='sec4.5'                    id=\'sec4.4\'

It looks like urllib is inserting backslashes here to escape the code.

Is this a bug inside urllib, or is there a way to work around it?

Thanks in advance.

Answer 1

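response.read() returns a bytes object. Printing a bytes object shows its repr, in which escape sequences are displayed literally rather than interpreted; see: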

print(b'hello\nworld') # prints b'hello\nworld'
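The backslashes from the question can be reproduced offline in the same way. The sketch below uses a made-up fragment (not the real page content) that contains both single and double quotes, in which case the bytes repr escapes the single quotes:

```python
# A made-up fragment resembling page source; it contains both quote characters.
raw = b"<a href='rfc2616.html' title=\"RFC 2616\">spec</a>\n"

# print() on bytes shows repr(raw): single quotes and the newline appear escaped.
shown_as_bytes = repr(raw)

# Decoding yields the real text; the backslashes were never part of the data.
text = raw.decode("utf-8")

print(shown_as_bytes)
print(text)
```

The backslashes exist only in the printed representation, so nothing needs to be stripped from the data itself.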

You need to decode it to a str, which prints as the actual text with no visible escapes:

print(html.decode())
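Calling decode() with no argument assumes UTF-8. As a minimal sketch, and assuming the server declares its encoding in the Content-Type header, the charset can be read from the response headers instead. The names decode_body and fetch_html are hypothetical helpers for illustration, not part of urllib:

```python
from typing import Optional
from urllib import request


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode a response body, falling back to UTF-8 when no charset is given."""
    # errors="replace" keeps a few undecodable bytes from raising UnicodeDecodeError.
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url: str) -> str:
    """Fetch a URL and return its body as str (hypothetical helper)."""
    with request.urlopen(url) as response:
        # get_content_charset() returns the charset from Content-Type, or None.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Offline check with a Latin-1 encoded body; no network access is needed here.
print(decode_body("caf\xe9".encode("latin-1"), "latin-1"))
```

With this in place, print(fetch_html(url)) would show the page source as text, with plain quotes as in the browser view.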

python

urllib

python-3.x