Unicode errors when printing python beautiful soup object

Question

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup)

This code produces the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xbb' in position 1509: ordinal not in range(128)

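I've tried several workarounds, but each of them has some drawback. After searching on Stack Overflow, I found a solution that replaces sys.stdout, as follows: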

import bs4 as bs
import urllib.request
import sys
import codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
sys.stderr = codecs.getwriter('utf-8')(sys.stderr)

sauce = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup)

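I no longer get the error, but the output no longer goes to the terminal, and I'm not sure why. Using the .prettify('utf-8') method also gets rid of the error and produces output, but the resulting object is a string rather than a beautiful soup object, so it has none of the associated bs methods (e.g. .find_all()). The .encode('utf-8') method has a similar problem.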

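Also, I've noticed that in the output, there are many \r and \n characters still in the beautiful soup object instead of the pure html content.

I want a beautiful soup object without any of the \r or \n characters that I can print to the terminal.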

Answer 1

In your code:

sauce = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

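sauce is of type bytes. When you pass it into bs.BeautifulSoup(), BeautifulSoup tries to decode those bytes as an ascii string, which fails because the content is actually utf-8, as indicated by the Content-Type response header (text/html; charset=utf-8) and by the meta tag at the start of the html document (<meta charset="utf-8" />).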
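To make the bytes/str distinction concrete, here is a minimal sketch that uses a made-up byte string in place of the downloaded page (the sample markup is an assumption, not the real response). It shows that utf-8 bytes containing '»' (U+00BB) cannot be decoded as ascii, that decoding them as utf-8 works, and that encoding the resulting string to ascii raises the same kind of UnicodeEncodeError as the one in the question.

```python
# Hypothetical stand-in for the downloaded page: UTF-8 bytes containing U+00BB.
raw = "<title>Roster \u00bb Team</title>".encode("utf-8")

# Decoding with the wrong codec fails: U+00BB is two non-ASCII bytes in UTF-8.
try:
    raw.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the charset the server declares succeeds and yields a str.
text = raw.decode("utf-8")

# Encoding that str to ASCII fails with the error type seen in the question.
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

print(type(raw).__name__, "->", type(text).__name__)
print(decode_error)
print(encode_error)
```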

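The first argument to bs.BeautifulSoup(), markup, takes a string or a file-like object representing the markup to be parsed. You should explicitly decode those bytes as utf-8 into a string and use that in place of the raw bytes, like so: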

sauce = urllib.request.urlopen('https://stats.swehockey.se/Teams/Info/TeamRoster/10333').read().decode('utf-8')
soup = bs.BeautifulSoup(sauce, 'lxml')
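Separately, the sys.stdout replacement in the question wraps a text stream with a writer that emits bytes, which is likely why nothing reached the terminal. A hedged sketch of an alternative, assuming Python 3: wrap a binary stream in io.TextIOWrapper with an explicit utf-8 encoding. The helper name utf8_text_stream is my own, and the demonstration writes to an in-memory buffer rather than the real terminal so it can be run anywhere.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that str written to it is encoded as UTF-8.

    Hypothetical helper for illustration; not part of bs4 or the question.
    """
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


# Demonstration on an in-memory buffer standing in for the terminal.
buffer = io.BytesIO()
out = utf8_text_stream(buffer)

# print() accepts str; the wrapper encodes it, so U+00BB is written as UTF-8.
print("Roster \u00bb Team", file=out)
out.flush()

written = buffer.getvalue()
```

On a real terminal the same idea would be sys.stdout = utf8_text_stream(sys.stdout.buffer), and on Python 3.7 or later sys.stdout.reconfigure(encoding='utf-8') does this in one call. Both assume the standard stream exposes those attributes, which some IDE consoles may not.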

Also, I've noticed that in the output, there are many \r and \n characters still in the beautiful soup object instead of the pure html content.

I want a beautiful soup object without any of the \r or \n characters that I can print to the terminal.

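The \r and \n characters are just representations of line breaks. If you print them, or view them in a text editor, they show up as actual line breaks.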
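A small sketch of that point, using a made-up html fragment (an assumption, not the real page): repr() shows the escapes, print() shows real line breaks, and if you really want them gone you can strip them from the string before parsing it.

```python
# Hypothetical fragment with Windows-style line endings, as served by many sites.
html = "<ul>\r\n  <li>A</li>\r\n  <li>B</li>\r\n</ul>"

# repr() shows the escape sequences; this is what you see when inspecting the object.
escaped_view = repr(html)

# print() writes the actual characters, so the terminal shows line breaks instead.
print(html)

# To drop line breaks and indentation, split on line boundaries and rejoin.
# str.splitlines() treats "\r\n" as a single line boundary.
compact = "".join(line.strip() for line in html.splitlines())
print(compact)
```

The compact string can then be passed to bs.BeautifulSoup() like any other markup, so the parsed object keeps its usual methods such as .find_all().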

