Python 2.7: Trouble Encoding to UTF-8

Question

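I have a dataframe with a column _text that contains the text of an article. I'm trying to get the length of the article, in words, for each row of the dataframe. Here is my attempt: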

from bs4 import BeautifulSoup
result_df['_text'] = [BeautifulSoup(text, "lxml").get_text() for text in result_df['_text']]

text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

Unfortunately, I get this error:

    ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-8-f6c8ab83a46f> in <module>()
----> 1 text_word_length = [len(str(x).split(" ")) for x in result_df['_text']]

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 231: ordinal not in range(128)

It seems I should be specifying "utf-8" somewhere, I'm just not sure where...

Thanks!

Answer 1

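According to the official Python documentation (Python Official Site):

To define a source code encoding, a magic comment must be placed in the source file as either the first or second line of the file, for example: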

# coding=<encoding name>

or (using formats recognized by popular editors):

#!/usr/bin/python
# -*- coding: <encoding name> -*-

or:

#!/usr/bin/python
# vim: set fileencoding=<encoding name> :
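
As a small sketch of what such a declaration does and does not do (assuming the file is saved as UTF-8; the variable names here are illustrative, not from the question): the magic comment only tells the interpreter how to decode non-ASCII literals in the source file. It does not change which codec is used when a unicode string is converted at run time, so on its own it would probably not make the error in the question go away.

```python
# -*- coding: utf-8 -*-
# Assumption: this file is saved as UTF-8. Names below are illustrative.

# The declaration lets the interpreter decode this non-ASCII literal correctly.
quote = u"’"  # the character reported as u'\u2019' in the traceback
same_char = (quote == u"\u2019")

# Encoding to ASCII at run time still fails, with or without the declaration.
try:
    quote.encode("ascii")
    encodes_to_ascii = True
except UnicodeEncodeError:
    encodes_to_ascii = False
```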

Answer 2

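I assume you are using Python 2 and that your input text contains non-ASCII characters. The problem arises at str(x): when x is a unicode string, str(x) by default ends up as x.encode('ascii'), which fails on any character outside the ASCII range.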

You have two ways to solve this:

Encode the unicode string properly as UTF-8:

text_word_length = [len(x.encode('utf-8').split(" ")) for x in result_df['_text']]

Split the string as unicode:

text_word_length = [len(x.split(u" ")) for x in result_df['_text']]
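
To see both options without a dataframe, here is a minimal sketch on a single string. The sample text is made up for illustration; it contains the same u'\u2019' character named in the traceback. The snippet is written so that it should behave the same way on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text containing the right single quotation mark (u'\u2019').
sample = u"It\u2019s a short article"

# What Python 2's str(x) attempts implicitly for a unicode object:
try:
    sample.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Option 1: encode explicitly as UTF-8, then split the resulting byte string.
words_from_bytes = sample.encode("utf-8").split(b" ")

# Option 2: skip the conversion and split the unicode string directly.
words_from_unicode = sample.split(u" ")

# Both options yield the same word count.
word_count = len(words_from_unicode)
```

Both options give the same count here because the separator is a plain ASCII space. Option 2 avoids the round trip through bytes, which may matter if later steps expect unicode strings.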
