How to convert one particular text column in a data frame to 'utf-8' using Python 3

Question

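I have a data frame with multiple columns, one of which contains text scraped from various links. I tried to convert that column to utf-8, without success.

Here is my approach: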

df = pd.read_excel('data.xlsx',encoding=sys.getfilesystemencoding())
df['text'] = df['text'].apply(lambda x: x.encode('utf-8').strip())
print(df['text'])

I get text with some ASCII codes in it:

b"b'#Thank you, it\xe2\x80\x99s good...

df = pd.read_excel('data.xlsx',encoding=sys.getfilesystemencoding())
df['text'] = df['text']
print(df['text'])

I get the text:

b'#Thank you, it\xe2\x80\x99s good to be here....

df['text'] = df['text'].apply(lambda x: x.decode('utf-8').strip())

AttributeError: 'str' object has no attribute 'decode'

I tried two or three approaches, but none of them worked. Are there any other options?

Using Python 3.6 and Jupyter Notebook.

Answer 1

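I am assuming that what you wrote for the example from your second snippet, df['text'] = df['text'], ends with a '. In other words, the value is b'#Thank you, it\xe2\x80\x99s good to be here....'.

For some reason you have bytes that were converted to a string, which is why you see AttributeError: 'str' object has no attribute 'decode' when you try to decode it. (Ideally it would be best not to end up in this situation; see here for some advice that looks related. Alas, going with what you have...)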
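
As a minimal sketch of how such a value can arise (an assumption about the upstream scraping code, which the question does not show), calling str() on a bytes object produces exactly this kind of text, and the result is a plain str with no decode method. The variable names are illustrative only.

```python
# Hypothetical reproduction; the real scraping code is not shown in the question.
original = "#Thank you, it\u2019s good to be here"
raw = original.encode("utf-8")      # real bytes, as a scraper might receive them

as_text = str(raw)                  # the text of the bytes literal, not decoded text
decoded = raw.decode("utf-8")       # the proper way to turn bytes into str

print(as_text)                      # starts with b' and contains literal backslashes
print(hasattr(as_text, "decode"))   # a str has no .decode in Python 3
print(decoded == original)
```

If the scraping code can be changed, decoding the bytes at that point would avoid the repair step altogether.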
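I think at this point you can strip the b' from the start of the string and the ' from the end, then typecast back to bytes. Note that this results in the backslashes getting escaped, so that needs to be dealt with, in addition to decoding the bytes to a string in the proper way. Using an approach based on here, you can unescape and decode the bytes.

Putting that together with what you show for df['text'] (somewhat as @rolf82 illustrates in the comments), when df['text'] = df['text'] and the value is a string to begin with, what you have is something like this: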

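import codecs

# The cell holds the text of a bytes literal, with literal backslashes (hence the raw string).
a = r"b'#Thank you, it\xe2\x80\x99s good to be here'"
# But we only want the part between the ''.
s = bytes(a[2:-1], "utf-8")
# escape_decode turns the \xNN escapes into real bytes, which then decode as UTF-8.
print(codecs.escape_decode(s)[0].decode("utf-8"))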

This gives:

#Thank you, it’s good to be here

which is what we want.

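Integrating this with Pandas takes a little more, because we cannot simply mark the value as a raw string by putting an r in front of it. Based on here and here, it seems that the r prefix can be replaced with .encode('unicode-escape').decode(), as in: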

"#Thank you, it\xe2\x80\x99s good to be here".encode('unicode-escape').decode()

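Whether this substitution helps depends on what the string actually contains, so it may be worth checking on one cell first. The sketch below, with made-up variable names, compares two cases: a string in which the \xNN escapes were already interpreted as characters, and a string that holds literal backslashes, as a value read from a file typically would. Which case applies to the data in the question is an assumption either way.

```python
# Case 1: escapes already interpreted, so the string holds U+00E2, U+0080, U+0099.
interpreted = "it\xe2\x80\x99s"
# Case 2: the string holds literal backslash characters.
literal = r"it\xe2\x80\x99s"

escaped_1 = interpreted.encode("unicode-escape").decode()
escaped_2 = literal.encode("unicode-escape").decode()

print(escaped_1 == literal)   # case 1 is turned into the literal-backslash form
print(escaped_2)              # case 2 ends up with every backslash doubled
```

In the second case the extra escaping step doubles the backslashes, so a later unescape only restores the original text instead of producing real bytes.
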
So, putting it all together, I would replace your second line with this:

import codecs
df['text'] = df['text'].apply(lambda x: codecs.escape_decode(bytes(x[2:-1].encode('unicode-escape').decode(), "utf-8"))[0].decode('utf-8').strip())
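
If the cells hold literal backslashes (an assumption; the question does not show the repr of a cell), a possibly simpler route is ast.literal_eval from the standard library, which parses the text of a bytes literal back into bytes. The helper below is only a sketch, and decode_bytes_repr is a hypothetical name, not part of Pandas.

```python
import ast

def decode_bytes_repr(value):
    """Turn the text of a bytes literal into a decoded str; pass other values through."""
    if not isinstance(value, str):
        return value                      # leave NaN/None and other non-strings untouched
    stripped = value.strip()
    if not stripped.startswith(("b'", 'b"')):
        return stripped                   # does not look like a bytes literal
    try:
        parsed = ast.literal_eval(stripped)
    except (ValueError, SyntaxError):
        return stripped                   # truncated or malformed literal: keep as-is
    if isinstance(parsed, bytes):
        return parsed.decode("utf-8", errors="replace").strip()
    return stripped

samples = [r"b'#Thank you, it\xe2\x80\x99s good to be here'", "plain text", None]
print([decode_bytes_repr(s) for s in samples])
```

With Pandas this could be applied as df['text'] = df['text'].apply(decode_bytes_repr). Cells that are not well-formed bytes literals are returned unchanged rather than raising.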

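If that does not work, also try dropping the .decode() after .encode('unicode-escape'). Because .encode() already returns bytes, the bytes(..., "utf-8") wrapper is dropped as well, since bytes() raises a TypeError when given an encoding for a non-string argument:

```python
import codecs
df['text'] = df['text'].apply(lambda x: codecs.escape_decode(x[2:-1].encode('unicode-escape'))[0].decode('utf-8').strip())
```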
