Pandas Dataframe with text and problems with encoding some characters

Question

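I have a dataset with a column that contains some text (song lyrics).

Sometimes the text contains words (or symbols) that were not decoded correctly. Here is an example: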

'I keep trying Ainâ\x80\x99t no denyingWe should be together nowI canâ\x80\x99t imagineYouâ\x80\x99re with another man Baby'

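In this case, after looking up the original lyrics, those "codes" (â\x80\x99) stand for a single quote (an apostrophe). But I have many rows and cannot check each one, and I also have text in other languages such as Russian, Chinese, and Greek...

I thought about using a regular expression to find all of these substrings, but I don't know whether the pattern is always the same (one letter, two backslashes, x plus two digits).

Or is there just some encoding parameter that "reads" all the characters?

Thanks for your help!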

Answer 1

If I understood your question correctly, you need to find the file's correct encoding.

You can find the file's encoding like this:

# import the chardet library
import chardet

# placeholder path: replace with the path to your own file
your_file = "your_file.csv"

# use the detect method to guess the encoding
# 'rb' means read the file in as binary
with open(your_file, 'rb') as file:
    print(chardet.detect(file.read()))

This snippet prints chardet's best guess at the file's encoding, along with a confidence score, for example:

{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}

Now open your file with that encoding.
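As a rough illustration of that step, the sketch below writes a small sample file and reads it back with an explicit encoding. The sample text, the temporary file, and the choice of UTF-16 (taken from the example chardet output above) are assumptions made for the demo, not details of your data.

```python
import os
import tempfile

# Hypothetical sample: one lyric line saved as UTF-16, matching the
# encoding shown in the example chardet output.
text = "Ain\u2019t no denying\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(text.encode("utf-16"))
    path = tmp.name

# Pass the detected encoding explicitly instead of relying on the default.
with open(path, encoding="utf-16") as f:
    decoded = f.read()

os.remove(path)
print(repr(decoded))
```

pandas.read_csv accepts the same encoding keyword, so for a CSV file the equivalent call would be something like pd.read_csv(your_file, encoding="utf-16").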

If you don't have the chardet library installed:

pip install chardet
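If the text has already been loaded with the wrong codec, re-reading the file may not be an option. The sequence â\x80\x99 in the question is what the UTF-8 bytes of a right single quotation mark (E2 80 99) look like when decoded as Latin-1, so one possible repair is to reverse that step. This is a minimal sketch under that assumption. The helper name fix_mojibake is made up for illustration, and the approach will not help if the data was damaged in some other way.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or already correct): leave unchanged.
        return text


# Fragment of the example from the question; \u00e2 is the letter "â".
broken = "I can\u00e2\x80\x99t imagine"
fixed = fix_mojibake(broken)
print(fixed)
```

Strings that were decoded correctly, such as Russian, Chinese, or Greek text, fail the round trip and are returned unchanged. On a DataFrame this could be applied with something like df["lyrics"].map(fix_mojibake), where "lyrics" is a hypothetical column name and every value is assumed to be a string.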

Hope this helps.


python

regex

decoding

character-encoding

pandas