Python reading text from file not finding apostrophes

I have a text-cleaning function:

def clean_before_tok(text):
    text=text.replace("'"," ")
    exclude=[" le "," la "," l "," un "," une "," du "," de "," les "," des "," s "," d "]
    for e in exclude:
        text=text.replace(e," ")
    return text

I can test it on a toy example:

test=clean_before_tok("dlkj dfg le se d'ac")
print(test)
>>> dlkj dfg se ac

But when I read the text from a file with:
generated_text=open("text-like.txt", 'rb').read().decode(encoding='utf-8')

The apostrophes are not found or replaced. Could this be an encoding problem?

To check the file's encoding, you can print its contents as bytes:

>>> with open("my-file.txt", "rb") as file:
...     b_file = file.read()
>>> print(b_file)

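If the apostrophe shows up as a plain apostrophe, that would be strange. Usually the problem is explained by odd \xAB sequences in your text (where A and B stand for any hexadecimal digit; each such sequence represents one non-ASCII byte). In that case the character in the file only looks like an apostrophe, so replace("'", " ") never matches it.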
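
As a rough illustration of that diagnosis, the sketch below assumes the file contains a typographic apostrophe (U+2019), a common culprit in French text, although the question does not confirm which character is actually present. The sample string and the `normalize_apostrophes` helper are hypothetical and only show one way to inspect and normalize such characters before cleaning.

```python
import unicodedata

# Hypothetical sample: a typographic apostrophe (U+2019) instead of ASCII "'"
sample = "dlkj dfg le se d\u2019ac"

# In UTF-8 the curly apostrophe is stored as three non-ASCII bytes
print(sample.encode("utf-8"))  # b'dlkj dfg le se d\xe2\x80\x99ac'

# List every non-ASCII character with its code point and Unicode name
for ch in sorted(set(sample)):
    if ord(ch) > 127:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))


def normalize_apostrophes(text):
    """Map a few common apostrophe look-alikes to the ASCII apostrophe."""
    # Assumed list of variants; extend it with whatever the byte dump reveals
    for variant in ("\u2019", "\u2018", "\u02bc", "\u00b4", "`"):
        text = text.replace(variant, "'")
    return text


normalized = normalize_apostrophes(sample)
print(normalized)
```

Calling `normalize_apostrophes(generated_text)` before `clean_before_tok(generated_text)` should then let the existing `replace("'", " ")` match, provided the file really contains one of the listed variants.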