Python reading text from file not finding apostrophes

Question

Text cleaning function:

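def clean_before_tok(text):
    # Turn ASCII apostrophes into spaces so elided forms like "d'ac" split apart
    text = text.replace("'", " ")
    # French articles and elided particles to drop (matched with surrounding spaces)
    exclude = [" le ", " la ", " l ", " un ", " une ", " du ", " de ", " les ", " des ", " s ", " d "]
    for e in exclude:
        text = text.replace(e, " ")
    return text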

I can test it on a toy example:

test=clean_before_tok("dlkj dfg le se d'ac")
print(test)
>>> dlkj dfg se ac

But when I read a file using

generated_text=open("text-like.txt", 'rb').read().decode(encoding='utf-8')

it does not find and replace the apostrophes. Is there an encoding problem?

Answer 1

To check the file's encoding, you can print its contents as bytes:

>>> with open("my-file.txt", "rb") as file:
...     b_file = file.read()
>>> print(b_file)

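If the apostrophes show up as plain apostrophes (') in that output, that would be strange. Usually the problem is explained by odd \xAB escape sequences in your text, where A and B are hexadecimal digits; each such sequence represents one non-ASCII byte.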
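As an illustration of what those escape sequences can mean, here is a minimal sketch that assumes the file contains typographic apostrophes (U+2019, stored in UTF-8 as the bytes \xe2\x80\x99) rather than the ASCII apostrophe the cleaning function looks for. The sample string, the APOSTROPHE_LIKE list, and the normalize_apostrophes helper are hypothetical, not taken from the question; the real file may contain a different character.

```python
import unicodedata

# Hypothetical sample: text with a typographic apostrophe (U+2019), as word
# processors often produce, instead of the ASCII apostrophe (U+0027).
raw = "d\u2019ac l\u2019eau".encode("utf-8")
print(raw)  # the curly apostrophe shows up as the bytes \xe2\x80\x99

text = raw.decode("utf-8")
print("'" in text)  # the ASCII apostrophe is not present in the decoded text

# List the non-ASCII characters in the text together with their Unicode names.
non_ascii = {ch: unicodedata.name(ch, "UNKNOWN") for ch in set(text) if ord(ch) > 127}
print(non_ascii)

# Assumed list of apostrophe look-alikes; extend it to match the actual data.
APOSTROPHE_LIKE = ["\u2019", "\u2018", "\u02bc", "\u00b4", "`"]

def normalize_apostrophes(text):
    """Replace apostrophe look-alikes with the ASCII apostrophe."""
    for ch in APOSTROPHE_LIKE:
        text = text.replace(ch, "'")
    return text

normalized = normalize_apostrophes(text)
print(normalized)
```

If this turns out to be the cause, calling such a normalization step on the decoded text before clean_before_tok should let the existing replace("'", " ") match again.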

python

encoding

text

nlp

readfile