Python reading text from file not finding apostrophes
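Text cleaning function:
def clean_before_tok(text):
    # Turn apostrophes into spaces so that elided forms like d' and l' become separate tokens
    text=text.replace("'"," ")
    # French articles and elided forms to strip out
    exclude=[" le "," la "," l "," un "," une "," du "," de "," les "," des "," s "," d "]
    for e in exclude:
        text=text.replace(e," ")
    return text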
I can test it on a toy example:
test=clean_before_tok("dlkj dfg le se d'ac")
print(test)
>>> dlkj dfg se ac
But when I read the file with
generated_text=open("text-like.txt", 'rb').read().decode(encoding='utf-8')
the apostrophes are not found and replaced. Could this be an encoding problem?
To check how the file is encoded, you can read it in binary mode and print the raw bytes:
>>> with open("my-file.txt", "rb") as file:
...     b_file = file.read()
...
>>> print(b_file)
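If the apostrophe shows up as a plain ' in that output, that would be surprising. More often the problem is explained by odd escape sequences such as \xAB in your text (where AB stands for any two hexadecimal digits); each one represents a non-ASCII byte.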
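One common cause, which is only an assumption here because the file itself is not shown, is that the text uses the typographic apostrophe ’ (U+2019) instead of the ASCII '. UTF-8 encodes that character as the three bytes \xe2\x80\x99, so replace("'", " ") never matches it. The sketch below is a variant of the cleaning function that treats a few apostrophe look-alikes the same way; the name clean_before_tok_unicode and the APOSTROPHES list are hypothetical and not exhaustive.

```python
# Assumed list of characters that may stand in for an apostrophe:
# ASCII ', right and left single quotation marks, modifier letter apostrophe.
APOSTROPHES = "'\u2019\u2018\u02bc"


def clean_before_tok_unicode(text):
    # Replace every apostrophe variant with a space in one pass
    text = text.translate({ord(c): " " for c in APOSTROPHES})
    exclude = [" le ", " la ", " l ", " un ", " une ", " du ", " de ",
               " les ", " des ", " s ", " d "]
    for e in exclude:
        text = text.replace(e, " ")
    return text


sample = "dlkj dfg le se d\u2019ac"       # same toy example, typographic apostrophe
print(sample.encode("utf-8"))             # b'dlkj dfg le se d\xe2\x80\x99ac'
print(clean_before_tok_unicode(sample))   # dlkj dfg se ac
```

If the bytes printed from your file show something other than \xe2\x80\x99 where the apostrophes should be, the file probably uses a different character or encoding, and the list above would need adjusting.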