UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>
This is the code I am running to extract text from images and save it to a file at a given path.
import os

import pytesseract
from PIL import Image

def main():
    path = r"D drive where images are stored"
    fullTempPath = r"D drive where extracted texts are stored in xls file"
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
        # OCR the image; the result is a str that may contain non-cp1252 characters
        text = pytesseract.image_to_string(img, lang="eng")
        # No encoding given, so the locale default is used (this is where the error arises)
        file1 = open(fullTempPath, "a+")
        file1.write(imageName + "\n")
        file1.write(text + "\n")
        file1.close()
        file2 = open(fullTempPath, 'r')
        file2.close()

if __name__ == '__main__':
    main()
I get the following error. Can someone help me solve this?
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-7-fb69795bce29> in <module>
     13         file2.close()
     14 if __name__ == '__main__':
---> 15     main()

<ipython-input-7-fb69795bce29> in main()
      8         file1 = open(fullTempPath, "a+")
      9         file1.write(imageName+"\n")
---> 10         file1.write(text+"\n")
     11         file1.close()
     12         file2 = open(fullTempPath, 'r')

~\anaconda3\lib\encodings\cp1252.py in encode(self, input, final)
     17 class IncrementalEncoder(codecs.IncrementalEncoder):
     18     def encode(self, input, final=False):
---> 19         return codecs.charmap_encode(input,self.errors,encoding_table)[0]
     20
     21 class IncrementalDecoder(codecs.IncrementalDecoder):

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 2090: character maps to <undefined>
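text = b'unicode error on this text'
text = text.decode('utf-8')
Try decoding the text. Note that decode() exists only on bytes objects in Python 3; a str has no such method. The traceback shows the failure happens while a str is being encoded on write, so this helps only if the text arrives as bytes.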
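I don't know why Tesseract returns a string containing invalid Unicode characters, but that appears to be what is happening. You can tell Python to ignore encoding errors. Try changing the line that opens the output file to the following: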
file1 = open(fullTempPath, "a+", errors="ignore")
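To make the trade-off of errors="ignore" concrete, here is a minimal sketch that encodes a short sample string directly instead of writing a file. The sample text is made up for illustration; cp1252 is the encoding named in the traceback. With "ignore" the unencodable character is silently dropped, so some of the OCR output is lost.

```python
# Made-up sample containing U+FB01, which cp1252 cannot represent.
text = "The \ufb01rst page"

# The default handler ("strict") raises the error from the question.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "ignore" drops the character; "replace" substitutes a question mark.
ignored = text.encode("cp1252", errors="ignore")
replaced = text.encode("cp1252", errors="replace")

print(strict_failed, ignored, replaced)
```

Both handlers avoid the exception, but neither preserves the original character.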
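The default file encoding used by open is the value returned by locale.getpreferredencoding(False), which on Windows is typically a legacy encoding that does not support all Unicode characters. In this case, the error message indicates it is cp1252 (a.k.a. Windows-1252). It is best to specify the encoding you want explicitly. UTF-8 handles all Unicode characters: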
file1 = open(fullTempPath, "a+", encoding='utf8')
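As a rough sketch of how that change might look in the write step, the helper below appends one entry with an explicit UTF-8 encoding and a with block so the file is always closed. The name append_entry, the temporary output path, and the sample text are assumptions for illustration; a real run would pass the string returned by pytesseract instead.

```python
import os
import tempfile

def append_entry(out_path, image_name, text):
    """Append one image name and its OCR text to out_path as UTF-8."""
    # The encoding is explicit, so the locale default (cp1252 here) is never used.
    with open(out_path, "a", encoding="utf-8") as out_file:
        out_file.write(image_name + "\n")
        out_file.write(text + "\n")

# Hypothetical stand-in for OCR output containing the ligature from the error.
out_path = os.path.join(tempfile.mkdtemp(), "extracted.txt")
append_entry(out_path, "page1.png", "The \ufb01rst page")

# Read the file back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as in_file:
    written = in_file.read()
```

Reading the file back needs the same encoding argument, otherwise the locale default is applied again on the read side.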
FYI, U+FB01 is LATIN SMALL LIGATURE FI (ﬁ), if that makes sense for the images being processed.
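If the ligature is not wanted in the output, one option (an addition here, not something the answer proposes) is to fold it into plain letters with the standard unicodedata module. This sketch looks up the character name and applies NFKC normalization to a made-up sample string.

```python
import unicodedata

ligature = "\ufb01"

# Look up the official Unicode name of the character from the error message.
name = unicodedata.name(ligature)

# NFKC normalization replaces compatibility characters such as this
# ligature with their plain equivalents ("f" followed by "i").
folded = unicodedata.normalize("NFKC", "The \ufb01rst page")

# After folding, this sample can be encoded as cp1252.
encoded = folded.encode("cp1252")

print(name, folded)
```

NFKC also rewrites other compatibility characters, so it changes the text more broadly than just this one ligature.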
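Also, Windows editors tend to assume the same legacy encoding unless the file was written with utf-8-sig, which adds an encoded BOM character to the start of the file as a hint that the content is UTF-8.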
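The difference between the two codecs can be seen by encoding a short string in memory. This is a small sketch using only the standard library; the sample string is arbitrary.

```python
import codecs

plain = "fi".encode("utf-8")
with_bom = "fi".encode("utf-8-sig")

# utf-8-sig output is the UTF-8 bytes preceded by the three-byte BOM.
has_bom = with_bom == codecs.BOM_UTF8 + plain

# Decoding with utf-8-sig strips the BOM again.
decoded = with_bom.decode("utf-8-sig")

print(plain, with_bom, decoded)
```

To use it when writing the output file, pass encoding='utf-8-sig' to open in place of 'utf8'.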