'utf-16-le' codec can't decode bytes while reading Excel in Python

Question

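I am trying to read a number of xls files in different languages (Arabic, Greek, Italian, Hebrew, and so on). When I call the open_workbook function I get the error shown below. Any idea how to make it handle any language?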

Code:

book = xlrd.open_workbook(workbook_url)

Error:

return codecs.utf_16_le_decode(input, errors, True) UnicodeDecodeError: 'utf-16-le' codec can't decode bytes in position 372-373: unexpected end of data

Answer 1

The language is unlikely to be the problem. More likely, xlrd is having trouble detecting the encoding of the .xls file.

As the xlrd documentation on handling of Unicode notes:

This package presents all text strings as Python unicode objects. From Excel 97 onwards, text in Excel spreadsheets has been stored as UTF-16LE (a 16-bit Unicode Transformation Format). Older files (Excel 95 and earlier) don’t keep strings in Unicode; a CODEPAGE record provides a codepage number (for example, 1252) which is used by xlrd to derive the encoding (for same example: “cp1252”) which is used to translate to Unicode.
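
To illustrate why a wrong encoding assumption produces this kind of error, here is a minimal sketch using only the standard library. The helper name decode_or_error is made up for this example and is not part of xlrd; it simply shows that bytes written in a single-byte code page may not be valid UTF-16LE.

```python
def decode_or_error(raw: bytes, encoding: str):
    """Return (text, None) on success or (None, exception) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Three single-byte characters, as an old code-page based file might store them.
raw = "abc".encode("cp1252")

print(decode_or_error(raw, "cp1252"))     # decodes cleanly
print(decode_or_error(raw, "utf-16-le"))  # fails: 3 bytes cannot be split into 16-bit units
```

The exact error text depends on the bytes involved, so the message in the question may differ from what this toy input produces.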

My first step would be to determine the actual encoding: how old is the file, and how was it created (by Excel itself, or by a third-party tool)?

You can look for the CODEPAGE record by opening the file in a text or hex editor, then try forcing that encoding.
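
If a hex editor is not handy, the same search can be roughly approximated in Python. This is only a heuristic sketch: it assumes the CODEPAGE record appears as the BIFF record id 0x0042 followed by a 2-byte length of 2 and a little-endian 16-bit code page number, and it scans the raw file bytes rather than properly parsing the file container. The byte pattern can also occur by chance, so treat the results as candidates, not answers. The function name find_codepage_candidates is hypothetical.

```python
import struct

# Record id 0x0042 and payload length 2, both little-endian 16-bit values.
CODEPAGE_HEADER = b"\x42\x00\x02\x00"


def find_codepage_candidates(data: bytes) -> list:
    """Return code page numbers that follow a CODEPAGE-like header in data."""
    candidates = []
    start = 0
    while True:
        pos = data.find(CODEPAGE_HEADER, start)
        # Stop when there is no match or not enough bytes left for the value.
        if pos == -1 or pos + 6 > len(data):
            break
        (codepage,) = struct.unpack_from("<H", data, pos + 4)
        if codepage not in candidates:
            candidates.append(codepage)
        start = pos + 1
    return candidates


# Synthetic bytes for demonstration only; a real call would pass the file contents,
# e.g. find_codepage_candidates(open(workbook_url, "rb").read()).
sample = b"\x00" * 8 + CODEPAGE_HEADER + struct.pack("<H", 1252) + b"\x00" * 8
print(find_codepage_candidates(sample))  # [1252]
```

A value of 1200 corresponds to UTF-16LE, while a Windows code page such as 1252 maps to the Python codec name "cp1252", as in the documentation excerpt above.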

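It looks to me like the assumption that the file is utf-16le (xlrd's default) is wrong, so you will have to determine the encoding somehow, or else start trying encodings by trial and error, for example: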

book = xlrd.open_workbook(..., encoding_override="cp1252")
book = xlrd.open_workbook(..., encoding_override="utf-8")
book = xlrd.open_workbook(..., encoding_override="latin-1")
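
The trial-and-error approach could be wrapped in a small loop along these lines. This is a sketch under assumptions: open_with_fallback and its opener parameter are names invented here (the opener exists only so the helper can be exercised without a real file), and the list of encodings is just the three guesses above. xlrd.open_workbook and its encoding_override argument are the only xlrd features used.

```python
def open_with_fallback(path, encodings=("cp1252", "utf-8", "latin-1"), opener=None):
    """Try the default open first, then each encoding override in turn.

    Returns (book, encoding_used); encoding_used is None if no override was needed.
    """
    if opener is None:
        import xlrd  # third-party package, imported lazily
        opener = xlrd.open_workbook

    errors = {}
    try:
        return opener(path), None
    except UnicodeDecodeError as exc:
        errors[None] = exc

    for encoding in encodings:
        try:
            return opener(path, encoding_override=encoding), encoding
        except (UnicodeDecodeError, LookupError) as exc:
            # LookupError covers codec names Python does not recognise.
            errors[encoding] = exc

    raise ValueError(f"could not open {path!r}; tried: {errors}")
```

A call might look like book, used = open_with_fallback(workbook_url). Opening without an exception does not prove the encoding is right, so it is still worth checking a few cell values by eye. Whether an override helps at all depends on how the file was written.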
