UnicodeDecodeError tells you the position of the character causing the error. How can I display that character?

Question

When opening and reading a file with something like

import pandas

with open(csv_file) as f:
    df = pandas.read_csv(f)

an error like this can occur:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 1678

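I know I can use a VS Code extension to locate the character at position 1678 in csv_file. But is there a way to do this in Python? Naively, something like:

>>> getCharInPosition(1678)
"The character in that position is 'x'"

Or, better yet, get the line number:

>>> getLineNumOfCharInPosition(1678)
"The line number for the character in that position is 25"

I'm looking for a way to make the standard UnicodeDecodeError message more useful than just telling me the character position.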

Answer 1

UnicodeError carries quite a bit of information in its attributes.

By catching the exception, you can use them to find the offending bytes:

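try:
    df = pandas.read_csv(f)
except UnicodeError as e:
    # e.object is the input being decoded; e.start:e.end marks the bad span
    offending = e.object[e.start:e.end]
    print("This file isn't encoded with", e.encoding)
    print("Illegal bytes:", repr(offending))
    raise  # re-raise so the failure is not silently swallowed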

To determine the line number, you could do this (inside the except clause):

    seen_text = e.object[:e.start]
    line_no = seen_text.count(b'\n') + 1

...but I'm not sure whether e.object is always a (byte) string (which could cause extra trouble with large files), so I don't know whether this always works.
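One way to sidestep that uncertainty might be to read the file in binary mode and decode it yourself, so that e.object is the whole file content and e.start is an offset into it. A minimal sketch; the helper name describe_decode_error and the shape of its return value are made up for illustration, and it assumes the file fits in memory:

```python
def describe_decode_error(data: bytes, encoding: str = "utf-8"):
    """Return None if data decodes cleanly, else details of the first failure."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as e:
        return {
            # Byte offset of the first undecodable byte
            "position": e.start,
            # The offending bytes themselves
            "bytes": e.object[e.start:e.end],
            # Physical line number (1-based): newlines seen before the bad byte
            "line": e.object[:e.start].count(b"\n") + 1,
            # The codec's own explanation of the failure
            "reason": e.reason,
        }
    return None
```

It could be called as describe_decode_error(open(csv_file, "rb").read()) before handing the file to pandas. Note that the position is a byte offset, not a character index.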

Also, in a CSV file, the number of newlines may be greater than the number of logical rows, in case some cells contain line breaks.
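If the logical row matters more than the physical line, one possible approach is to run the bytes before the bad offset through the csv module and count the rows it yields. This is only a sketch: logical_row_of_offset is a hypothetical helper, it assumes the default csv dialect (comma delimiter, double-quote quoting), and it counts the header as row 1.

```python
import csv
import io


def logical_row_of_offset(data: bytes, offset: int, encoding: str = "utf-8") -> int:
    """Return the 1-based CSV row that contains the byte at the given offset."""
    # Decode only what precedes the bad byte; replace anything else undecodable
    prefix = data[:offset].decode(encoding, errors="replace")
    # Append a stand-in for the bad byte so its row is counted even when the
    # prefix ends exactly on a newline (or is empty)
    text = prefix + "\ufffd"
    # newline="" leaves line endings untouched, so csv can handle
    # line breaks inside quoted cells itself
    reader = csv.reader(io.StringIO(text, newline=""))
    return sum(1 for _ in reader)
```

For input whose second row has a quoted cell with an embedded newline, a bad byte in the third row is on physical line 4 but in logical row 3.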

Tags: python, unicode, pandas, python-unicode