UTF-8 encoding doesn't work with all German characters

Question

I read a file with geopandas like this:

import geopandas as gpd

file = gpd.read_file('./County.shp', encoding='utf-8')
file.head()

In some cases the encoding works well. For example, without the encoding argument I get GÃ¶ttingen, but with it I get Göttingen.

However, it doesn't work in every case. For example, Gebietseinheit Kassel ohne Großstädte is read as b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'

How can I fix this?

Answer 1

\xdf is ß; likewise, \xe4 is ä:

>>> '\xdf'
'ß'

>>> '\xe4'
'ä'

So the characters themselves are fine.

What is actually happening is that the value was read in as a bytes string, which is what the b prefix means:

>>> b'\xdf'
b'\xdf'

>>> b'\xe4'
b'\xe4'

So the underlying values are the same; Python just displays bytes differently from str.
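As a small illustration of that point, the sketch below compares a str and a bytes object that carry the same numeric value. The names text and data are made up for the example.

```python
text = 'ß'      # a str holding one character, code point U+00DF
data = b'\xdf'  # a bytes object holding one byte, value 0xDF

# The numeric value is the same in both cases.
same_number = ord(text) == data[0]

# The objects are still different types, so they never compare equal.
same_object = text == data

# Decoding the byte as Latin-1 yields the character.
decoded_matches = data.decode('latin1') == text

print(same_number, same_object, decoded_matches)
```

The comparison of text and data is False even though the numbers match, which is why the bytes value has to be decoded before it behaves like normal text.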

The same goes for the whole string:

# With the b prefix:
>>> b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'

# Without the b prefix:
>>> 'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
'Gebietseinheit Kassel ohne Großstädte'
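A minimal sketch of why this value may have stayed as bytes even with encoding='utf-8': these particular bytes are valid Latin-1 but not valid UTF-8. Presumably the reader could not decode the field as UTF-8 and handed back the raw bytes. That is an assumption about the reader's behaviour, not something shown in the question.

```python
raw = b'Gro\xdfst\xe4dte'

# Latin-1 maps every byte to the code point with the same number,
# so this decode cannot fail.
as_latin1 = raw.decode('latin1')

# In UTF-8, 0xdf announces a two-byte sequence, but the following
# byte ('s') is not a continuation byte, so decoding raises an error.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1)
print(utf8_ok)
```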

If you want the string to print normally with its special characters, use bytes.decode with the latin1 encoding to convert it to a str:

>>> bytes_str = b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'
>>> bytes_str
b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte'

>>> normal_str = bytes_str.decode('latin1')
>>> normal_str
'Gebietseinheit Kassel ohne Großstädte'
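Since the question mentions that some values decode correctly and others come back as bytes, a helper that only touches the bytes values might be useful. The following is a sketch under that assumption. to_text and its encodings parameter are hypothetical names, not part of geopandas or pandas.

```python
def to_text(value, encodings=('utf-8', 'latin1')):
    """Return value as str, decoding bytes with the first encoding that works."""
    if not isinstance(value, bytes):
        return value  # already a str (or None, a number, ...): leave it alone
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if a custom encodings tuple without latin1 is passed.
    return value.decode('latin1', errors='replace')


names = ['Göttingen', b'Gebietseinheit Kassel ohne Gro\xdfst\xe4dte']
cleaned = [to_text(name) for name in names]
print(cleaned)
```

With the GeoDataFrame from the question, something like file['NAME'] = file['NAME'].map(to_text) would apply it to a single column. NAME is a placeholder here, not a column name taken from the shapefile.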
