UnicodeDecodeError when using specific python library (gender-detector)
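I need to guess gender from first names for some analysis, and after some research I found this Python library on GitHub: malev/gender-detector
I followed the instructions with a few adjustments (for example, the README says to use import gender_detector as gd, but I had to use from gender_detector import gender_detector as gd).
Then I ran into this: the library ships 4 datasets, 'us', 'uk', 'ar', and 'uy', but it only works when I use 'us' or 'uk'.
See the example below: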
from gender_detector import gender_detector as gd
detector = gd.GenderDetector('us')
detector2 = gd.GenderDetector('ar')
detector.guess('Marcos')
Out[25]: 'male'
detector2.guess('Marcos')
Traceback (most recent call last):
File "", line 1, in
detector2.guess('Marcos')
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/gender_detector.py", line 25, in guess
initial_position = self.index(name[0])
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/index.py", line 19, in __call__
self._generate_index()
File "/home/cpneto/anaconda3/lib/python3.6/site-packages/gender_detector/index.py", line 25, in _generate_index
total = file.readline() # Omit headers line
File "/home/cpneto/anaconda3/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 1078: invalid continuation byte
I believe this is caused by a py2 vs. py3 compatibility issue, but I'm not sure, and I don't know how to fix it.
Any suggestions?
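The library assumes that your ar file is UTF-8 encoded, but it is not (hence the "can't decode byte 0xf1 in position 1078" error). You need to either convert the file to UTF-8 or find some way to pass the actual encoding to the library.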
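The first option, converting the file, could look roughly like the sketch below. It rests on a guess: byte 0xf1 is "ñ" in Latin-1 (ISO-8859-1), which is common in Spanish names, so the sketch defaults to that encoding, but the real encoding of the ar file is not confirmed here. The helper name convert_to_utf8 is hypothetical and not part of gender-detector.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="latin-1"):
    """Rewrite the file at `path` as UTF-8.

    `source_encoding` is an assumption (Latin-1 by default); change it
    if the converted names look wrong. Returns True if the file was
    rewritten, False if it was already valid UTF-8.
    """
    path = Path(path)
    raw = path.read_bytes()

    # Leave the file alone if it already decodes cleanly as UTF-8.
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        pass

    # Decode with the assumed encoding before touching anything on disk.
    text = raw.decode(source_encoding)

    # Keep a copy of the original bytes next to the file, then overwrite.
    backup = path.with_name(path.name + ".bak")
    backup.write_bytes(raw)
    path.write_bytes(text.encode("utf-8"))
    return True
```

To use it, call convert_to_utf8 with the path of the ar data file inside the installed gender_detector package directory (the directory shown in the traceback), then create the detector again. Check a few names containing accented characters afterwards, since a wrong source encoding would decode without error but produce garbled text.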