chardet.detect returns an empty language

Question

I am using chardet.detect to detect the language of a string, as one of the suggested solutions recommends.

My code looks like this:

import chardet

print(chardet.detect('test'.encode()))
print(chardet.detect('בדיקה'.encode()))
print(chardet.detect('тест'.encode()))
print(chardet.detect('テスト'.encode()))

These are the results I get:

{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.9690625, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}

These are the results I expected:

{'encoding': 'ascii', 'confidence': 1.0, 'language': 'English'}
{'encoding': 'utf-8', 'confidence': 0.9690625, 'language': 'Hebrew'}
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': 'Russian'}
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': 'Japanese'}

I would prefer to stick with chardet because my application already imports it, and I want to keep the application as slim as possible.

Answer 1

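The chardet module is not very good at detecting either character sets or languages. Of the available alternatives, I found pyCLD3 easy to install, and it gives good detection even on fairly short text snippets, although it is not perfect for single words like your tests: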

>>> import cld3
>>> cld3.get_language("test")
LanguagePrediction(language='ko', probability=0.3396911025047302, is_reliable=False, proportion=1.0)

>>> cld3.get_language("בדיקה")
LanguagePrediction(language='iw', probability=0.9995728731155396, is_reliable=True, proportion=1.0)

>>> cld3.get_language("тест")
LanguagePrediction(language='bg', probability=0.9895398616790771, is_reliable=True, proportion=1.0)

>>> cld3.get_language("テスト")
LanguagePrediction(language='ja', probability=1.0, is_reliable=True, proportion=1.0)

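That looks like three out of four: only "test" is misidentified (and flagged as unreliable), since тест is also a valid Bulgarian word. The langid module got all of them right, so it may be a good option as well.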
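If adding another dependency is a concern, a rough fallback is to guess the writing system rather than the language, using only the standard library. The sketch below is not part of chardet, pyCLD3, or langid. The helper name dominant_script is hypothetical, and it relies on the assumption that the first word of a character's Unicode name (for example HEBREW or CYRILLIC) is a usable stand-in for its script. A script is not a language: Cyrillic text could be Russian or Bulgarian, which is the same ambiguity seen with тест above.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script-like label among the letters in text.

    Heuristic (an assumption, not an official script lookup): the first word
    of each character's Unicode name, e.g. 'LATIN', 'HEBREW', 'CYRILLIC',
    'KATAKANA', is treated as that character's script.
    Returns None if the text contains no named letters.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for word in ["test", "בדיקה", "тест", "テスト"]:
        print(word, dominant_script(word))
```

For the four test words this should report LATIN, HEBREW, CYRILLIC, and KATAKANA respectively. That is enough to separate Hebrew from Japanese, but telling English from French, or Russian from Bulgarian, still needs a real language identifier.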

