Hunspell for Portuguese shows correctly spelled words as misspelled

Question

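I'm using the latest version of spacy_hunspell with Portuguese dictionaries. I noticed that for inflected verbs containing special characters, such as the acute accent (´) or the tilde (~), the spellchecker returns the wrong result: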

import hunspell

spellchecker = hunspell.HunSpell('/usr/share/hunspell/pt_PT.dic',
                                 '/usr/share/hunspell/pt_PT.aff')

#Verb: fazer
spellchecker.spell('fazer') # True, correct
spellchecker.spell('faremos') # True, correct
spellchecker.spell('fará') # False, incorrect
spellchecker.spell('fara') # True, incorrect
spellchecker.spell('farão') # False, incorrect

#Verb: andar
spellchecker.spell('andar') # True, correct
spellchecker.spell('andamos') # True, correct
spellchecker.spell('andará') # False, incorrect
spellchecker.spell('andara') # True, correct

#Verb: ouvir
spellchecker.spell('ouvir') # True, correct
spellchecker.spell('ouço') # False, incorrect

Another problem shows up with irregular verbs, such as ir:

spellchecker.spell('vamos') # False, incorrect
spellchecker.spell('vai') # False, incorrect
spellchecker.spell('iremos') # True, correct
spellchecker.spell('irá') # False, incorrect

As far as I can tell, nouns with special characters do not have this problem:

spellchecker.spell('coração') # True, correct
spellchecker.spell('órgão') # True, correct
spellchecker.spell('óbvio') # True, correct
spellchecker.spell('pivô') # True, correct

Any suggestions?

Answer 1

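This question is about hunspell rather than spacy or spacy_hunspell.

I think it's an encoding issue, even though it doesn't look like one in all of your test cases. I'm not sure where you found these Portuguese dictionaries, but they're not UTF-8 and they're not the current/standard hunspell pt_PT dictionaries from LibreOffice: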

https://github.com/LibreOffice/dictionaries/tree/master/pt_PT

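Those are the Portuguese dictionaries that Debian/Ubuntu install with the hunspell-pt-pt package (e.g., with apt-get install hunspell-pt-pt), and they behave correctly on the test cases above, whether you use hunspell on the command line or pyhunspell as in your code above.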
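To check which encoding an installed dictionary declares, you can look at the SET line of its .aff file. The following is only a minimal sketch: `declared_encoding` is a helper name I made up for illustration, and it assumes the .aff file declares its encoding with a `SET <name>` line, as Hunspell affix files conventionally do.

```python
import re
from pathlib import Path


def declared_encoding(aff_path):
    """Return the encoding named on the SET line of a Hunspell .aff file, or None.

    Hypothetical helper for illustration; it only reads the file and does
    not depend on the hunspell module.
    """
    raw = Path(aff_path).read_bytes()
    # Drop a UTF-8 byte order mark if one is present.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    # Latin-1 maps every byte to a character, so this decode cannot fail
    # whatever the real encoding of the file is.
    text = raw.decode("latin-1")
    for line in text.splitlines():
        match = re.match(r"\s*SET\s+(\S+)", line)
        if match:
            return match.group(1)
    return None
```

Calling `declared_encoding('/usr/share/hunspell/pt_PT.aff')` on the path from the question would show what that particular file declares. If it reports something other than UTF-8, that would be consistent with the encoding explanation above, though it does not prove it on its own.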

Answer 2

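To clarify some important ideas: spell checking, like lemmatization, is usually done with a set of predefined rules (yes, no machine learning and no extensively annotated thesaurus). But, as you noticed, some of those rules don't work for irregular verbs and inflected words.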

It turns out that the spaCy models and rules (and not only spaCy's, but those of any tool for Portuguese) are pretty weak compared to other languages.

Conclusion: you are getting wrong results not because you made a mistake, but because the models available for spaCy (and hunspell) are flawed.

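Since these are open source projects, you can try to improve them yourself. If that is not an option, you can try another tool, such as dicio (it is thesaurus-based, but slow, since you have to integrate it with Ajax and send a request for every single word!).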
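Short of patching the dictionary files, one stopgap is to add the missing forms at runtime. This is a rough sketch, not a fix for the dictionaries themselves: `add_missing_words` is a hypothetical helper, and it assumes `checker` is a pyhunspell `HunSpell` object (or anything else exposing `spell(word)` and `add(word)`).

```python
def add_missing_words(checker, words):
    """Add every word the checker rejects to its runtime dictionary.

    Returns the list of words that were added. Assumes `checker` exposes
    spell(word) and add(word), as pyhunspell's HunSpell object does.
    """
    added = []
    for word in words:
        if not checker.spell(word):
            # The addition lasts only for this checker object's lifetime;
            # the .dic file on disk is not modified.
            checker.add(word)
            added.append(word)
    return added
```

With the `spellchecker` object from the question, `add_missing_words(spellchecker, ['fará', 'farão', 'vamos', 'vai'])` would register those forms for the current session only. The list has to be maintained by hand, so this only scales to a small set of known gaps.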

Welcome to Portuguese NLP!

python

nlp

hunspell

python-3.x