Making string column consistent/clean in pandas

Question

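I am working with a dataset that has an "unclean" string column. The values are company names, most of them entered by hand, so they contain typos and inconsistent spellings. The column looks like this: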

company_name
big compnay
big company
big company inc.
smll compny
small company
small inc.

I am trying to turn the column above into something like this:

company_name
big company
big company
big company
small company
small company
small company

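There are far more data points than I could clean by hand. I would really appreciate any suggestions, help, or advice. I have tried modules such as fuzzywuzzy, but I cannot work out the best way to approach the problem above.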

Thanks.

Answer 1

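You could use a probabilistic spelling corrector to fix words that are within an edit distance of one or two of words that occur much more frequently in the dataset. A Python implementation is available here: http://norvig.com/spell-correct.html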
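As a rough illustration of that idea, here is a minimal sketch in the spirit of the linked corrector, not a copy of it. It builds word frequencies from the column itself and replaces a word only when a candidate one or two edits away is clearly more frequent. The names `edits1`, `edits2`, `correct_word`, `clean_names` and the `min_ratio` threshold are my own assumptions, and the sample data is taken from the question.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def correct_word(word, counts, min_ratio=2):
    """Return a more frequent nearby word, or word itself if none qualifies.

    min_ratio is an assumed threshold: the candidate must occur at least
    min_ratio times as often as the word being replaced.
    """
    for generate in (edits1, edits2):  # prefer candidates one edit away
        candidates = sorted(w for w in generate(word) if w in counts and w != word)
        if candidates:
            best = max(candidates, key=lambda w: counts[w])
            if counts[best] >= min_ratio * counts[word]:
                return best
    return word

def clean_names(names, min_ratio=2):
    """Correct each word of each name using frequencies from the names themselves."""
    tokenized = [name.lower().split() for name in names]
    counts = Counter(word for words in tokenized for word in words)
    # Correct each distinct word once, then reuse the result.
    fixed = {word: correct_word(word, counts, min_ratio) for word in counts}
    return [" ".join(fixed[word] for word in words) for words in tokenized]

names = ["big compnay", "big company", "big company inc.",
         "smll compny", "small company", "small inc."]
cleaned = clean_names(names)
print(cleaned)
```

With pandas, the same function could be applied as `df["company_name"] = clean_names(df["company_name"].tolist())`. Note that this only fixes misspellings; it does not drop suffixes such as "inc." or expand "small inc." to "small company", which would need a separate normalization step. Generating all two-edit candidates is also slow for long words, so on a large column it may be worth limiting the second pass.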
