NLP how to speed up spelling correction on 147k rows filled with short messages

Question

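I am trying to speed up spell checking on a large dataset with 147k rows. The function below has been running for an entire afternoon and is still running. Is there a way to speed up the spell checking? The messages have already been case-normalized, stripped of punctuation, and lemmatized, and they are all in string format.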

import autocorrect
from autocorrect import Speller
spell = Speller()

def spell_check(x):
    correct_word = []
    mispelled_word = x.split()
    for word in mispelled_word:
        correct_word.append(spell(word))
    return ' '.join(correct_word)

df['clean'] = df['old'].apply(spell_check)

Answer 1

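The autocorrect library is not very efficient, and it is not designed for a task like the one you describe. What it does is generate every possible candidate with one or two typos and check which of them are valid words, and it does all of this in plain Python.

Take a six-letter word like "source":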

from autocorrect.typos import Word
print(sum(1 for c in Word('source').typos()))
# => 349
print(sum(1 for c in Word('source').double_typos()))
# => 131305

autocorrect generates as many as 131654 candidates to test, just for this one word. What if the word is longer? Let's try "transcompilation":

print(sum(1 for c in Word('transcompilation').typos()))
# => 889
print(sum(1 for c in Word('transcompilation').double_typos()))
# => 813325

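814214 candidates, for a single word! Note that numpy cannot speed this up, because the values are Python strings and you are calling a Python function on every row. The only way to speed this up is to change the method you use for spell-checking: for example, switch to the aspell-python-py3 library (a wrapper for aspell, AFAIK the best free spellchecker for Unix).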
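To make the growth concrete without installing anything, here is a minimal sketch of single-edit and double-edit candidate generation. It assumes a 26-letter lowercase alphabet and the four classic edit operations (delete, transpose, replace, insert), and it counts duplicates the same way the snippets above do. The names `single_edits`, `double_edits` and `count` are made up for this illustration and are not autocorrect's API.

```python
import string

# Assumption: candidates are built from the 26 lowercase ASCII letters.
ALPHABET = string.ascii_lowercase


def single_edits(word):
    """Yield every string one edit away from `word` (duplicates included)."""
    for i in range(len(word) + 1):
        head, tail = word[:i], word[i:]
        if tail:
            yield head + tail[1:]                       # delete one character
        if len(tail) > 1:
            yield head + tail[1] + tail[0] + tail[2:]   # swap two neighbours
        for letter in ALPHABET:
            if tail:
                yield head + letter + tail[1:]          # replace one character
            yield head + letter + tail                  # insert one character


def double_edits(word):
    """Yield every string two edits away by editing each single edit again."""
    for candidate in single_edits(word):
        yield from single_edits(candidate)


def count(iterable):
    """Count items without materialising them in memory."""
    return sum(1 for _ in iterable)


print(count(single_edits("source")))            # 349
print(count(double_edits("source")))            # 131305
print(count(single_edits("transcompilation")))  # 889
```

Under these assumptions a word of length n has 54n + 25 single-edit candidates, and the double-edit count grows roughly with the square of that, which is why longer words get so expensive.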

Answer 2

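In addition to what @Amadan said, which is definitely true (autocorrect does its correction in a very inefficient way):

You treat every word in the giant dataset as if it were being looked up for the first time, because you call spell() on each one. In reality (at least after a while) almost every word has been looked up before, so storing those results and reusing them is far more efficient.

Here is one way to do it: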

import autocorrect
from autocorrect import Speller
spell = Speller()

# get all unique words in the data as a set (first split each row into words, then put them all in a flat set)
unique_words = {word for words in df["old"].apply(str.split) for word in words}

# get the corrected version of each unique word and put this mapping in a dictionary
corrected_words = {word: spell(word) for word in unique_words}

# write the cleaned row by looking up the corrected version of each unique word
df['clean'] = [" ".join([corrected_words[word] for word in row.split()]) for row in df["old"]]
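A variation on the same idea is to memoize the spell function itself, so repeated words are answered from a cache without building the set of unique words first. The sketch below uses `functools.lru_cache` from the standard library. `slow_spell` is a hypothetical stand-in for the real `spell` object so that the example runs without autocorrect installed; with the real library you would presumably wrap `spell` the same way.

```python
from functools import lru_cache


def slow_spell(word):
    """Hypothetical stand-in for an expensive spell() call."""
    fixes = {"teh": "the", "wrold": "world"}  # toy corrections for the demo
    return fixes.get(word, word)


# Wrap the expensive function; maxsize=None keeps every result ever computed.
cached_spell = lru_cache(maxsize=None)(slow_spell)


def spell_check(text):
    """Correct each whitespace-separated word, reusing cached results."""
    return " ".join(cached_spell(word) for word in text.split())


rows = ["teh wrold", "hello teh wrold", "teh teh teh"]
cleaned = [spell_check(row) for row in rows]
info = cached_spell.cache_info()

print(cleaned)                 # ['the world', 'hello the world', 'the the the']
print(info.misses, info.hits)  # 3 5
```

Only the three distinct words reach the expensive function; the other five lookups are cache hits. On a pandas column the same `spell_check` can be passed to `.apply`, as in the question.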

python

nlp

nltk