"Spell check" and return the corrected term in Python

Question

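I recently extracted text data from a directory of PDF files. When reading the PDFs, the returned text is sometimes a bit messy.

For example, I might see a string like: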

"T he administrati on is doing bad things, and not fulfilling what it prom ised"

The result I want is:

"The administration is doing bad things, and not fulfilling what it promised"

I tested code I found on Stack Overflow (using pyenchant and wx), but it did not return what I wanted. My modified version is as follows:

import enchant.checker

a = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
chkr = enchant.checker.SpellChecker("en_US")
chkr.set_text(a)
for err in chkr:
    sug = err.suggest()[0]  # take the first suggestion for each flagged word
    err.replace(sug)

c = chkr.get_text()  # returns the corrected text
print(c)

This code returns:

"T he administrate on is doing bad things, and not fulfilling what it prom side"

I am using Python 3.5.x on Windows 7 Enterprise, 64-bit. Any suggestions would be greatly appreciated!

Answer 1

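It looks like the enchant library you are using is not well suited to this. It does not look for spelling errors across word boundaries; it only examines each word in isolation. I suppose that makes sense, since the class itself is called 'SpellChecker'.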
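To illustrate why checking each word in isolation cannot repair a string like this, here is a minimal sketch that does not use enchant at all. It uses the standard library's `difflib.get_close_matches` as a stand-in for a per-word suggester; the `VOCAB` list and the `correct_per_word` helper are hypothetical names made up for this example.

```python
import difflib

# Hypothetical mini-vocabulary, for illustration only.
VOCAB = ["the", "he", "on", "is", "administration", "administrate", "promised"]

def correct_per_word(text, vocab=VOCAB):
    """Correct each whitespace-separated token on its own."""
    fixed = []
    for token in text.split():
        if token.lower() in vocab:
            fixed.append(token)  # already a known word, keep it
            continue
        # Closest vocabulary entry for this single token, if any is close enough.
        matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=0.8)
        fixed.append(matches[0] if matches else token)
    return " ".join(fixed)

print(correct_per_word("T he administrati on is"))
# prints: T he administration on is
```

Each token is replaced by at most one other token, so the number of tokens never changes. Fragments such as "T" and "he" can therefore never be merged back into "The", no matter how good the per-word suggestions are.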

The only thing I can think of is to look for a better autocorrect library. Maybe this one could help? https://github.com/phatpiglet/autocorrect

No guarantees, though.

Answer 2

I took Generic Human's answer and, with a small modification, it solves your problem.

You need to copy these 125k words, sorted by frequency, into a text file and name the file words-by-frequency.txt.

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
with open("words-by-frequency.txt") as f:
    words = [line.strip() for line in f.readlines()]
    wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

Run the function on the input:

messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())


The administration is doing bad things and not fulfilling what it promised
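The cost dictionary in the code above rests on Zipf's law: the word at frequency rank r in a list of N words is assumed to have probability of roughly 1/(r log N), so its cost, the negative log of that probability, is log(r log N). The sketch below only makes that arithmetic visible; `zipf_cost` is a hypothetical helper name, not part of the answer's code.

```python
from math import exp, log

def zipf_cost(rank, vocab_size):
    """Cost of the word at 1-based frequency rank `rank`:
    -log of its assumed Zipf probability 1 / (rank * log(vocab_size))."""
    return log(rank * log(vocab_size))

vocab_size = 125000  # size of the word list mentioned in the answer
for rank in (1, 10, 1000, 100000):
    cost = zipf_cost(rank, vocab_size)
    # exp(-cost) recovers the assumed probability of the word at this rank.
    print(rank, round(cost, 2), exp(-cost))
```

Rarer words get a higher cost, and the cost of a segmentation is the sum of its word costs, so the dynamic program prefers splits made of few, common words.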

Edit: the code below does not need the text file; the word list is hardcoded for your input, i.e. "T he administrati on is doing bad things, and not fulfilling what it prom ised"

from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = ["the", "administration", "is", "doing", "bad",
         "things", "and", "not", "fulfilling", "what",
         "it", "promised"]
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))


messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"

print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())

The administration is doing bad things and not fulfilling what it promised

I just tried the edit above on repl.it, and it printed the output shown.
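Note that the calls above lowercase the text and strip the comma before segmenting, so the punctuation is lost in the output. One possible way around that, sketched here under assumptions, is to split the text at punctuation marks and re-segment each chunk separately. The `segment` function is a compact rewrite of the same dynamic-programming idea rather than the exact code above, and `WORDS`, `segment`, and `repair` are hypothetical names with a toy vocabulary.

```python
import re
from math import log

# Hypothetical stand-in vocabulary, ordered by assumed frequency.
WORDS = ["the", "and", "is", "it", "not", "what", "bad", "doing", "things",
         "administration", "fulfilling", "promised"]
WORDCOST = {w: log((i + 1) * log(len(WORDS))) for i, w in enumerate(WORDS)}
MAXWORD = max(len(w) for w in WORDS)

def segment(s):
    """Minimal-cost split of a space-free lowercase string into known words."""
    # best[i] = (cost, length of last word) for the first i characters.
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        best.append(min(
            (best[i - k][0] + WORDCOST.get(s[i - k:i], float("inf")), k)
            for k in range(1, min(i, MAXWORD) + 1)
        ))
    # Walk back through the stored word lengths to recover the words.
    out = []
    i = len(s)
    while i > 0:
        k = best[i][1]
        out.append(s[i - k:i])
        i -= k
    return " ".join(reversed(out))

def repair(text):
    """Re-segment each chunk between punctuation marks, keeping the marks."""
    result = ""
    for piece in re.split(r"([,.;:!?])", text):
        if re.fullmatch(r"[,.;:!?]", piece):
            result += piece  # attach punctuation to the preceding word
        else:
            letters = piece.replace(" ", "").lower()
            if letters:
                result += (" " if result else "") + segment(letters)
    return result[:1].upper() + result[1:]

messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(repair(messy_txt))
# prints: The administration is doing bad things, and not fulfilling what it promised
```

As with the hardcoded word list above, this only works when every word in the text is in the vocabulary; an unknown word would be broken into single characters.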