"Spell check" and return the corrected term in Python
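I recently extracted text data from a directory of PDF files. When reading the PDFs, the returned text is sometimes a bit messy.
For example, I might see a string like: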
"T he administrati on is doing bad things, and not fulfilling what it
prom ised"
The result I want is:
"The administration is doing bad things, and not fulfilling what it
promised"
I tested code I found on Stack Overflow (using pyenchant and wx), but it did not return what I wanted. My modified version is below:
import enchant.checker

a = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
chkr = enchant.checker.SpellChecker("en_US")
chkr.set_text(a)
for err in chkr:
    # Replace each flagged word with its first suggestion.
    sug = err.suggest()[0]
    err.replace(sug)
c = chkr.get_text()  # returns corrected text
print(c)
This code returns:
"T he administrate on is doing bad things, and not fulfilling what it
prom side"
I am using Python 3.5.x on Windows 7 Enterprise, 64-bit. Any suggestions would be greatly appreciated!
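It looks like the enchant library you are using is not well suited to this. It does not look for spelling errors across words; it only examines each word on its own. I suppose that makes sense, since the class itself is called 'SpellChecker'.
The only thing I can think of is to look for a better autocorrect library.
Maybe this one could help?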
https://github.com/phatpiglet/autocorrect
No guarantees, though.
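To illustrate why checking each word on its own cannot repair this kind of text, here is a minimal sketch in plain Python. It does not use enchant; the tiny `VOCAB` set and the `misspelled_tokens` helper are hypothetical stand-ins for a real dictionary lookup.

```python
# Hypothetical vocabulary for illustration; a real checker uses a full dictionary.
VOCAB = {"the", "he", "administration", "on", "is", "promised"}

def misspelled_tokens(text):
    """Return the tokens not found in VOCAB, checking each token in isolation."""
    return [tok for tok in text.lower().split() if tok not in VOCAB]

# "he" and "on" are valid words, so a per-word check never flags them,
# even though they are really pieces of "The" and "administration".
print(misspelled_tokens("T he administrati on is prom ised"))
```

Only the fragments that are not words by themselves get flagged, and each one is then corrected without any knowledge of its neighbours.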
I took Generic Human's answer and modified it slightly to solve your problem.
You need to copy these 125k words, sorted by frequency, into a text file and name the file words-by-frequency.txt.
from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
with open("words-by-frequency.txt") as f:
    words = [line.strip() for line in f.readlines()]
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))
Run the function on the input:
messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())
The administration is doing bad things and not fulfilling what it promised
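As a rough illustration of the cost model named in the code comment above (Zipf's law with cost = -log(probability)), the sketch below computes a word's cost from its frequency rank and sums the costs over a candidate split. The names `zipf_cost` and `split_cost` and the ranks used are hypothetical, chosen only for illustration.

```python
from math import log

def zipf_cost(rank, vocab_size):
    """Cost of the word at 1-based frequency rank `rank`.

    Under Zipf's law the probability of that word is roughly
    1 / (rank * log(vocab_size)), so -log(probability) is
    log(rank * log(vocab_size)), the expression used for wordcost.
    """
    return log(rank * log(vocab_size))

def split_cost(ranks, vocab_size):
    """Total cost of a candidate split: the sum of its word costs."""
    return sum(zipf_cost(r, vocab_size) for r in ranks)

VOCAB_SIZE = 125000  # size of the word list mentioned in the answer

# Hypothetical ranks: one fairly rare word versus two very common ones.
print(split_cost([5000], VOCAB_SIZE))
print(split_cost([10, 20], VOCAB_SIZE))
```

Every word costs at least log(log(vocab_size)), so a split into many fragments pays that overhead repeatedly, and any fragment missing from the word list gets the very large default cost passed to `wordcost.get`.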
Edit: the code below does not need the text file and works only for your input, i.e. "T he administrati on is doing bad things, and not fulfilling what it prom ised"
from math import log

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = ["the", "administration", "is", "doing", "bad",
         "things", "and", "not", "fulfilling", "what",
         "it", "promised"]
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())
The administration is doing bad things and not fulfilling what it promised
I just tried the above edit on repl.it and it printed the output shown.
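The approach above lowercases the text and drops the comma before segmenting. If punctuation and capitalization need to survive, one possible variant is to keep the original tokens and glue neighbours together only when the merged form is a known word. This is a different, greedy technique, not the dynamic-programming method from the answer, and it is only a sketch: `rejoin_fragments`, `max_parts`, and the small `VOCAB` set are hypothetical names introduced here, and a real word list would be needed in practice.

```python
import string

# Hypothetical vocabulary for illustration only; in practice load a real word list.
VOCAB = {"the", "administration", "is", "doing", "bad", "things",
         "and", "not", "fulfilling", "what", "it", "promised"}

def rejoin_fragments(text, vocab=VOCAB, max_parts=3):
    """Greedily merge adjacent tokens when the merged form is a known word."""
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest merge first, falling back to the single token.
        for n in range(min(max_parts, len(tokens) - i), 0, -1):
            merged = "".join(tokens[i:i + n])
            # Ignore leading/trailing punctuation and case for the lookup only.
            core = merged.strip(string.punctuation).lower()
            if n == 1 or core in vocab:
                out.append(merged)
                i += n
                break
    return " ".join(out)

messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(rejoin_fragments(messy_txt))
```

Because it prefers the longest merge, this sketch can wrongly join two valid words whose concatenation is also a word (for example "a" and "part"), so it trades accuracy for keeping the original punctuation.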