How to efficiently use spell correction for a large text corpus in Python
Consider the following spell-correction code:
from autocorrect import spell
import re

WORD = re.compile(r'\w+')

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.", "This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))
    return sptext

print(spell_correct(text))
The output of the above code is shown below.

How can I stop the output from being displayed in a Jupyter notebook? In particular, if we have a large number of text documents, there will be a lot of output.

My second question is: how can I improve the speed and accuracy of the code when applying it to big data (for example, please check the word "veri" in the output)? Is there a better way to do this? Thanks for your replies and for any faster (alternative) solutions.
As @khelwood said in the comments, you should use autocorrect.Speller:
from autocorrect import Speller
import re

spell = Speller(lang="en")
WORD = re.compile(r'\w+')

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.", "This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = []
    for doc in text:
        sptext.append(' '.join([spell(w).lower() for w in reTokenize(doc)]))
    return sptext

print(spell_correct(text))
#Output
#['hi welcome to spelling', 'this is just an example but consider a veri big corpus']
As an alternative, you can use a list comprehension for a speed boost, and you can also use the library pyspellchecker, which in this case improves the accuracy for the word 'veri':
from spellchecker import SpellChecker
import re

WORD = re.compile(r'\w+')
spell = SpellChecker()

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

text = ["Hi, welcmoe to speling.", "This is jsut an exapmle, but cosnider a veri big coprus."]

def spell_correct(text):
    sptext = [' '.join([spell.correction(w).lower() for w in reTokenize(doc)]) for doc in text]
    return sptext

print(spell_correct(text))
Output:
['hi welcome to spelling', 'this is just an example but consider a very big corpus']
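For a genuinely large corpus, most of the time goes into correcting the same words over and over, so a simple further speedup is to correct each unique word only once and cache the result. The sketch below assumes any word-to-word correction function (for example, `SpellChecker().correction` from pyspellchecker could be passed in); the toy correction table in the demo is just a stand-in so the example is self-contained:

```python
import re

WORD = re.compile(r'\w+')

def spell_correct_cached(docs, correct):
    """Correct each unique word once via `correct`, caching results.

    `correct` is any word -> word function; in practice you would pass
    a real corrector such as SpellChecker().correction (an assumption,
    not tied to a specific library here).
    """
    cache = {}
    out = []
    for doc in docs:
        corrected = []
        for w in WORD.findall(doc):
            key = w.lower()
            if key not in cache:
                fixed = correct(key)
                # some correctors return None for unknown words; keep the
                # original token in that case
                cache[key] = (fixed or key).lower()
            corrected.append(cache[key])
        out.append(' '.join(corrected))
    return out

# demo with a toy correction table standing in for a real spell checker
toy = {'welcmoe': 'welcome', 'speling': 'spelling'}
print(spell_correct_cached(["Hi, welcmoe to speling."],
                           lambda w: toy.get(w, w)))
```

Because the cache is keyed on lowercased tokens, a word that appears thousands of times across the corpus triggers only one (relatively expensive) correction lookup.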