Peter Norvig's word segmentation problem: how do I segment words that contain misspellings?
I am trying to understand how Peter Norvig's spelling corrector works.
In his Jupyter notebook titled here, he explains how to segment a sequence of characters that has no spaces separating the words. It works correctly when all the words in the sequence are spelled correctly:
>>> segment("deeplearning")
['deep', 'learning']
But when a word (or several words) in the sequence is misspelled, it gets it wrong:
>>> segment("deeplerning")
['deep', 'l', 'erning']
Unfortunately, I have no idea how to fix this and make the segment() function handle concatenated words that contain misspellings.
Does anyone know how to deal with this problem?
This can be achieved with minor changes to Peter Norvig's algorithm. The trick is to add the space character to the alphabet and to treat every bigram separated by a space as a single dictionary word.
Since big.txt doesn't contain the deep learning bigram, we will have to add a little more text to our dictionary. I will use the wikipedia library (pip install wikipedia) to get more text.
import re
import wikipedia as wiki
import nltk
from nltk.tokenize import word_tokenize

unigrams = re.findall(r"\w+", open("big.txt").read().lower())

for title in wiki.search("Deep Learning"):
    try:
        page = wiki.page(title).content.lower()
        # strip non-ASCII characters, then decode back to str for word_tokenize
        page = page.encode("ascii", errors="ignore").decode("ascii")
        unigrams = unigrams + word_tokenize(page)
    except Exception:
        continue  # skip pages that fail to load (e.g. disambiguation pages)
I will create a new dictionary with all unigrams and bigrams:
fo = open("new_dict.txt", "w")
for u in unigrams:
    fo.write(u + "\n")
bigrams = list(nltk.bigrams(unigrams))
for b in bigrams:
    fo.write(" ".join(b) + "\n")
fo.close()
Now just add the space character to the letters variable in the edits1 function, change big.txt to new_dict.txt, and change this function:

def words(text): return re.findall(r'\w+', text.lower())

to this:

def words(text): return text.split("\n")

Now correction("deeplerning") returns 'deep learning'!
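Putting the pieces together, here is a minimal self-contained sketch of the modified corrector. The tiny inlined dictionary is for illustration only; in practice the entries would be loaded from new_dict.txt as described above:

```python
from collections import Counter

# Norvig's corrector with the two changes described above:
# a space in the alphabet, and newline-separated dictionary entries.
def words(text):
    return text.split("\n")

# Toy dictionary of unigrams and space-separated bigrams, one per line.
WORDS = Counter(words("deep\nlearning\ndeep learning\nmachine\nmachine learning"))

def P(word, N=sum(WORDS.values())):
    return WORDS[word] / N

def correction(word):
    return max(candidates(word), key=P)

def candidates(word):
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def known(ws):
    return set(w for w in ws if w in WORDS)

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz "  # note the trailing space
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

print(correction("deeplerning"))  # -> 'deep learning'
```

"deep learning" is reached in two edits: insert a space ("deep lerning"), then insert an 'a' ("deep learning"), so known(edits2(...)) finds it in the bigram dictionary.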
This trick will perform well if you need a spelling corrector for a specific domain. If the domain is big, you can try adding only the most common unigrams/bigrams to your dictionary.
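For instance, a hypothetical helper using collections.Counter could keep only the top-N entries (the function name and cutoff values below are illustrative, not from the original answer):

```python
from collections import Counter

def build_dictionary(tokens, max_unigrams=50000, max_bigrams=100000):
    """Keep only the most frequent unigrams and bigrams from a token list."""
    unigram_counts = Counter(tokens)
    # zip(tokens, tokens[1:]) yields adjacent pairs, i.e. bigrams
    bigram_counts = Counter(a + " " + b for a, b in zip(tokens, tokens[1:]))
    entries = [w for w, _ in unigram_counts.most_common(max_unigrams)]
    entries += [b for b, _ in bigram_counts.most_common(max_bigrams)]
    return entries

tokens = "deep learning is a subfield of machine learning".split()
small_dict = build_dictionary(tokens, max_unigrams=3, max_bigrams=2)
```

The returned list can be written to new_dict.txt one entry per line, exactly as in the snippet above.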
This question may also help.