How to build a simple tokenizer
I want to know how to build a very simple tokenizer. Given a dictionary d (in this case, a list) and a sentence s, I want to return all possible tokens (= words) of the sentence. Here is what I tried:
l = ["the", "snow", "ball", "snowball", "is", "cold"]
sentence = "thesnowballisverycold"

def subs(string, ret=['']):
    if len(string) == 0:
        return ret
    head, tail = string[0], string[1:]
    ret = ret + list(map(lambda x: x + head, ret))
    return subs(tail, ret)

print(list(set(subs(sentence)) & set(l)))
But this returns:
["snow","ball","cold","is","snowball","the"]
I could compare substrings, but there must be a better way, right?
What I want:
["the","snowball","is","cold"]
You can use a regular expression here:
import re

l = ["the", "snow", "ball", "snowball", "is", "cold"]
pattern = "|".join(sorted(l, key=len, reverse=True))
sentence = "thesnowballisverycold"
print(re.findall(pattern, sentence))
# => ['the', 'snowball', 'is', 'cold']
See the Python demo.
The pattern looks like snowball|snow|ball|cold|the|is (see the regex demo online). The trick is to make sure all alternatives are listed from longest to shortest. The sorted(l, key=len, reverse=True) part sorts the items in l by length in descending order, and "|".join(...) builds the alternation pattern.
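To see why the longest-first ordering matters, here is a small side-by-side sketch: regex alternation tries branches left to right, so if "snow" precedes "snowball" in the pattern, the shorter word wins and "snowball" is never matched.

```python
import re

words = ["the", "snow", "ball", "snowball", "is", "cold"]
sentence = "thesnowballisverycold"

# Without sorting, "snow" appears before "snowball" in the alternation,
# so the engine matches the shorter word first and splits "snowball" in two.
unsorted_pattern = "|".join(words)
print(re.findall(unsorted_pattern, sentence))
# => ['the', 'snow', 'ball', 'is', 'cold']

# Sorted longest-first, "snowball" is tried before "snow" and wins.
sorted_pattern = "|".join(sorted(words, key=len, reverse=True))
print(re.findall(sorted_pattern, sentence))
# => ['the', 'snowball', 'is', 'cold']
```

One caveat: this works as-is because the dictionary words are plain letters. If entries could contain regex metacharacters (e.g. "." or "+"), wrap each one in re.escape(...) before joining.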