Fuzzy text search in Python
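I am wondering whether there is a Python library that can do fuzzy text search. For example:
- I have three keywords: "letter", "stamp", and "mail".
- I would like a function that checks whether these three words appear within the same paragraph (or within a certain distance, e.g. one page).
- In addition, the words must keep the same order. It is fine if other words appear between them.
I have tried fuzzywuzzy, but it did not solve my problem. Another library, Whoosh, looks powerful, but I could not find the right function.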
{1} You can do this in Whoosh 2.7. It supports fuzzy search through the plugin whoosh.qparser.FuzzyTermPlugin:
whoosh.qparser.FuzzyTermPlugin
lets you search for “fuzzy” terms, that is, terms that don’t have to match exactly. The fuzzy term will match any similar term within a certain number of “edits” (character insertions, deletions, and/or transpositions – this is called the “Damerau-Levenshtein edit distance”).
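To make the notion of "edits" concrete, here is a minimal pure-Python sketch of an edit distance that counts insertions, deletions, substitutions, and swaps of adjacent characters. It is the "optimal string alignment" variant of Damerau-Levenshtein, written only for illustration. The function name osa_distance is made up here, and this is not the code Whoosh runs internally.

```python
def osa_distance(a, b):
    """Edit distance counting insertions, deletions, substitutions and
    transpositions of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # adjacent characters swapped, e.g. "ai" <-> "ia"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("letter", "letters"))  # 1 (one insertion)
print(osa_distance("mail", "mial"))       # 1 (one transposition)
```

Under this measure, "letters" and "mial" are both within one edit of the keywords, which is why a fuzzy term with a small maximum distance can still find them.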
To add the fuzzy plugin:
from whoosh import qparser
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.add_plugin(qparser.FuzzyTermPlugin())
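Once you have added the fuzzy plugin to the parser, you can specify a fuzzy term by appending a ~ followed by an optional maximum edit distance. If you do not specify an edit distance, the default is 1.
For example, the following are "fuzzy" term queries: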
letter~
letter~2
letter~2/3
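The three forms above can be read as "term, optional maximum edit distance, optional prefix length": in letter~2/3 the number after the slash is how many leading characters must match exactly. The sketch below only illustrates how such a token breaks down. parse_fuzzy_term is a hypothetical helper, not a Whoosh function, and the default prefix length of 0 is an assumption of this sketch.

```python
import re

# Hypothetical helper for illustration only; Whoosh parses this syntax itself.
_FUZZY = re.compile(r"^(?P<text>[^~]+)~(?P<dist>\d+)?(?:/(?P<prefix>\d+))?$")


def parse_fuzzy_term(token, default_dist=1, default_prefix=0):
    """Split 'word~N/P' into (text, max edit distance, prefix length)."""
    m = _FUZZY.match(token)
    if m is None:
        # No ~ suffix: treat the token as an exact term (distance 0).
        return token, 0, 0
    dist = int(m.group("dist")) if m.group("dist") else default_dist
    prefix = int(m.group("prefix")) if m.group("prefix") else default_prefix
    return m.group("text"), dist, prefix


for token in ("letter~", "letter~2", "letter~2/3"):
    print(token, "->", parse_fuzzy_term(token))
```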
{2} To keep the words in order, use the query whoosh.query.Phrase, but you should replace the Phrase plugin with whoosh.qparser.SequencePlugin, which allows you to use fuzzy terms inside a phrase:
"letter~ stamp~ mail~"
To replace the default phrase plugin with the sequence plugin:
parser = qparser.QueryParser("fieldname", my_index.schema)
parser.remove_plugin_class(qparser.PhrasePlugin)
parser.add_plugin(qparser.SequencePlugin())
{3} To allow other words between the keywords, set the slop argument of the phrase query to a larger number:
whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
slop – the number of words allowed between each “word” in the phrase; the default of 1 means the phrase must match exactly.
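As a rough mental model of what "fuzzy terms, in order, within a slop" means, here is a stdlib-only sketch. It is not how Whoosh evaluates the query: the tokenizer, the function names edit_distance and in_order_within, and the reading of slop as "maximum position difference between consecutive matches" are all assumptions made for illustration.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance that also counts adjacent transpositions."""
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            best = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb))
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)  # swap of two neighbours
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]


def in_order_within(text, keywords, maxdist=2, slop=10):
    """True if every keyword fuzzily matches some token, in the given order,
    with consecutive matches at most `slop` token positions apart."""
    tokens = re.findall(r"\w+", text.lower())
    reachable = None  # positions where the previous keyword could have matched
    for kw in keywords:
        hits = [p for p, tok in enumerate(tokens)
                if edit_distance(tok, kw) <= maxdist]
        if reachable is not None:
            hits = [p for p in hits
                    if any(0 < p - q <= slop for q in reachable)]
        if not hits:
            return False
        reachable = hits
    return True


words = ["letter", "stamp", "mail"]
print(in_order_within("letter first, stamp second, mail third", words))   # True
print(in_order_within("stamp first, letters second, mail third", words))  # False
```

The second call fails because "stamp" only occurs before "letters", so the order constraint cannot be met no matter how large the slop is.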
You can also define the slop in the query itself, like this:
"letter~ stamp~ mail~"~10
{4} The overall solution:
{4.a} The indexer would look like:
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
schema = Schema(title=TEXT(stored=True), content=TEXT)
# create_in() expects the index directory to exist already
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"First document", content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
writer.commit()
{4.b} The searcher would look like:
from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin
with ix.searcher() as searcher:
    parser = QueryParser(u"content", ix.schema)
    # enable the word~n syntax
    parser.add_plugin(FuzzyTermPlugin())
    # swap the phrase plugin for the sequence plugin so fuzzy terms work inside quotes
    parser.remove_plugin_class(PhrasePlugin)
    parser.add_plugin(SequencePlugin())
    query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
    results = searcher.search(query)
    print("nb of results = %d" % len(results))
    for r in results:
        print(r)
This gives the result (the u'' prefixes appear under Python 2 only):
nb of results = 2
<Hit {'title': u'Sixth document'}>
<Hit {'title': u'Third document'}>
{5} If you want fuzzy search to be the default, without using the word~n syntax on every word of the query, you can initialize the QueryParser like this:
from whoosh.query import FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)
Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist = 1. If you want a larger edit distance, customize the class:
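class MyFuzzyTerm(FuzzyTerm):
    def __init__(self, fieldname, text, boost=1.0, maxdist=2, prefixlength=1, constantscore=True):
        # On Python 3 this can be shortened to super().__init__(...)
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)
# then pass the subclass to the parser instead of FuzzyTerm
parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)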