Whoosh: exact match for a phrase
I want to find a phrase in my documents, and I started from the code in the Whoosh quickstart.
>>> from whoosh.index import create_in
>>> from whoosh.fields import *
>>> schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
>>> ix = create_in("indexdir", schema)
>>> writer = ix.writer()
>>> writer.add_document(title=u"First document", path=u"/a", content=u"This is the first document we've added!")
>>> writer.add_document(title=u"Second document", path=u"/b", content=u"The second one is even more interesting!")
>>> writer.commit()
>>> from whoosh.qparser import QueryParser
>>> with ix.searcher() as searcher:
...     query = QueryParser("content", ix.schema).parse("first")
...     results = searcher.search(query)
...     results[0]
...
Result: {"title": u"First document", "path": u"/a"}
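But then I noticed that the parser splits the search text into individual words and searches the documents for each of them.
What should I do if I want to search for a phrase such as "the first guy here in the document"?
The documentation says to use
"it is a phrase"
if I want to search for:
it is a phrase.
This confuses me.
There is also a class that looks like it could help, but I don't know how to use it.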
class whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
Matches documents containing a given phrase.
Update:
I used it like this, and nothing matched.
from whoosh.index import create_in
from whoosh.fields import *
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
ix = create_in("indexdir", schema)
writer = ix.writer()
writer.add_document(title=u"First document", path=u"/a",
content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", path=u"/b",
content=u"The second one is even more interesting!")
writer.commit()
from whoosh.query import Phrase
a = Phrase("content", u"the first")
results = ix.searcher().search(a)
print(results)
Result:
<Top 0 Results for Phrase('content', u'the first', slop=1, boost=1.000000) runtime=0.0>
Another update, after trying the quoted-phrase suggestion:
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse('"first x document"')
    results = searcher.search(query)
    print(results[0])
Result:
<Hit {'content': u"This is the first document we've added!", 'path': u'/a', 'title': u'First document'}>
I think this should not match anything, because the phrase "first x document" does not appear in the document. Otherwise it is not an exact match.
You should pass Phrase a list of words rather than a string as the second argument, and also drop "the", because it is a stop word:
a = Phrase("content", [u"first", u"document"])
instead of
a = Phrase("content", u"the first")
From the documentation:
class whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)
Matches documents containing a given phrase.
Parameters:
fieldname – the field to search.
words – a list of words (unicode strings) in the phrase.
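The natural way to do a phrase search in Whoosh is to put the phrase in double quotes " " and pass it to QueryParser:
>>> with ix.searcher() as searcher:
...     query = QueryParser("content", ix.schema).parse('"first document"')
...     results = searcher.search(query)
...     results[0]
...
Update: "first x document" matches because x, like every single-character word, is treated as a stop word and filtered out.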
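To search for a phrase in the content, define the field with phrase=True in the Schema, like this:
schema = Schema(title=TEXT(stored=True), content=TEXT(phrase=True))
Then put the phrase in double quotes inside the single-quoted query string, like this:
query = QueryParser("content", schema=ix.schema).parse('"exact phrase"')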