Search names inside a long text using Lucene
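I have a Lucene index containing names such as:
- Douglas Adams
- Adams Sandlers
- Adams
etc.
Searching for a single name is easy. However, I have messages that I need to check to see whether they contain any of these names. They are fairly long, for example: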
Radio producer Dirk Maggs had consulted with Adams, first in 1993, and later in 1997 and 2000 about creating a third radio series, based on the third novel in the Hitchhiker's series.[21] They also discussed the possibilities of radio adaptations of the final two novels in the five-book "trilogy". As with the movie, this project was only realised after Adams's death. The third series, The Tertiary Phase, was broadcast on BBC Radio 4 in September 2004 and was subsequently released on audio CD. With the aid of a recording of his reading of Life, the Universe and Everything and editing, Adams can be heard playing the part of Agrajag posthumously. So Long, and Thanks for All the Fish and Mostly Harmless made up the fourth and fifth radio series, respectively (on radio they were titled The Quandary Phase and The Quintessential Phase) and these were broadcast in May and June 2005, and also subsequently released on Audio CD. The last episode in the last series (with a new, "more upbeat" ending) concluded with, "The very final episode of The Hitchhiker's Guide to the Galaxy by Douglas Adams is affectionately dedicated to its author.
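The problem is that this is the message, and I need to build a query or a set of queries from it and find the indexed names.
I tried looking up each term separately, but that produces a lot of false positives, because it finds every name that contains any of the terms.
For the text above, it should match the "adams" entry and the "douglas adams" entry, but not "adams sandlers". As you can see, this is the reverse of the usual search: normally I would search the text for each entry, but here I have to go from the text to the entries.
Does anyone know how to handle this? I am not expecting an exact solution, but any ideas would be appreciated.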
Here is a fairly simple approach.
1) Index all your names in Lucene (you've already done this).
2) Fire the entire message as a query against the name field (field: Radio producer Dirk Maggs .......).
3) Get all matched documents from Lucene and post-process them (you will get "douglas adams", "adams sandlers" and "adams" as your top docs).
4) During post-processing, go through each matched document: take each term of the document and look for it among the terms of your query. If all terms of the document are found in the query, keep the document; otherwise discard it (this is how "adams sandlers" gets discarded). This will be O(n^2) execution.
5) Done.
Step 4 will be somewhat expensive; it can be optimized if execution time becomes a problem.
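The post-processing in step 4 can be sketched in plain Java without touching Lucene at all. This is only an illustration under some assumptions: the class and method names (`NameFilter`, `tokenize`, `keepFullyContained`) are made up for the example, the candidate names are passed in as strings rather than read from Lucene hits, and the tokenizer (lowercase, split on anything that is not a letter or digit) is a stand-in for whatever analyzer your index really uses. Putting the message terms into a `HashSet` is one possible way to avoid the nested term-by-term loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class NameFilter {

    // Assumption: lowercase and split on non-letter/non-digit characters.
    // Replace this with the same analysis your index uses.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keeps only the candidate names whose terms ALL occur in the message.
    static List<String> keepFullyContained(List<String> candidates, String message) {
        // Build the set of message terms once; each lookup is then constant time.
        Set<String> messageTerms = new HashSet<>(tokenize(message));
        List<String> kept = new ArrayList<>();
        for (String name : candidates) {
            List<String> nameTerms = tokenize(name);
            if (!nameTerms.isEmpty() && messageTerms.containsAll(nameTerms)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String message = "Radio producer Dirk Maggs had consulted with Adams. "
                + "The very final episode by Douglas Adams is dedicated to its author.";
        // Stand-in for the names returned by the Lucene query in step 3.
        List<String> candidates = Arrays.asList("douglas adams", "adams sandlers", "adams");
        System.out.println(keepFullyContained(candidates, message));
    }
}
```

With the sample input this prints `[douglas adams, adams]`. Like step 4 itself, the sketch only checks that every term of a name is present somewhere in the message; it does not check that "douglas" and "adams" appear next to each other, so an extra adjacency check may be needed if that matters for you.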
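I'm not sure how to add custom post-processing logic in Solr, but I'm sure it is possible.
You could also create a custom Collector and put this logic there, but execution will be very slow if you have a large number of documents.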