Whoosh search query converts '-' into AND

Question

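I am trying to use whoosh for text search.

When I search for a string that contains a - (for example 'IGF-1R'), it ends up searching for 'IGF' AND '1R', so the string is not treated as a single term.

Any idea why?

Here is the code I am using: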

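from whoosh.qparser import QueryParser
from whoosh.query import FuzzyTerm

class MyFuzzyTerm(FuzzyTerm):
    # FuzzyTerm subclass that changes the defaults to maxdist=1, prefixlength=2.
    def __init__(self, fieldname, text, boost=1.0, maxdist=1, prefixlength=2, constantscore=True):
        super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist, prefixlength, constantscore)

# ix is an existing whoosh index (not shown) whose schema has a "gene" field.
with ix.searcher() as searcher:
    qp = QueryParser("gene", schema=ix.schema, termclass=MyFuzzyTerm)
    q = qp.parse('IGF-1R')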

q returns:

And([MyFuzzyTerm('gene', 'igf', boost=1.000000, maxdist=1, prefixlength=2), MyFuzzyTerm('gene', '1r', boost=1.000000, maxdist=1, prefixlength=2)])

I want it to be:

MyFuzzyTerm('gene', 'igf-1r', boost=1.000000, maxdist=1, prefixlength=2)

Answer 1

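Splitting text into words is the tokenizer's job. I usually use whoosh.analysis.SpaceSeparatedTokenizer(), but in your case the tokenizer is splitting on both spaces and dashes.
So I would bet you are using either whoosh.analysis.CharsetTokenizer(charmap) with space and dash in the charmap, or whoosh.analysis.RegexTokenizer(expression=<_sre.SRE_Pattern object>, gaps=False) with a pattern that does not match dashes.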
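
To illustrate the point about regex-based tokenizing, here is a minimal sketch using only Python's re module. It assumes the field uses whoosh's default RegexTokenizer pattern, which to my knowledge is \w+(\.?\w+)*; the tokenize helper and the HYPHEN_PATTERN name are hypothetical and only mimic the gaps=False behaviour plus lowercasing.

```python
import re

# Assumption: this is the pattern whoosh's default RegexTokenizer uses.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")

# Hypothetical alternative that also treats '-' as part of a token.
HYPHEN_PATTERN = re.compile(r"[\w-]+")


def tokenize(pattern, text):
    """Return the lowercased matches of pattern in text (like gaps=False)."""
    return [m.group(0).lower() for m in pattern.finditer(text)]


print(tokenize(DEFAULT_PATTERN, "IGF-1R"))  # ['igf', '1r']
print(tokenize(HYPHEN_PATTERN, "IGF-1R"))   # ['igf-1r']
```

If this is indeed the cause, one option would be to give the gene field an analyzer built from such a pattern, for example RegexTokenizer(r"[\w-]+") | LowercaseFilter(), and then reindex so that indexed terms and query terms are tokenized the same way.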

