Characters to split the user-query in Vespa engine

Question

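We split the user query on ASCII space to create a weakAnd(...).

The user input "Watch【Docudrama】" contains no spaces, yet it still triggers an error.

Question: which code points, besides space, should be used to split the query?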

YQL (fails):

select * from post where text contains "Watch【Docudrama】" limit 1;

YQL (works):

select * from post where weakAnd(text contains "Watch",text contains "【Docudrama】") limit 1;

Error message:

{
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 0
    },
    "errors": [
      {
        "code": 4,
        "summary": "Invalid query parameter",
        "source": "content",
        "message": "Can not add WORD_ALTERNATIVES text:[ Watch【Docudrama】(1.0) watch(0.7) ] to a segment phrase"
      }
    ]
  }
}

Answer 1

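Are you sure you need WAND for this? Try setting the user query grammar to "any" (the default is "all"), which applies the "OR" operator to the terms the user supplied. There is an example here: https://docs.vespa.ai/documentation/reference/query-language-reference.html#userinput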

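The process of splitting the query is called tokenization. It is complex and language-dependent, and Vespa uses Apache OpenNLP to do this (and more). https://docs.vespa.ai/documentation/linguistics.html has more information, along with references to the code that does it.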
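If you do keep splitting outside Vespa for now, here is a minimal sketch of one naive approach: treat every code point that is not a letter or digit as a separator, instead of only ASCII space. This is an assumption made for illustration, not how Vespa's linguistics module tokenizes, and it will not segment languages that are written without separators. The class and method names (QuerySplitter, split, toWeakAnd) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class QuerySplitter {

    // Naive split: any code point that is not a letter or digit ends the current term.
    // This is an illustrative assumption, not Vespa's tokenizer.
    public static List<String> split(String input) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (Character.isLetterOrDigit(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                terms.add(current.toString());
                current.setLength(0);
            }
        });
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    // Builds the weakAnd(...) clause. Terms hold only letters and digits,
    // so no quote escaping is needed here. Callers should skip empty term lists.
    public static String toWeakAnd(String field, List<String> terms) {
        return terms.stream()
                .map(t -> field + " contains \"" + t + "\"")
                .collect(Collectors.joining(", ", "weakAnd(", ")"));
    }

    public static void main(String[] args) {
        List<String> terms = split("Watch【Docudrama】");
        System.out.println("select * from post where " + toWeakAnd("text", terms) + " limit 1;");
    }
}
```

Note that, unlike the working YQL in the question, this drops the 【 and 】 brackets from the second term rather than keeping them.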

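If you really do want to use WAND, then rather than re-implementing the query parsing logic outside Vespa, I suggest creating a Java searcher that descends the query tree and modifies it by replacing each created AndItem with a WeakAndItem. See https://docs.vespa.ai/documentation/searcher-development.html and the code example here: https://docs.vespa.ai/documentation/advanced-ranking.html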
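To make the tree rewrite concrete, here is a rough, self-contained sketch of the recursion. The node classes below are simplified stand-ins written for this example, not Vespa's own classes, so that it runs without any Vespa dependency. In an actual searcher the same traversal would run over the query tree using Vespa's item classes, as described in the searcher development documentation linked above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class WeakAndRewrite {

    // Simplified stand-ins for query tree nodes (hypothetical, NOT Vespa's classes).
    public static abstract class Item { }

    public static class WordItem extends Item {
        final String word;
        public WordItem(String word) { this.word = word; }
        @Override public String toString() { return word; }
    }

    public static abstract class CompositeItem extends Item {
        public final List<Item> children = new ArrayList<>();
        abstract String name();
        @Override public String toString() {
            return children.stream().map(Item::toString)
                    .collect(Collectors.joining(", ", name() + "(", ")"));
        }
    }

    public static class AndItem extends CompositeItem {
        @Override String name() { return "AND"; }
    }

    public static class WeakAndItem extends CompositeItem {
        @Override String name() { return "WEAKAND"; }
    }

    // Descends the tree depth-first and returns a tree in which every
    // AndItem has been replaced by a WeakAndItem holding the same children.
    public static Item rewrite(Item item) {
        if (!(item instanceof CompositeItem)) {
            return item; // leaf terms are kept as they are
        }
        CompositeItem composite = (CompositeItem) item;
        List<Item> rewritten = new ArrayList<>();
        for (Item child : composite.children) {
            rewritten.add(rewrite(child));
        }
        CompositeItem target = (composite instanceof AndItem) ? new WeakAndItem() : composite;
        target.children.clear();
        target.children.addAll(rewritten);
        return target;
    }

    public static void main(String[] args) {
        AndItem root = new AndItem();
        root.children.add(new WordItem("Watch"));
        root.children.add(new WordItem("Docudrama"));
        System.out.println(rewrite(root)); // prints WEAKAND(Watch, Docudrama)
    }
}
```

Doing the replacement inside a searcher means the terms are still produced by Vespa's own tokenization, so you do not have to decide which code points to split on.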


vespa