How do processed tokens get stored in the base index in Vespa?

Question

When using a search definition like

search music{
    document music{
        field title type string {
            indexing: summary | attribute | index
        }
    }
}

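If I apply my own custom string tokenization logic by developing a document processor (I keep the processed tokens in the context of the Processing), how do I get those tokens stored in the base index? And how are they mapped back to the field's original content when recalling documents for a particular query? Do we solve this through a ProcessingEndPoint? If so, how?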

Answer 1

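First, you should almost certainly remove "attribute" for this field (that is, use "indexing: summary | index"). "attribute" means that the text will be stored in an in-memory forward store in addition to being indexed for search. That can be useful for structured data used in sorting, grouping and ranking, but it is not useful for free-text fields.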

Unnecessary details:

You can do your own document processing by adding document processor components: http://docs.vespa.ai/documentation/docproc-development.html. Token information for indexing is stored as annotations over the text, which are consumed by the indexer: http://docs.vespa.ai/documentation/annotations.html. The code that does this in Vespa (invoked by a document processor) is https://github.com/vespa-engine/vespa/blob/master/indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/linguistics/LinguisticsAnnotator.java, and the annotations it adds, which are consumed during indexing, are https://github.com/vespa-engine/vespa/blob/master/document/src/main/java/com/yahoo/document/annotation/AnnotationTypes.java. You would also need to do the same tokenization on the query side, in a Searcher: http://docs.vespa.ai/documentation/searcher-development.html
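To make the "annotations over the text" idea concrete, here is a minimal plain-Java sketch. It does not use Vespa's annotation classes; TermAnnotation and annotate are hypothetical names invented for illustration, and the split-and-lowercase logic is only a stand-in for custom tokenization. The point it tries to show is that each processed term carries the offset and length of the original text it came from, which is what lets a term be mapped back to the field's original content.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; these types are hypothetical and are not Vespa classes.
final class TokenAnnotationSketch {

    /** A token annotation: the processed term plus the span of original text it covers. */
    static final class TermAnnotation {
        final int from;     // start offset in the original text
        final int length;   // number of characters covered
        final String term;  // processed form that would go into the index

        TermAnnotation(int from, int length, String term) {
            this.from = from;
            this.length = length;
            this.term = term;
        }

        /** Recovers the original substring this annotation was created from. */
        String original(String text) {
            return text.substring(from, from + length);
        }
    }

    /** Splits on non-letter/digit characters and lowercases each token (stand-in for custom logic). */
    static List<TermAnnotation> annotate(String text) {
        List<TermAnnotation> annotations = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length() && Character.isLetterOrDigit(text.charAt(i));
            if (inToken && start < 0) {
                start = i;
            } else if (!inToken && start >= 0) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                annotations.add(new TermAnnotation(start, i - start, term));
                start = -1;
            }
        }
        return annotations;
    }

    public static void main(String[] args) {
        String title = "Kind Of Blue";
        for (TermAnnotation a : annotate(title)) {
            System.out.println(a.term + " <- \"" + a.original(title) + "\" at offset " + a.from);
        }
    }
}
```

In Vespa itself the equivalent information lives in the span tree and annotation types linked above, so this sketch is only a mental model, not a replacement for those classes.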

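However, there is a simpler way to do this: you can plug in your own tokenizer as described here: http://docs.vespa.ai/documentation/linguistics.html. Create your own component that subclasses SimpleLinguistics and override getTokenizer to return your implementation. Vespa will then execute it as needed on both the document processing side and the query side.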
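Why the tokenizer has to run on both sides can be seen in a small plain-Java sketch. Nothing here is Vespa API: ToyIndex and customTokenize are hypothetical names, and the "lowercase and strip a trailing s" rule is an arbitrary stand-in for custom logic. The sketch assumes a toy inverted index that applies a single tokenizer function to fed documents and to queries, which is roughly the role a pluggable linguistics component plays.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Plain-Java illustration only; nothing here is Vespa API.
final class SharedTokenizerSketch {

    /** Hypothetical custom tokenization: split on whitespace, lowercase, strip one trailing "s". */
    static List<String> customTokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) continue; // blank input yields a single empty string
            String term = raw.toLowerCase(Locale.ROOT);
            if (term.length() > 1 && term.endsWith("s")) {
                term = term.substring(0, term.length() - 1);
            }
            terms.add(term);
        }
        return terms;
    }

    /** Toy inverted index that applies one tokenizer to both documents and queries. */
    static final class ToyIndex {
        private final Function<String, List<String>> tokenizer;
        private final Map<String, Set<String>> postings = new HashMap<>();

        ToyIndex(Function<String, List<String>> tokenizer) {
            this.tokenizer = tokenizer;
        }

        /** Indexing side: store the document id under every processed term. */
        void feed(String docId, String text) {
            for (String term : tokenizer.apply(text)) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }

        /** Query side: returns ids of documents that contain every processed query term. */
        Set<String> search(String query) {
            Set<String> result = null;
            for (String term : tokenizer.apply(query)) {
                Set<String> docs = postings.getOrDefault(term, new TreeSet<>());
                if (result == null) {
                    result = new TreeSet<>(docs);
                } else {
                    result.retainAll(docs);
                }
            }
            return result == null ? new TreeSet<>() : result;
        }
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(SharedTokenizerSketch::customTokenize);
        index.feed("doc1", "Blue Trains");
        System.out.println(index.search("train BLUE"));
    }
}
```

If the query were tokenized differently from the documents (for example, without stripping the trailing "s"), the query term would not match the stored term, which is why a single component used on both sides is the simpler route.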

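The usual reason for doing this is to provide linguistics for languages other than English. If you do this, please consider contributing your linguistics code back to Vespa.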


vespa