Elasticsearch: How to store term vectors

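I'm working on a project in which I use Elasticsearch heavily and rely on the moreLikeThis query for some features. The official documentation for the MLT query states the following: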

In order to speed up analysis, it could help to store term vectors at index time, but at the expense of disk usage.

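in the "How it Works" section. The idea now is to adjust the mapping so that precomputed term vectors get stored. The problem is that the documentation does not make it clear how this should be done. On the one hand, the MLT documentation provides an example mapping like this: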

curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields" : {
            "raw": {
              "type" : "string",
              "index" : "not_analyzed",
              "term_vector" : "yes"
            }
          }
        }
      }
    }
  }
}'

On the other hand, the Term Vectors documentation provides a mapping in its Example 1 section that looks like this:

curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer" : "fulltext_analyzer"
        }
      }
    }
    ....

This is supposed to "create an index that stores term vectors, payloads etc."

So the question is: which mapping should be used? Is this a flaw in the documentation, or am I missing something?

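You're right, the current version of the documentation doesn't seem to state this explicitly, but the documentation for the upcoming 2.0 release explains it in more detail: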

Term vectors contain information about the terms produced by the analysis process, including:

  • a list of terms.
  • the position (or order) of each term.
  • the start and end character offsets mapping the term to its origin in the original string.

These term vectors can be stored so that they can be retrieved for a particular document.

The term_vector setting accepts:

  • no: No term vectors are stored. (default)
  • yes: Just the terms in the field are stored
  • with_positions: Terms and positions are stored
  • with_offsets: Terms and character offsets are stored
  • with_positions_offsets: Terms, positions, and character offsets are stored
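Going by the list above, the two mappings from the question seem to differ only in how much information is stored per term, so either should be valid. A minimal sketch of switching between the options follows. The index name `imdb_tv` and the `TERM_VECTOR` and `ES_URL` variables are assumptions made for illustration and are not taken from the documentation. The mapping syntax follows the examples quoted in the question (`string` fields under a named type).

```shell
#!/bin/sh
# Sketch only: "imdb_tv", TERM_VECTOR and ES_URL are hypothetical names.

# Choose one of the values from the list above, e.g. "yes" to store
# just the terms, or "with_positions_offsets" to store everything listed.
TERM_VECTOR="${TERM_VECTOR:-yes}"

# Build the request body; only the term_vector value changes.
MAPPING=$(cat <<EOF
{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "${TERM_VECTOR}"
        }
      }
    }
  }
}
EOF
)

if [ -n "${ES_URL:-}" ]; then
  # A node address was given: create the index with this mapping.
  curl -s -XPUT "${ES_URL}/imdb_tv/" -d "${MAPPING}"
else
  # No node configured: print the body that would be sent.
  printf '%s\n' "${MAPPING}"
fi
```

Running it with, for example, `TERM_VECTOR=with_positions_offsets ES_URL=http://localhost:9200` would send the richer variant. Storing more per term costs more disk space, which is the trade-off the MLT documentation mentions.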