Terms get truncated after indexing document (Elasticsearch)

I'm new to Elasticsearch, and all I've done is index some documents. When I then retrieve the term vectors, I notice that many terms appear truncated. Here is a small example:

        "nationallypublic": {
           "term_freq": 1,
           "tokens": [
              {
                 "position": 496,
                 "start_offset": 3126,
                 "end_offset": 3146
              }
           ]
        },
        "natur": {
           "term_freq": 1,
           "tokens": [
              {
                 "position": 60,
                 "start_offset": 373,
                 "end_offset": 380
              }
           ]
        },

Here are some excerpts from the documents. This one contains "natural":

are some of the filmmakers ofthe 80s Its natural said Robert Friedman the senior vicepresident of worldwide advertising and publicity at Warner Bros

And this one contains "nationallypublicized" (I know that's a malformed word, but even so it should be indexed in full):

They were reported missing on June 21 several hours after beingstopped for speeding near Philadelphia Miss After a nationallypublicized search their bodies were discovered Aug 4 on a farmjust outside the town

Am I doing something wrong? Here are my settings and mappings:

{
   "ap1": {
      "mappings": {
         "document": {
            "properties": {
               "docno": {
                  "type": "string",
                  "index": "not_analyzed",
                  "store": true
               },
               "text": {
                  "type": "string",
                  "store": true,
                  "term_vector": "with_positions_offsets_payloads",
                  "analyzer": "my_english"
               }
            }
         }
      },
      "settings": {
         "index": {
            "creation_date": "1422144472984",
            "uuid": "QzT_sx4aRWOXGlEs2ATibw",
            "analysis": {
               "analyzer": {
                  "my_english": {
                     "type": "english",
                     "stopwords": "_none_"
                  }
               }
            },
            "store": {
               "type": "default"
            },
            "number_of_replicas": "0",
            "number_of_shards": "1",
            "version": {
               "created": "1040299"
            }
         }
      }
   }
}

This is the stemmer at work. By default, the snowball stemmer is used as part of the analyzer. The expected behavior of a stemmer is to reduce words to their base form, like this:

Jumping => jump
Running => run

and so on. The snowball stemmer uses an algorithm to reduce words to their base form. This means the transformation may not be entirely accurate, since it converts tokens into strings that approximate the base form but are not necessarily real words. So, effectively, something like the following happens at both index time and search time:

jumping => jmp
jump    => jmp
jumped  => jmp

Because the same transformation is applied on both sides, stemmed search still matches successfully, even though the stems are inaccurate in some edge cases.
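To make the mechanism concrete, here is a toy sketch of rule-based suffix stripping. This is NOT the real Snowball algorithm, and the suffix list is invented for illustration; it just shows how stripping suffixes by rule can yield non-dictionary stems like the ones in the question.

```python
# Toy illustration (NOT the real Snowball algorithm): algorithmic stemmers
# strip suffixes by rule, so the output need not be a dictionary word.
SUFFIXES = ["ization", "ational", "izing", "ation", "ingly",
            "ness", "ized", "ally", "ing", "ed", "ly", "al", "s"]

def toy_stem(word: str) -> str:
    """Strip the longest matching suffix, leaving at least 3 characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(toy_stem("natural"))               # -> natur
print(toy_stem("nationallypublicized"))  # -> nationallypublic
```

The two outputs happen to match the "truncated" terms from the question's term vectors: they are stems, not cut-off words.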

The token transformation you are seeing is therefore not truncation; it is the conversion the snowball algorithm performs for stemming.
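You can verify this yourself with the analyze API (this request is not in the original answer; it assumes the 1.x-style query-string syntax, matching the index version in the question, and a node on localhost:9200):

```shell
curl -XGET 'localhost:9200/ap1/_analyze?analyzer=my_english&pretty' \
     -d 'After a nationallypublicized search'
```

The response should list "nationallypublic" among the tokens, showing that the term is produced by the analyzer itself rather than being cut off at index time.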

If you need accurate tokens here, a good option is the dictionary-based hunspell stemmer, though it will slow down analysis and search.
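A minimal sketch of such an analyzer, assuming the en_US dictionary files (.aff/.dic) are installed under the node's config/hunspell/en_US directory; the filter and analyzer names here are illustrative, not from the original post:

```json
{
   "settings": {
      "analysis": {
         "filter": {
            "en_hunspell": {
               "type": "hunspell",
               "locale": "en_US"
            }
         },
         "analyzer": {
            "my_english": {
               "tokenizer": "standard",
               "filter": ["lowercase", "en_hunspell"]
            }
         }
      }
   }
}
```

Since analyzers cannot be changed on a live mapping, you would create a new index with these settings and reindex your documents into it.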