N-Grams with frequency number using Elasticsearch

Question

I created n-grams in Elasticsearch using the n-gram tokenizer, but I am unable to retrieve the frequency of each gram, whether bi-gram or tri-gram. How can I do this?

Answer 1

Here is some code I used in another SO answer that uses term vectors:

http://sense.qbox.io/gist/3092992993e0328f7c4ee80e768dd508a0bc053f

As a quick example, if I set up a simple index designed for autocomplete as follows:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "autocomplete": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "standard",
                  "stop",
                  "kstem",
                  "edgengram_filter"
               ]
            }
         },
         "filter": {
            "edgengram_filter": {
               "type": "edgeNGram",
               "min_gram": 2,
               "max_gram": 15
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "content": {
               "type": "string",
               "index_analyzer": "autocomplete",
               "search_analyzer": "standard",
               "term_vector": "yes"
            }
         }
      }
   }
}

and then add a couple of simple documents:

POST test_index/doc/_bulk
{"index":{"_id":1}}
{"content":"hello world"}
{"index":{"_id":2}}
{"content":"goodbye world"}

I can view the term frequencies for a single document like this:

GET /test_index/doc/1/_termvector

which returns:

{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "took": 1,
   "term_vectors": {
      "content": {
         "field_statistics": {
            "sum_doc_freq": 8,
            "doc_count": 1,
            "sum_ttf": 8
         },
         "terms": {
            "he": {
               "term_freq": 1
            },
            "hel": {
               "term_freq": 1
            },
            "hell": {
               "term_freq": 1
            },
            "hello": {
               "term_freq": 1
            },
            "wo": {
               "term_freq": 1
            },
            "wor": {
               "term_freq": 1
            },
            "worl": {
               "term_freq": 1
            },
            "world": {
               "term_freq": 1
            }
         }
      }
   }
}
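To see where those eight terms come from, here is a minimal Python sketch that mimics the edge n-gram step locally. It is only an approximation of the analyzer above: it splits on whitespace instead of using the standard tokenizer, and it skips the stop and kstem filters entirely (neither changes "hello world"). The function names `edge_ngrams` and `term_frequencies` are made up for this illustration and are not part of Elasticsearch.

```python
from collections import Counter


def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def term_frequencies(text, min_gram=2, max_gram=15):
    """Count edge n-grams over whitespace-split tokens.

    Assumption: whitespace splitting stands in for the standard
    tokenizer, and the stop and kstem filters are not applied.
    """
    counts = Counter()
    for token in text.split():
        counts.update(edge_ngrams(token, min_gram, max_gram))
    return counts


if __name__ == "__main__":
    freqs = term_frequencies("hello world")
    for term, freq in sorted(freqs.items()):
        print(term, freq)
    # Total number of term occurrences, comparable to sum_ttf above.
    print("total:", sum(freqs.values()))
```

For "hello world" this yields the same eight terms as the term vector response, each with a frequency of 1, which matches the `sum_ttf` of 8.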

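Be careful about using term vectors in production, though, since they do add some overhead (stored term vectors make the index larger). They are very useful for testing, however.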

Edit: If you are looking for term frequencies across the whole index, just use a terms aggregation.
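As a rough sketch of what that could look like, the Python helpers below build a search body containing a single terms aggregation and read the buckets back out of a response. The aggregation name `term_counts`, the helper names, and the sample response are all made up for illustration; the sample is hand-written, not real output from the index above. Note that a bucket's `doc_count` is the number of documents containing the term, not the total number of occurrences.

```python
import json


def terms_agg_body(field, size=10, agg_name="term_counts"):
    """Build a search body with a single terms aggregation on `field`.

    `agg_name` is an arbitrary label chosen for this sketch.
    """
    return {
        "size": 0,  # skip the hits, only the aggregation buckets matter
        "aggs": {agg_name: {"terms": {"field": field, "size": size}}},
    }


def bucket_counts(response, agg_name="term_counts"):
    """Map each bucket key to its doc_count in a search response."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


if __name__ == "__main__":
    # Body to send with: POST /test_index/_search
    print(json.dumps(terms_agg_body("content", size=20), indent=2))

    # Hand-written, trimmed response used only to exercise the parser.
    sample_response = {
        "aggregations": {
            "term_counts": {
                "buckets": [
                    {"key": "world", "doc_count": 2},
                    {"key": "hello", "doc_count": 1},
                ]
            }
        }
    }
    print(bucket_counts(sample_response))
```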

Answer 2

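It's not clear from your question exactly what you are trying to do. It's usually best to post the code you have tried and to describe your problem as specifically as possible.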

Anyway, I think this code will get you close to what you want:

http://sense.qbox.io/gist/f357f15360719299ac556e8082afe26e4e0647d1

I started with the code from this answer, then refined it some using the information in the docs for shingle token filters. Here is the mapping I ended up with:

PUT /test_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "evolutionAnalyzer": {
               "tokenizer": "standard",
               "filter": [
                  "standard",
                  "lowercase",
                  "custom_shingle"
               ]
            }
         },
         "filter": {
            "custom_shingle": {
               "type": "shingle",
               "min_shingle_size": "2",
               "max_shingle_size": "3",
               "filler_token": "",
               "output_unigrams": true
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "content": {
               "type": "string",
               "index_analyzer": "evolutionAnalyzer",
               "search_analyzer": "standard",
               "term_vector": "yes"
            }
         }
      }
   }
}
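To get a feel for what this shingle filter emits, here is a small Python sketch that builds word-level shingles of size 2 to 3, keeps the unigrams, and counts each one. It is a simplification: whitespace splitting stands in for the standard tokenizer, the order of the output differs from what Elasticsearch produces, and `shingles` and `shingle_frequencies` are hypothetical helper names, not Elasticsearch APIs.

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=3, output_unigrams=True):
    """Return word n-grams (shingles) of min_size..max_size tokens.

    With output_unigrams=True the single tokens are included as well,
    mirroring the "output_unigrams": true setting in the mapping.
    """
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


def shingle_frequencies(text, **kwargs):
    """Lowercase, split on whitespace, and count every shingle."""
    tokens = text.lower().split()
    return Counter(shingles(tokens, **kwargs))


if __name__ == "__main__":
    freqs = shingle_frequencies("Hello big world hello big")
    for term, freq in freqs.most_common():
        print(repr(term), freq)
```

In this example the bi-gram "hello big" is counted twice, while each tri-gram appears once, which is the kind of per-gram frequency the term vectors report for a single document.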

Again, be careful about using term vectors in production.

