如何使用elastic4s为索引写mapping/settings?

How to write mapping/settings for an index using elastic4s?

PUT /new_index/
{
    "settings": {
        "index": {
            "type": "default"
        },
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "analysis": {
            "filter": {
                "ap_stop": {
                    "type": "stop",
                    "stopwords_path": "stoplist.txt"
                },
                "shingle_filter" : {
                    "type" : "shingle",
                    "min_shingle_size" : 2,
                    "max_shingle_size" : 5,
                    "output_unigrams": true
                }
            },
        "analyzer": {
             "aplyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["standard",
                           "ap_stop",
                           "lowercase",
                           "shingle_filter",
                           "snowball"]
                }
            }
        }
    }
}

PUT /new_index/document/_mapping/
{
    "document": {
        "properties": {
            "text": {
                "type": "string",
                "store": true,
                "index": "analyzed",
                "term_vector": "with_positions_offsets_payloads",
                "search_analyzer": "aplyzer",
                "index_analyzer": "aplyzer"
            },
            "original_text": {
                "include_in_all": false,
                "type": "string",
                "store": false,
                "index": "not_analyzed"
            },
            "docid": {
                "include_in_all": false,
                "type": "string",
                "store": true,
                "index": "not_analyzed"  
            }
        }
    }
}

我需要将上面的索引 settings 和 mappings 转换为 elastic4s 接受的类型。我正在使用最新的 elastic4s 和 elasticsearch 1.5.2。

我查看了文档中给出的一些示例,但我无法弄清楚如何去做,就像我尝试以这种方式创建它一样:

client.execute {
    create index "new_index" mappings {
      "documents" as (
        "text" typed StringType analyzer ...
        )
    }
  }

我不知道如何使用 PUT 请求中给出的 store、index、term_vectors 等。

更新: 根据答案,我能够做出这样的事情:

create index "new_index" shards 5 replicas 1 refreshInterval "90s"  mappings {
    "documents" as(
      id typed StringType analyzer KeywordAnalyzer store true includeInAll false,
      "docid" typed StringType index "not_analyzed" store true includeInAll false,
      "original_text" typed StringType index "not_analyzed" includeInAll false,
      "text" typed StringType analyzer CustomAnalyzer("aplyzer") indexAnalyzer "aplyzer" searchAnalyzer "aplyzer" store true termVector WithPositionsOffsetsPayloads
      )
  } analysis (
    CustomAnalyzerDefinition(
      "aplyzer",
      StandardTokenizer,
      LowercaseTokenFilter,
      shingle tokenfilter "shingle_filter" minShingleSize 2 maxShingleSize 5 outputUnigrams true
    )
  )

我现在想不通的是如何将雪球词干分析器和停用词文件路径添加到 aplyzer 分析器?

我该怎么办?

您的标题询问的是自定义过滤器,但您的问题正文询问的是 store、index、term_vectors。我会解释后者。

  client.execute {
    create index "myindex" mappings {
      "mytype" as (
        "myfield" typed StringType store true termVector termVector.WithOffsets index "not_analyzed"
        )
    }
  }

更新:

基于您更新的问题:elasticsearch 文档并未明确说明能否在 snowball 令牌过滤器(token filter)上设置停用词,但可以在 snowball 分析器(analyzer)上设置。

所以,要么

SnowballAnalyzerDefinition("mysnowball", "English", stopwords = Set("I", "he", "the"))

或

CustomAnalyzerDefinition("mysnowball",
  StandardTokenizer,
  LowercaseTokenFilter,
  snowball tokenfilter "snowball1" language "German"
)

根据@monkjack 的建议以及我从 elastic4s 的文档中阅读的内容,我最终得出以下答案,说明索引 settings 和 mappings 在与 elastic4s 一起使用时应该是什么样子。建议浏览一下作者为该 API 编写的测试(tests)。

create index "new_index" shards 5 replicas 1 refreshInterval "90s" mappings {
    "documents" as(
      id
        typed StringType
        analyzer KeywordAnalyzer
        store true
        includeInAll false,
      "docid"
        typed StringType
        index "not_analyzed"
        store true
        includeInAll false,
      "original_text"
        typed StringType
        index "not_analyzed"
        includeInAll false,
      "text"
        typed StringType
        analyzer CustomAnalyzer("aplyzer")
        indexAnalyzer "aplyzer"
        searchAnalyzer "aplyzer"
        store true
        termVector WithPositionsOffsetsPayloads
      )
  } analysis (
    CustomAnalyzerDefinition(
      "aplyzer",
      StandardTokenizer,
      LowercaseTokenFilter,
      NamedStopTokenFilter("ap_stop", "_english_", true, true),
      shingle
        tokenfilter "shingle_filter"
        minShingleSize 2
        maxShingleSize 5
        outputUnigrams true
        outputUnigramsIfNoShingles true,
      snowball
        tokenfilter "ap_snowball"
        lang "English"
    )
  )

如果您想提供自己的停用词列表,请使用 StopTokenFilter("ap_stop", stopwords = Set("a", "an", "the")) 代替 NamedStopTokenFilter

当我在 Sense 中 运行 GET new_index 时,我得到以下 setting/mapping。

{
   "new_index": {
      "aliases": {},
      "mappings": {
         "documents": {
            "properties": {
               "docid": {
                  "type": "string",
                  "index": "not_analyzed",
                  "store": true,
                  "include_in_all": false
               },
               "original_text": {
                  "type": "string",
                  "index": "not_analyzed",
                  "include_in_all": false
               },
               "text": {
                  "type": "string",
                  "store": true,
                  "term_vector": "with_positions_offsets_payloads",
                  "analyzer": "aplyzer"
               }
            }
         }
      },
      "settings": {
         "index": {
            "creation_date": "1433383476240",
            "uuid": "6PmqlY6FRPanGtVSsGy3Jw",
            "analysis": {
               "analyzer": {
                  "aplyzer": {
                     "type": "custom",
                     "filter": [
                        "lowercase",
                        "ap_stop",
                        "shingle_filter",
                        "ap_snowball"
                     ],
                     "tokenizer": "standard"
                  }
               },
               "filter": {
                  "ap_stop": {
                     "enable_position_increments": "true",
                     "ignore_case": "true",
                     "type": "stop",
                     "stopwords": "_english_"
                  },
                  "shingle_filter": {
                     "output_unigrams_if_no_shingles": "true",
                     "token_separator": " ",
                     "max_shingle_size": "5",
                     "type": "shingle",
                     "min_shingle_size": "2",
                     "filler_token": "_",
                     "output_unigrams": "true"
                  },
                  "ap_snowball": {
                     "type": "snowball",
                     "language": "English"
                  }
               }
            },
            "number_of_replicas": "1",
            "number_of_shards": "5",
            "refresh_interval": "90s",
            "version": {
               "created": "1050299"
            }
         }
      },
      "warmers": {}
   }
}

如果您希望 StopWords 和 Stemmers 作为单独的分析器,正如@monkjack 建议的那样,只需添加 SnowballAnalyzerDefinition 和 StopAnalyzerDefinition,例如:

....outputUnigramsIfNoShingles true,
    ),
    SnowballAnalyzerDefinition("ap_snowball", "English"),
    StopAnalyzerDefinition("ap_stop", stopwords = Set("a", "an", "the"))
  )