Difference in handling possessives (apostrophes) with the english stemmer between 1.2 and 1.4

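We have two Elasticsearch instances, one running 1.2.1 and one running 1.4. The indices on both instances have identical settings and mappings, but the results differ.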

Settings for the default analyzer:

....
analysis: {
 filter: {
  ourEnglishStopWords: {
   type: "stop",
   stopwords: "_english_"
  },
  ourEnglishFilter: {
   type: "stemmer",
   name: "english"
  }
 },
 analyzer: {
  default: {
   filter: [
    "asciifolding",
    "lowercase",
    "ourEnglishStopWords",
    "ourEnglishFilter"
   ],
   tokenizer: "standard"
  }
 }
},
...

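The difference between the Elasticsearch versions shows up when indexing and searching possessive forms. In 1.2.1, "player", "players" and "player's" return the same results. In 1.4, the first two ("player" and "players") return the same result set, but "player's" does not match that set.

Is this a known difference? What is the right way to get the same behavior in 1.4 and later?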
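One way to narrow this down is to ask each cluster which tokens the default analyzer emits for the three terms, using the `_analyze` endpoint. The sketch below only builds the request URLs; the host names and the index name `myindex` are placeholders, not values from the setup above.

```python
from urllib.parse import urlencode

def analyze_url(host, index, text, analyzer="default"):
    """Build an _analyze URL that runs `text` through `analyzer` on `index`."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"http://{host}/{index}/_analyze?{query}"

# Hypothetical hosts and index name, for illustration only.
HOSTS = {"1.2.1": "es12.example:9200", "1.4": "es14.example:9200"}

for version, host in HOSTS.items():
    for term in ("player", "players", "player's"):
        print(version, analyze_url(host, "myindex", term))
```

Fetching each URL (with curl or a browser) and comparing the returned tokens should show whether "player's" ends up as a different term on the 1.4 cluster.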

I think this is the change, introduced in 1.3.0:

The StemmerTokenFilter had a number of issues:

  1. english returned the slow snowball English stemmer
  2. porter2 returned the snowball Porter stemmer (v1)

Changes:

  1. english now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
  2. porter2 now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
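The reason this change affects possessives is that the Snowball English (Porter2) algorithm begins with a "step 0" that removes a trailing apostrophe or `'s`, while the original Porter algorithm has no such step. A minimal sketch of that one step, not the full stemmer, with a function name chosen only for illustration:

```python
def porter2_step0(word):
    """Step 0 of the Snowball English (Porter2) algorithm only:
    remove the longest of the suffixes 's', 's or ' if present."""
    for suffix in ("'s'", "'s", "'"):  # checked longest first
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

for token in ("player", "players", "player's"):
    print(token, "->", porter2_step0(token))
```

With this step in place, "player's" is reduced to "player" before the rest of the stemming runs, which is consistent with the 1.2.1 behavior described in the question.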

According to the GitHub issue, you can change the filter definition to:

    "ourEnglishFilter": {
      "type": "stemmer",
      "name": "porter2"
    }

Or try another approach, adding a possessive_english stemmer ahead of the english stemmer:

 "filter": {
    "ourEnglishStopWords": {
      "type": "stop",
      "stopwords": "_english_"
    },
    "ourEnglishFilter": {
      "type": "stemmer",
      "name": "english"
    },
    "possesiveEnglish": {
      "type": "stemmer",
      "name": "possessive_english"
    }
  },
  "analyzer": {
    "default": {
      "filter": [
        "asciifolding",
        "lowercase",
        "ourEnglishStopWords",
        "possesiveEnglish",
        "ourEnglishFilter"
      ],
      "tokenizer": "standard"
    }
  }
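
To illustrate why the possessive filter needs to sit before the stemmer in the chain, here is a rough Python sketch of the pipeline. It is not Lucene's implementation: `toy_plural_stemmer` is a hypothetical stand-in for the english stemmer, whitespace splitting stands in for the standard tokenizer, and the stop word and asciifolding filters are left out.

```python
def lowercase(tokens):
    return [t.lower() for t in tokens]

def strip_possessive(tokens):
    """Drop a trailing 's, roughly what a possessive stemmer does."""
    return [t[:-2] if t.endswith("'s") else t for t in tokens]

def toy_plural_stemmer(tokens):
    """Hypothetical stand-in for the english stemmer: drops one trailing 's'."""
    return [t[:-1] if t.endswith("s") and not t.endswith("ss") else t
            for t in tokens]

def analyze(text, filters):
    tokens = text.split()  # crude stand-in for the standard tokenizer
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

with_possessive = [lowercase, strip_possessive, toy_plural_stemmer]
without_possessive = [lowercase, toy_plural_stemmer]

for term in ("player", "players", "player's"):
    print(term, analyze(term, with_possessive), analyze(term, without_possessive))
```

In this sketch all three terms collapse to the same token only when the possessive step runs first; without it, "player's" keeps its apostrophe and no longer matches the other two.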