Difference in handling possessives (apostrophes) with the english stemmer between 1.2 and 1.4
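We have two Elasticsearch instances, one running 1.2.1 and one running 1.4. The indices on both instances have identical settings and mappings, yet the results differ.
Settings for the default analyzer: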
....
analysis: {
filter: {
ourEnglishStopWords: {
type: "stop",
stopwords: "_english_"
},
ourEnglishFilter: {
type: "stemmer",
name: "english"
}
},
analyzer: {
default: {
filter: [
"asciifolding",
"lowercase",
"ourEnglishStopWords",
"ourEnglishFilter"
],
tokenizer: "standard"
}
}
},
...
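The difference between the Elasticsearch versions shows up when indexing/searching possessive forms.
In 1.2.1, "player", "players" and "player's" return the same results. In 1.4,
the first two ("player" and "players") share a result set, but "player's" does not match that set.
Is this a known difference? What is the correct way to get the same behavior in 1.4 and later?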
I believe this is the change responsible, introduced in 1.3.0:
The StemmerTokenFilter had a number of issues:
- english returned the slow snowball English stemmer
- porter2 returned the snowball Porter stemmer (v1)
Changes:
- english now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
- porter2 now returns the snowball English stemmer (for indices created from v1.3.0 onwards)
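The practical consequence for possessives can be illustrated with a small Python sketch. The Snowball English (Porter2) algorithm starts with a step that removes a trailing `'`, `'s` or `'s'`, whereas the original Porter algorithm defines no apostrophe handling, so the apostrophe is likely to survive stemming and "player's" no longer ends up with the same stem as "player". The function name `strip_possessive_suffix` is hypothetical, and the code only mimics that first step; it is not the Lucene or Snowball implementation.

```python
def strip_possessive_suffix(token: str) -> str:
    """Mimic the first step of the Snowball English (Porter2) algorithm."""
    # Try the longest suffix first, since the algorithm removes the longest match.
    for suffix in ("'s'", "'s", "'"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token


# Only the possessive step is shown, so plural "players" is left unchanged here.
for word in ("player", "players", "player's"):
    print(word, "->", strip_possessive_suffix(word))
```

In a full stemmer the later steps would then reduce "players" to "player" as well, which is why all three forms matched in 1.2.1.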
According to the GitHub issue, you can change the filter definition to:
"ourEnglishFilter": {
"type": "stemmer",
"name": "porter2"
}
Or try a different approach and add a possessive stemmer ahead of the english stemmer:
"filter": {
"ourEnglishStopWords": {
"type": "stop",
"stopwords": "_english_"
},
"ourEnglishFilter": {
"type": "stemmer",
"name": "english"
},
"possesiveEnglish": {
"type": "stemmer",
"name": "possessive_english"
}
},
"analyzer": {
"default": {
"filter": [
"asciifolding",
"lowercase",
"ourEnglishStopWords",
"possesiveEnglish",
"ourEnglishFilter"
],
"tokenizer": "standard"
}
}
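One way to confirm that either change takes effect is to run the three words through the `_analyze` API on each cluster and compare the tokens. The sketch below only builds the request URLs. The host `localhost:9200` and the index name `my_index` are placeholders, the helper `analyze_url` is hypothetical, and it assumes the 1.x form of the API that accepts `analyzer` and `text` as query-string parameters.

```python
from urllib.parse import quote, urlencode


def analyze_url(host: str, index: str, text: str, analyzer: str = "default") -> str:
    """Build an _analyze URL in the query-string form assumed for 1.x."""
    # urlencode percent-encodes the apostrophe so it survives in the URL.
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"http://{host}/{quote(index)}/_analyze?{query}"


# Placeholder host and index name; substitute your own.
for word in ("player", "players", "player's"):
    print(analyze_url("localhost:9200", "my_index", word))
```

Fetching each URL against both clusters (for example with curl) should show whether "player's" yields the same token as "player". Since the quoted change applies to indices created from v1.3.0 onwards, the comparison is only meaningful on an index created on the newer version.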