How to decide which encoder to use for which language in the Elasticsearch "Phonetic Token Filter"?

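I am using the Metaphone and Soundex encoders in the Elasticsearch "Phonetic Token Filter".

Metaphone works well for English words.

Soundex works for English as well as Hindi, and perhaps for many other languages too.


However, the Elasticsearch website does not say which encoder should be chosen for which language. How do I decide?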




  1. Metaphone, Double Metaphone, and Metaphone 3: suitable for use with most English words, not just names. Metaphone algorithms are the basis for many popular spell checkers. Double Metaphone is the second generation of the algorithm.
  2. Soundex: developed to encode surnames for use in censuses. Soundex codes are four-character strings composed of a single letter followed by three digits.
  3. Daitch–Mokotoff Soundex: a refinement of Soundex designed to better match surnames of Slavic and Germanic origin. Daitch–Mokotoff Soundex codes are strings composed of six numeric digits.
  4. Cologne phonetics: similar to Soundex, but more suitable for German words.
  5. New York State Identification and Intelligence System (NYSIIS): maps similar phonemes to the same letter. The result is a string that can be pronounced by the reader without decoding.
  6. Match Rating Approach: developed by Western Airlines in 1977. This algorithm has both an encoding technique and a range comparison technique.
  7. Caverphone: created to assist in data matching between late 19th century and early 20th century electoral rolls, and optimized for accents present in parts of New Zealand.
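To make the "single letter followed by three digits" shape of a Soundex code concrete, here is a minimal sketch of classic American Soundex in Python. It is written for illustration only and is not the code the Elasticsearch plugin runs; the function name `soundex` and the `_CODES` table are my own, and it only handles the letters A to Z.

```python
# Illustrative sketch of classic (American) Soundex; not the plugin's code.
# Consonant groups that share a digit.
_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return a 4-character code: first letter plus three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    first = letters[0]
    result = [first]
    prev = _CODES.get(first, "")  # digit of the previous letter, "" if none

    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's digit.
        if code and code != prev:
            result.append(code)
        # H and W do not separate equal digits; vowels do (they reset prev).
        if ch not in "HW":
            prev = code
        if len(result) == 4:
            break

    # Pad short codes with zeros.
    return "".join(result).ljust(4, "0")


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
        print(name, soundex(name))
```

Names that sound alike, such as "Robert" and "Rupert", end up with the same code, which is what lets a phonetic filter match them at search time. Note that the table only covers Latin letters, so text in other scripts would need transliteration or a script-specific variant first.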

Reference: details of the above algorithms and their variants can be found on the following Wikipedia page: 1. https://en.wikipedia.org/wiki/Phonetic_algorithm

Of the above, Soundex is the best fit for Indian languages. You can check the resources below for more on this: 1. 2. https://thottingal.in/blog/2009/07/26/indicsoundex/
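Putting the suggestions together, one way to wire a per-language choice into index settings might look like the sketch below. The language-to-encoder table simply restates the recommendations in this answer and is an assumption, not an official Elasticsearch mapping; the names `ENCODER_BY_LANGUAGE`, `phonetic_settings`, `my_phonetic` and `my_phonetic_analyzer` are hypothetical. The `encoder` values themselves (`metaphone`, `double_metaphone`, `soundex`, `cologne`) are ones the analysis-phonetic plugin accepts.

```python
import json

# Assumed mapping, derived from the recommendations above (not official).
ENCODER_BY_LANGUAGE = {
    "en": "double_metaphone",  # general English words
    "de": "cologne",           # German words
    "hi": "soundex",           # as suggested above for Indian languages
}


def phonetic_settings(language: str, default: str = "metaphone") -> dict:
    """Build index settings with a phonetic filter for the given language."""
    encoder = ENCODER_BY_LANGUAGE.get(language, default)
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my_phonetic": {
                        "type": "phonetic",
                        "encoder": encoder,
                        # Keep the original token alongside its phonetic code.
                        "replace": False,
                    }
                },
                "analyzer": {
                    "my_phonetic_analyzer": {
                        "tokenizer": "standard",
                        "filter": ["lowercase", "my_phonetic"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON can be used as the body of an index-creation request.
    print(json.dumps(phonetic_settings("de"), indent=2))
```

One caveat: as far as I know, the plugin's `soundex` encoder is the classic Latin-letter Soundex, not the IndicSoundex variant described in the linked post, so Hindi text in Devanagari would probably need to be transliterated before this filter is useful.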