ElasticSearch analyzer: is there a way to remove repeated occurrences of a word at index time?

Question

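I'm using ElasticSearch with custom index and search analyzers. I'm querying user data, and sometimes the same word occurs more than once in a string.

Example: "Hello World Hello Mr !", where "Hello" appears twice.

If I search for "Hello World", the document "Hello World Hello Mr !" gets a better score than the document "Hello World". I don't want this behavior, even though it is logical.

So, is it possible to remove repeated words at index time? Example: "Hello World Hello Mr !" => "Hello World Mr !"

My current mapping and settings: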

  settings index: { number_of_shards: 1, number_of_replicas: 1 }, analysis: {
    analyzer: {
      custom_analyzer: {
        tokenizer: "custom_tokenizer",
        filter: ["lowercase", "asciifolding", "custom_spliter"]
      }
    },
    filter: {
      custom_spliter: {
        type: "word_delimiter",
        preserve_original: "true"
      }
    },
    tokenizer: {
      custom_tokenizer: {
        type: "nGram",
        min_gram: "3",
        max_gram: "3",
        token_chars: [ "letter", "digit" ]
      }
    }
  } do
    mappings dynamic: 'false' do
      indexes :searchable, analyzer: "custom_analyzer"
    end
  end

Is this possible?

Answer 1

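You can add the unique token filter to your analyzer to achieve this.
It is configurable: it can remove duplicate tokens only when they occur at the same position (for example, synonyms), or duplicates at any position.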
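
As a rough sketch of how this could look with the settings from the question: the hash below appends a filter of type "unique" to the analyzer's filter chain. The filter name custom_unique is made up for this example, and only_on_same_position is written out with its default value (false) purely for clarity. The small trigrams helper is not part of ElasticSearch; it only approximates the lowercase plus 3-gram steps in plain Ruby (ignoring asciifolding and word_delimiter) to illustrate why both example strings would end up with the same set of tokens.

```ruby
require "json"

# Analysis settings from the question, with a "unique" token filter added
# at the end of the chain. "custom_unique" is a hypothetical name chosen
# for this sketch; the filter type itself is "unique".
ANALYSIS = {
  analyzer: {
    custom_analyzer: {
      tokenizer: "custom_tokenizer",
      filter: ["lowercase", "asciifolding", "custom_spliter", "custom_unique"]
    }
  },
  filter: {
    custom_spliter: { type: "word_delimiter", preserve_original: "true" },
    # false (the default) drops duplicates at any position;
    # true would only drop duplicates sharing the same position.
    custom_unique: { type: "unique", only_on_same_position: false }
  },
  tokenizer: {
    custom_tokenizer: {
      type: "nGram",
      min_gram: "3",
      max_gram: "3",
      token_chars: ["letter", "digit"]
    }
  }
}.freeze

# Plain-Ruby approximation of the tokenizer and lowercase filter only:
# split on anything that is not a letter or digit, then emit 3-grams.
def trigrams(text)
  text.downcase.scan(/[[:alnum:]]+/).flat_map do |chunk|
    chunk.chars.each_cons(3).map(&:join)
  end
end

with_duplicates = trigrams("Hello World Hello Mr !")
deduplicated    = with_duplicates.uniq # what the unique filter is meant to do

puts JSON.pretty_generate(analysis: ANALYSIS)
puts with_duplicates.inspect
puts deduplicated.inspect
```

In this approximation, "Hello World Hello Mr !" and "Hello World" yield the same six trigrams once duplicates are dropped ("Mr" is shorter than min_gram and produces no token), so the repeated "Hello" should no longer raise the term frequency. Note that the filter deduplicates tokens, which here are trigrams rather than whole words, and that documents indexed before the analyzer change would need to be reindexed.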
