How does more_like_this in Elasticsearch work (against the whole index)?

Question

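So first we get a list of termVectors containing all the tokens, then we create a map<token, frequency in the document>. Then the createQueue method determines the scores: it removes stop words and words that do not occur often enough, computes the idf, and then idf * doc_frequency for a given token gives that token's score. We then keep the 25 best ones. But how does it work after that? How is this compared against the whole index? I read http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ but it did not explain that, or I missed the point.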

Answer 1

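It creates a TermQuery from each of those terms and puts them all into a simple BooleanQuery, boosting each one by the tf-idf score calculated earlier (boostFactor * myScore / bestScore, where boostFactor can be set by the user).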

Here is the source (version 5.0):

private Query createQuery(PriorityQueue<ScoreTerm> q) {
  BooleanQuery query = new BooleanQuery();
  ScoreTerm scoreTerm;
  float bestScore = -1;

  while ((scoreTerm = q.pop()) != null) {
    TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));

    if (boost) {
      if (bestScore == -1) {
        bestScore = (scoreTerm.score);
      }
      float myScore = (scoreTerm.score);
      tq.setBoost(boostFactor * myScore / bestScore);
    }

    try {
      query.add(tq, BooleanClause.Occur.SHOULD);
    }
    catch (BooleanQuery.TooManyClauses ignore) {
      break;
    }
  }
  return query;
}
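To make the "compared against the whole index" part concrete, here is a minimal, Lucene-free sketch of the idea. It is only an illustration under stated assumptions: the class MltSketch, its methods, and the sample terms and scores are hypothetical, and the scoring is deliberately simplified to a sum of the boosts of the matching SHOULD clauses. In Lucene itself the generated BooleanQuery is executed like any other query through the inverted index, and each matching clause contributes a score computed by the configured similarity, scaled by its boost.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class MltSketch {

    // Hypothetical stand-in for the boosted TermQuery clauses: term -> boost.
    // The input must be ordered best score first, mirroring how createQuery
    // treats the first popped score as bestScore.
    static Map<String, Float> boostedClauses(Map<String, Float> termScoresBestFirst,
                                             float boostFactor) {
        Map<String, Float> clauses = new LinkedHashMap<>();
        float bestScore = -1f;
        for (Map.Entry<String, Float> e : termScoresBestFirst.entrySet()) {
            if (bestScore == -1f) {
                bestScore = e.getValue();
            }
            // Same formula as in the answer: boostFactor * myScore / bestScore.
            clauses.put(e.getKey(), boostFactor * e.getValue() / bestScore);
        }
        return clauses;
    }

    // Simplified assumption: a document's score is the sum of the boosts of
    // the SHOULD clauses whose term it contains. Documents that match no
    // clause score 0 and would not be returned at all.
    static float score(Map<String, Float> clauses, Set<String> docTerms) {
        float total = 0f;
        for (Map.Entry<String, Float> clause : clauses.entrySet()) {
            if (docTerms.contains(clause.getKey())) {
                total += clause.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up tf-idf scores for the selected terms, best first.
        Map<String, Float> selected = new LinkedHashMap<>();
        selected.put("lucene", 8.0f);
        selected.put("index", 4.0f);
        selected.put("query", 2.0f);

        Map<String, Float> clauses = boostedClauses(selected, 1.0f);

        // A tiny made-up "index" of two documents.
        Set<String> docA = new HashSet<>(Arrays.asList("lucene", "query", "java"));
        Set<String> docB = new HashSet<>(Arrays.asList("cooking", "recipes"));

        System.out.println("clauses = " + clauses);
        System.out.println("docA score = " + score(clauses, docA));
        System.out.println("docB score = " + score(clauses, docB));
    }
}
```

The point of the sketch is that no document-to-document comparison happens: the source document is reduced to a handful of weighted terms, and the index is then searched for documents containing any of them, with the higher-weighted terms counting for more.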

lucene

indexing

comparison

elasticsearch

morelikethis