Lucene TF-IDF does not return 1 for a query identical to a document

Question

I implemented a program that ranks documents by their TF-IDF similarity score against a given user input.

The program is as follows:

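import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Ranking {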

    private static int maxHits = 10;
    private static Connection connect = null;
    private static PreparedStatement preparedStatement = null;
    private static ResultSet resultSet = null;

    public static void main(String[] args) throws Exception {        
        System.out.println("Enter your paper title: ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        String paperTitle = null;
        paperTitle = br.readLine(); 

        Class.forName("com.mysql.jdbc.Driver");
        connect = DriverManager.getConnection("jdbc:mysql://localhost/arnetminer?"
                  + "user=root&password=1234");
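        // Bind the title as a parameter instead of concatenating it into the SQL,
        // so titles containing quotes do not break (or inject into) the statement
        preparedStatement = connect.prepareStatement
        ("SELECT stoppedstemmedtitle from arnetminer.new_bigdataset "
                + "where title = ?");
        preparedStatement.setString(1, paperTitle);
        resultSet = preparedStatement.executeQuery();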
        resultSet.next();
        String stoppedstemmedtitle = resultSet.getString(1);

        String querystr = args.length > 0 ? args[0] :stoppedstemmedtitle;
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        Query q = new QueryParser(Version.LUCENE_42, "stoppedstemmedtitle", analyzer).parse(querystr);

        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("E:/Lucene/new_bigdataset_index")));        
        IndexSearcher searcher = new IndexSearcher(reader);

        VSMSimilarity vsmSimiliarty = new VSMSimilarity();  
        searcher.setSimilarity(vsmSimiliarty);
        TopDocs hits = searcher.search(q, maxHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;

        PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");

        int counter = 0;
        for (int n = 0; n < scoreDocs.length; ++n) {
            ScoreDoc sd = scoreDocs[n];
            System.out.println(scoreDocs[n]);
            float score = sd.score;
            int docId = sd.doc;
            Document d = searcher.doc(docId);
            String fileName = d.get("title");
            String year = d.get("pub_year");
            String paperkey = d.get("paperkey");
            System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            ++counter;
        }
        writer.close();

    }


}

and

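import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class VSMSimilarity extends DefaultSimilarity {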

    // Weighting codes
    public boolean doBasic     = true;  // Basic tf-idf
    public boolean doSublinear = false; // Sublinear tf-idf
    public boolean doBoolean   = false; // Boolean

    //Scoring codes
    public boolean doCosine    = true;
    public boolean doOverlap   = false;

    // term frequency in document = measure of how often a term appears in the document
    public float tf(int freq) {     

        return super.tf(freq);
    }

    // inverse document frequency = measure of how often the term appears across the index
    public float idf(int docFreq, int numDocs) {

        // The default behaviour of Lucene is 1 + log (numDocs/(docFreq+1)), which is what we want (default VSM model)
        return super.idf(docFreq, numDocs); 
    }

    // normalization factor so that queries can be compared 
    public float queryNorm(float sumOfSquaredWeights){

        return super.queryNorm(sumOfSquaredWeights);
    }

    // number of terms in the query that were found in the document
    public float coord(int overlap, int maxOverlap) {

        // else: can't get here
        return super.coord(overlap, maxOverlap);
    }

    // Note: this happens at index time, which we don't take advantage of (too many indices!)
    public float computeNorm(String fieldName, FieldInvertState state){

        // else: can't get here
        return super.computeNorm(state);
    }
}

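However, for the exact document that is 100% similar to the input, it does not return a value of 1.

If I enter the user input "Logic Based Knowledge Representation", I get the following output and TF-IDF scores (5.165 for the document that is 100% similar to the input):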

3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663

Is this normal, or is something wrong with my TF-IDF implementation?

Thank you very much!

Answer 1

First, Lucene already has a TF-IDF similarity: org.apache.lucene.search.similarities.TFIDFSimilarity

Second:

tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus

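Note that this definition speaks of a single word, so a tf-idf value on its own only applies to a one-word query. When the query has several words, tf-idf ranking does this: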

One of the simplest ranking functions is computed by summing the tf–idf for each query term

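So that is why tf-idf can return a score greater than 1: the score is a sum of per-term weights, not a similarity normalized to the range 0 to 1.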
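To make the point concrete, here is a minimal sketch of the "sum the tf-idf of each query term" ranking function quoted above. It is not Lucene's exact scoring formula (Lucene's TFIDFSimilarity also applies factors such as queryNorm, coord and field norms); the class name TfIdfSumDemo, the toy corpus and the tokens are hypothetical and only illustrate the arithmetic. The idf follows the form mentioned in the question's comment, 1 + log(numDocs / (docFreq + 1)).

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSumDemo {

    // Hypothetical, already tokenized toy corpus (not the asker's index).
    public static List<List<String>> sampleCorpus() {
        return Arrays.asList(
            Arrays.asList("logic", "based", "knowledge", "representation"),
            Arrays.asList("logic", "knowledge", "representation"),
            Arrays.asList("fuzzy", "logic"),
            Arrays.asList("database", "systems"));
    }

    // idf in the form quoted in the question: 1 + ln(numDocs / (docFreq + 1)).
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Raw term frequency: how many times the term occurs in the document.
    public static int tf(String term, List<String> doc) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Number of documents in the corpus that contain the term at least once.
    public static int docFreq(String term, List<List<String>> corpus) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Score = sum over the query terms of tf(term, doc) * idf(term).
    // Nothing in this sum bounds the result by 1.
    public static double score(List<String> query, List<String> doc,
                               List<List<String>> corpus) {
        double sum = 0.0;
        for (String term : query) {
            sum += tf(term, doc) * idf(docFreq(term, corpus), corpus.size());
        }
        return sum;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = sampleCorpus();
        // The query is identical to the first document.
        List<String> query = corpus.get(0);
        for (List<String> doc : corpus) {
            System.out.printf("%s -> %.3f%n", doc, score(query, doc, corpus));
        }
    }
}
```

In this sketch the document identical to the query ranks first, but its score is simply the largest sum, not 1. The same holds for the question's output: 5.165 is the top score, and only the ordering of the scores is meaningful.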

