Customize score for certain condition in Lucene TFIDF

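I have a program that accepts an input query and ranks similar documents by their TF-IDF score. The problem is that I want to add some keywords and treat them as "input" too. These keywords are different for every query.

For example, if the query is "Logic Based Knowledge Representation", the words are as follows: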

Level 0 keywords: [logic, base, knowledg, represent]

Level 1 keywords: [tempor, modal, logic, resolut, method, decis, problem,
                   reason, revis, hybrid, represent]

Level 2 keywords: [classif, queri, process, techniqu, candid, semant, data, 
                   model, knowledg, base, commun, softwar, engin, subsumpt,
                   kl, undecid, classic, structur, object, field]

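I want to score each level differently: for terms in a document that match level 0 words, multiply the score by 1; for terms that match level 1 words, multiply the score by 0.8; and for terms that match level 2 words, multiply the score by 0.64.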
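To make the intended arithmetic concrete, here is a minimal sketch in plain Java (no Lucene involved). It assumes the multiplier for a level is 0.8 raised to that level, which reproduces 1, 0.8 and 0.64. The class name `LevelWeighting`, its methods, and the per-term scores are all hypothetical and are not real Lucene output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LevelWeighting {

    // Multiplier for a keyword level: 0.8^level, i.e. 1.0, 0.8, 0.64 for levels 0, 1, 2.
    public static double levelMultiplier(int level) {
        return Math.pow(0.8, level);
    }

    // Sums the per-term scores of one document, scaling each term's score
    // by the multiplier of the level that term belongs to.
    public static double weightedScore(Map<String, Double> termScores,
                                       Map<String, Integer> termLevels) {
        double total = 0.0;
        for (Map.Entry<String, Double> e : termScores.entrySet()) {
            Integer level = termLevels.get(e.getKey());
            if (level == null) {
                continue; // term is not one of the keywords, so it contributes nothing
            }
            total += e.getValue() * levelMultiplier(level);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical per-term scores for a single document.
        Map<String, Double> termScores = new LinkedHashMap<>();
        termScores.put("logic", 2.0);
        termScores.put("modal", 1.0);
        termScores.put("queri", 1.0);

        // Keyword levels taken from the example lists above.
        Map<String, Integer> termLevels = new LinkedHashMap<>();
        termLevels.put("logic", 0);
        termLevels.put("modal", 1);
        termLevels.put("queri", 2);

        // 2.0 * 1 + 1.0 * 0.8 + 1.0 * 0.64, roughly 3.44
        System.out.println(weightedScore(termScores, termLevels));
    }
}
```

The point is that the multiplier applies to each term's contribution separately, not to the document's total score.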

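My intention is to expand the input query while ensuring that documents containing more level 0 keywords are treated as more important, and documents containing level 1 and level 2 keywords as less important (even though the input has been expanded). I have not included this in my program yet. So far, my program only computes the TF-IDF score of every document against the query and ranks the results: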

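import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Ranking {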

    private static int maxHits = 2000000;

    public static void main(String[] args) throws Exception {        
        System.out.println("Enter your paper title: ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

        String paperTitle = null;
        paperTitle = br.readLine(); 

       // CitedKeywords ckeywords = new CitedKeywords();
       // ckeywords.readDataBase(paperTitle);

        String querystr = args.length > 0 ? args[0] :paperTitle;
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        Query q = new QueryParser(Version.LUCENE_42, "title", analyzer)
            .parse(querystr);

        IndexReader reader = DirectoryReader.open(
                             FSDirectory.open(
                             new File("E:/Lucene/new_bigdataset_index")));        

        IndexSearcher searcher = new IndexSearcher(reader);

        VSMSimilarity vsmSimiliarty = new VSMSimilarity();  
        searcher.setSimilarity(vsmSimiliarty);
        TopDocs hits = searcher.search(q, maxHits);
        ScoreDoc[] scoreDocs = hits.scoreDocs;

        PrintWriter writer = new PrintWriter("E:/Lucene/result/1.txt", "UTF-8");

        int counter = 0;
        for (int n = 0; n < scoreDocs.length; ++n) {
            ScoreDoc sd = scoreDocs[n];
            float score = sd.score;
            int docId = sd.doc;
            Document d = searcher.doc(docId);
            String fileName = d.get("title");
            String year = d.get("pub_year");
            String paperkey = d.get("paperkey");
            System.out.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            writer.printf("%s,%s,%s,%4.3f\n", paperkey, fileName, year, score);
            ++counter;
        }
        writer.close();
        reader.close();
    }
}    

--

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class VSMSimilarity extends DefaultSimilarity{

    // Weighting codes
    public boolean doBasic     = true;  // Basic tf-idf
    public boolean doSublinear = false; // Sublinear tf-idf
    public boolean doBoolean   = false; // Boolean

    //Scoring codes
    public boolean doCosine    = true;
    public boolean doOverlap   = false;

    private static final long serialVersionUID = 4697609598242172599L;

    // term frequency in document = 
    // measure of how often a term appears in the document
    public float tf(int freq) {     
        // Sublinear tf weighting. Equation taken from [1], pg 127, eq 6.13.
        if (doSublinear){
            if (freq > 0){
                return 1 + (float)Math.log(freq);
            } else {
                return 0;
            }
        } else if (doBoolean){
            return 1;
        }
        // else: doBasic
        // The default behaviour of Lucene is sqrt(freq), 
        // but we are implementing the basic VSM model
        return freq;
    }

    // inverse document frequency = 
    // measure of how often the term appears across the index
    public float idf(int docFreq, int numDocs) {
        if (doBoolean || doOverlap){
            return 1;
        }
        // The default behaviour of Lucene is 
        // 1 + log (numDocs/(docFreq+1)), 
        // which is what we want (default VSM model)
        return super.idf(docFreq, numDocs); 
    }

    // normalization factor so that queries can be compared 
    public float queryNorm(float sumOfSquaredWeights){
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return super.queryNorm(sumOfSquaredWeights);
        }
        // else: can't get here
        return super.queryNorm(sumOfSquaredWeights);
    }

    // number of terms in the query that were found in the document
    public float coord(int overlap, int maxOverlap) {
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return 1;
        }
        // else: can't get here
        return super.coord(overlap, maxOverlap);
    }

    // Note: this happens at index time, which we don't take advantage of
    // (too many indices!)
    public float computeNorm(String fieldName, FieldInvertState state){
        if (doOverlap){
            return 1;
        } else if (doCosine){
            return super.computeNorm(state);
        }
        // else: can't get here
        return super.computeNorm(state);
    }
}

Here is sample output from my current program (without any score boosting):

3086,Logic Based Knowledge Representation.,1999,5.165
33586,A Logic for the Representation of Spatial Knowledge.,1991,4.663
328937,Logic Programming for Knowledge Representation.,2007,4.663
219720,Logic for Knowledge Representation.,1984,4.663
487587,Knowledge Representation with Logic Programs.,1997,4.663
806195,Logic Programming as a Representation of Knowledge.,1983,4.663
806833,The Role of Logic in Knowledge Representation.,1983,4.663
744914,Knowledge Representation and Logic Programming.,2002,4.663
1113802,Knowledge Representation in Fuzzy Logic.,1989,4.663
984276,Logic Programming and Knowledge Representation.,1994,4.663

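Can anyone tell me how to adjust the score for the conditions I mentioned above? Does Lucene provide this kind of functionality? Can I integrate it into the VSMSimilarity class?

EDIT: I found this in the Lucene documentation: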

 public void setBoost(float b)

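Sets the boost for this query clause to b. Documents matching this clause will (in addition to the normal weightings) have their score multiplied by b.

Unfortunately, this seems to multiply the score at the document level. I want to multiply scores at the term level, but I have not found a way to do that. For example, if a document contains words from both level 0 and level 1, only the words from level 1 should be multiplied by 0.8.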

You can use Lucene term boosting.

https://lucene.apache.org/core/5_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boosting_a_Term

Augment your query like this (assuming OR is the default operator):

logic base knowledge representation temporal^0.8 modal^0.8 classification^0.64...

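and use one of the standard similarity providers.

PS: I noticed LUCENE_42 in your example. This feature exists in almost any version of Lucene (I remember it being there in 2.4.9).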
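If you want to generate that query string from your keyword levels rather than type it by hand, something along these lines might work. It is only a sketch in plain Java: the class `BoostedQueryBuilder` and its `build` method are hypothetical helpers, not part of Lucene, and the decay factor of 0.8 per level is taken from your question. When a keyword appears in more than one level, this version keeps the lowest level, which is an assumption on my part.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class BoostedQueryBuilder {

    // levels.get(i) holds the keywords of level i; the boost of level i is decay^i.
    // Level 0 terms are emitted without a suffix (boost 1), others as term^boost.
    public static String build(List<List<String>> levels, double decay) {
        // Remember the lowest level at which each keyword occurs, in insertion order.
        Map<String, Integer> levelOf = new LinkedHashMap<>();
        for (int level = 0; level < levels.size(); level++) {
            for (String term : levels.get(level)) {
                levelOf.putIfAbsent(term, level);
            }
        }

        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, Integer> e : levelOf.entrySet()) {
            int level = e.getValue();
            if (level == 0) {
                parts.add(e.getKey());
            } else {
                // Locale.ROOT keeps the decimal separator a dot on every machine.
                String boost = String.format(Locale.ROOT, "%.2f", Math.pow(decay, level));
                parts.add(e.getKey() + "^" + boost);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        List<List<String>> levels = Arrays.asList(
                Arrays.asList("logic", "base"),    // level 0
                Arrays.asList("tempor", "logic"),  // level 1 ("logic" is already level 0)
                Arrays.asList("classif"));         // level 2
        // prints: logic base tempor^0.80 classif^0.64
        System.out.println(build(levels, 0.8));
    }
}
```

The resulting string can then be passed to the same `QueryParser.parse(...)` call you already have. Note that the sketch does not escape query-parser special characters, and the keywords only match if they are in the same form as the indexed terms (StandardAnalyzer does not stem, so stems like "knowledg" would need a stemming analyzer at index time).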