How to find the documents in which all 3 query words appear (Java)

Question

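I have built an inverted index (wordTodocumentQueryMap) for a collection of files. For each word, it contains the numbers of the files the word occurs in and the word's frequency in each file.

Like this: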

experiment      1:1     17:1    30:1    39:1    52:1    109:2
*************
empirical       1:1     38:3    58:1    109:1   110:1   
*************
flow:           1:1     2:6     3:2     4:3     6:1      7:3     9:3     16:1   17:1

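Now I need to run a query (of about 3 words), and the result should be only the documents in which all of the words appear. For the query (experiment empirical flow), the result should be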

1 : 3

where 1 is the document number and 3 is the total frequency of the query words in that document.

But my result is:

1 : 3   2 : 6   3 : 2   4 : 3   6 : 1   7 : 3   9 : 3   16 : 1  17 : 2

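The problem is that it lists every file in which any one of the words appears, rather than only the files that contain all of them.

This is the code I have so far: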

public static TreeMap<Integer, Integer> FileScore=new TreeMap<>();

In main:

for(Map.Entry<String, Map<Integer,Integer>> wordTodocument : wordTodocumentQueryMap.entrySet())
    {
    Map<Integer, Integer> documentToFrecuency_value = wordTodocument.getValue();
        for(Map.Entry<Integer, Integer> documentToFrecuency : documentToFrecuency_value.entrySet())
            {
             int documentNo = documentToFrecuency.getKey();
             int wordCount = documentToFrecuency.getValue();
             int score=getScore(documentNo);

                 FileScore.put(documentNo, score+wordCount);
         }

    }

//print the score

for(Map.Entry<Integer,Integer> FileToScore : FileScore.entrySet())
{
       int documentNo = FileToScore.getKey();
       int Score = FileToScore.getValue();
       System.out.print( documentNo +" : "+ Score+"\t");

    }


public static int getScore (int fileno){
if(FileScore.containsKey(fileno))
    return FileScore.get(fileno);
return 0;
}

Answer 1

The following method should do it.

/**
 * Finds documents where all the given words appear.
 * 
 * @param wordTodocumentQueryMap For each word maps file no. to frequency > 0
 * @param firstWord The first query word
 * @param otherWords The remaining query words
 * @return A frequency map containing file no. of files containing all of firstWord and otherWords mapped
 *         to a sum of counts for the words.
 */
public static Map<Integer, Integer> docsWithAllWords(Map<String, Map<Integer, Integer>> wordTodocumentQueryMap,
        String firstWord, String... otherWords) {
    // result
    Map<Integer, Integer> fileScore = new TreeMap<>();
    Map<Integer, Integer> firstWordCounts = wordTodocumentQueryMap.get(firstWord);
    if (firstWordCounts == null) { // first word not found in any doc
        // return empty result
        return fileScore;
    }
    outer: for (Map.Entry<Integer, Integer> firstWordCountsEntry : firstWordCounts.entrySet()) {
        Integer docNo = firstWordCountsEntry.getKey();
        int sumOfCounts = firstWordCountsEntry.getValue();
        // find out if both/all other words are in doc, and sum counts
        for (String word : otherWords) {
            Map<Integer, Integer> wordCountEntry = wordTodocumentQueryMap.get(word);
            if (wordCountEntry == null) {
                return fileScore;
            }
            Integer wordCount = wordCountEntry.get(docNo);
            if (wordCount == null) { // word not found in doc
                continue outer;
            }
            sumOfCounts += wordCount;
        }
        fileScore.put(docNo, sumOfCounts);
    }
    return fileScore;
}

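It uses a feature that is rarely seen in Java: a label, outer. If you find that too unusual (or just don't like the continue statement), you can rewrite the method to use a boolean instead. Now you can call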

docsWithAllWords(wordTodocumentQueryMap, "experiment", "empirical", "flow")

and it will give you just 1 : 3 and nothing else.
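The boolean rewrite mentioned above could look roughly like the sketch below. It is only one possible way to do it, not a definitive version: the class name AllWordsQuery, the flag name inAllWords and the sample data in main are made up for illustration (the data is copied from the index excerpt in the question), and Map.of assumes Java 9 or later. It swaps the labeled continue for a flag plus a plain break.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical wrapper class, used only to make the sketch runnable.
class AllWordsQuery {

    /**
     * Same contract as docsWithAllWords above, but without the label:
     * returns file no. -> summed frequency for files containing every query word.
     */
    static Map<Integer, Integer> docsWithAllWords(
            Map<String, Map<Integer, Integer>> wordTodocumentQueryMap,
            String firstWord, String... otherWords) {
        Map<Integer, Integer> fileScore = new TreeMap<>();
        Map<Integer, Integer> firstWordCounts = wordTodocumentQueryMap.get(firstWord);
        if (firstWordCounts == null) { // first word not found in any doc
            return fileScore;
        }
        for (Map.Entry<Integer, Integer> entry : firstWordCounts.entrySet()) {
            Integer docNo = entry.getKey();
            int sumOfCounts = entry.getValue();
            boolean inAllWords = true; // flag replaces "continue outer"
            for (String word : otherWords) {
                Map<Integer, Integer> counts = wordTodocumentQueryMap.get(word);
                Integer wordCount = (counts == null) ? null : counts.get(docNo);
                if (wordCount == null) { // word missing from this doc (or from the index)
                    inAllWords = false;
                    break;
                }
                sumOfCounts += wordCount;
            }
            if (inAllWords) {
                fileScore.put(docNo, sumOfCounts);
            }
        }
        return fileScore;
    }

    public static void main(String[] args) {
        // Sample index taken from the excerpt in the question.
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        index.put("experiment", Map.of(1, 1, 17, 1, 30, 1, 39, 1, 52, 1, 109, 2));
        index.put("empirical", Map.of(1, 1, 38, 3, 58, 1, 109, 1, 110, 1));
        index.put("flow", Map.of(1, 1, 2, 6, 3, 2, 4, 3, 6, 1, 7, 3, 9, 3, 16, 1, 17, 1));
        System.out.println(docsWithAllWords(index, "experiment", "empirical", "flow")); // prints {1=3}
    }
}
```

Document 17 is dropped because empirical does not occur in it, and document 109 is dropped because flow does not, which leaves document 1 with a total of 1 + 1 + 1 = 3.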

java

information-retrieval

treemap