How to find the documents in which all 3 query words appear (Java)
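I have built an inverted index (wordTodocumentQueryMap) for a collection of files. For each word it contains the numbers of the files the word appears in and its frequency in each file,
like this: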
experiment 1:1 17:1 30:1 39:1 52:1 109:2
*************
empirical 1:1 38:3 58:1 109:1 110:1
*************
flow: 1:1 2:6 3:2 4:3 6:1 7:3 9:3 16:1 17:1
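Now I need to run a query (of about 3 words), and the result should be the documents in which all of the words appear. For the query (experiment empirical flow) the result should be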
1 : 3
where 1 is the document number and 3 is the total frequency of the query words.
But my result is:
1 : 3 2 : 6 3 : 2 4 : 3 6 : 1 7 : 3 9 : 3 16 : 1 17 : 2
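The problem is that it lists every file that contains any of the words, not only the files that contain all of them.
Here is the code I have so far: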
public static TreeMap<Integer, Integer> FileScore=new TreeMap<>();
In main:
for(Map.Entry<String, Map<Integer,Integer>> wordTodocument : wordTodocumentQueryMap.entrySet())
{
Map<Integer, Integer> documentToFrecuency_value = wordTodocument.getValue();
for(Map.Entry<Integer, Integer> documentToFrecuency : documentToFrecuency_value.entrySet())
{
int documentNo = documentToFrecuency.getKey();
int wordCount = documentToFrecuency.getValue();
int score=getScore(documentNo);
FileScore.put(documentNo, score+wordCount);
}
}
//print the score
for(Map.Entry<Integer,Integer> FileToScore : FileScore.entrySet())
{
int documentNo = FileToScore.getKey();
int Score = FileToScore.getValue();
System.out.print( documentNo +" : "+ Score+"\t");
}
public static int getScore (int fileno){
if(FileScore.containsKey(fileno))
return FileScore.get(fileno);
return 0;
}
The following method should do it.
/**
* Finds documents where all the given words appear.
*
* @param wordTodocumentQueryMap For each word maps file no. to frequency > 0
* @param firstWord
* @param otherWords
* @return A frequency map containing file no. of files containing all of firstWord and otherWords mapped
* to a sum of counts for the words.
*/
public static Map<Integer, Integer> docsWithAllWords(Map<String, Map<Integer, Integer>> wordTodocumentQueryMap,
String firstWord, String... otherWords) {
// result
Map<Integer, Integer> fileScore = new TreeMap<>();
Map<Integer, Integer> firstWordCounts = wordTodocumentQueryMap.get(firstWord);
if (firstWordCounts == null) { // first word not found in any doc
// return empty result
return fileScore;
}
outer: for (Map.Entry<Integer, Integer> firstWordCountsEntry : firstWordCounts.entrySet()) {
Integer docNo = firstWordCountsEntry.getKey();
int sumOfCounts = firstWordCountsEntry.getValue();
// find out if both/all other words are in doc, and sum counts
for (String word : otherWords) {
Map<Integer, Integer> wordCountEntry = wordTodocumentQueryMap.get(word);
if (wordCountEntry == null) {
return fileScore;
}
Integer wordCount = wordCountEntry.get(docNo);
if (wordCount == null) { // word not found in doc
continue outer;
}
sumOfCounts += wordCount;
}
fileScore.put(docNo, sumOfCounts);
}
return fileScore;
}
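It uses a feature that is seldom used in Java: a label, outer. If you find it too unusual (or you just don't like the continue statement), you can rewrite the loop to use a boolean instead. Now you can call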
docsWithAllWords(wordTodocumentQueryMap, "experiment", "empirical", "flow")
and it will give you only 1 : 3 and nothing else.
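For readers who prefer to avoid the label, here is a minimal sketch of the boolean rewrite mentioned above. The class name AllWordsQuery, the method name docsWithAllWordsFlagged, and the helpers postings and sampleIndex are hypothetical names made up for this illustration; the sample index copies the three posting lists from the question, with the key assumed to be "flow" without the trailing colon.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class AllWordsQuery {

    /** Same idea as docsWithAllWords, but a boolean flag replaces the labeled continue. */
    static Map<Integer, Integer> docsWithAllWordsFlagged(
            Map<String, Map<Integer, Integer>> index, String firstWord, String... otherWords) {
        Map<Integer, Integer> fileScore = new TreeMap<>();
        Map<Integer, Integer> firstWordCounts = index.get(firstWord);
        if (firstWordCounts == null) { // first word not found in any doc
            return fileScore;
        }
        for (Map.Entry<Integer, Integer> entry : firstWordCounts.entrySet()) {
            Integer docNo = entry.getKey();
            int sumOfCounts = entry.getValue();
            boolean inAllLists = true;
            for (String word : otherWords) {
                Map<Integer, Integer> counts = index.get(word);
                // a word missing from the index is treated like a word missing from the doc
                Integer wordCount = (counts == null) ? null : counts.get(docNo);
                if (wordCount == null) {
                    inAllLists = false;
                    break;
                }
                sumOfCounts += wordCount;
            }
            if (inAllLists) {
                fileScore.put(docNo, sumOfCounts);
            }
        }
        return fileScore;
    }

    /** Hypothetical helper: builds a posting list from alternating docNo, count values. */
    static Map<Integer, Integer> postings(int... docAndCount) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (int i = 0; i + 1 < docAndCount.length; i += 2) {
            result.put(docAndCount[i], docAndCount[i + 1]);
        }
        return result;
    }

    /** Hypothetical helper: the three posting lists shown in the question. */
    static Map<String, Map<Integer, Integer>> sampleIndex() {
        Map<String, Map<Integer, Integer>> index = new HashMap<>();
        index.put("experiment", postings(1, 1, 17, 1, 30, 1, 39, 1, 52, 1, 109, 2));
        index.put("empirical", postings(1, 1, 38, 3, 58, 1, 109, 1, 110, 1));
        index.put("flow", postings(1, 1, 2, 6, 3, 2, 4, 3, 6, 1, 7, 3, 9, 3, 16, 1, 17, 1));
        return index;
    }

    public static void main(String[] args) {
        // prints {1=3}
        System.out.println(docsWithAllWordsFlagged(sampleIndex(), "experiment", "empirical", "flow"));
    }
}
```

Whether the flag reads better than the label is a matter of taste; both versions stop checking a document as soon as one query word is missing from it.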