Does Lucene (Java framework) calculate the tf-idf and cosine similarity of a document against a term by default?
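I am developing a search-engine-based application and am studying the Lucene Java framework. I am confused about the scoring that Lucene provides by default: does the default scoring function implement tf-idf and cosine similarity, or do we have to do something else?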
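    import java.io.IOException;

    import org.apache.lucene.document.Document;
    // Package name assumes Lucene 4+; in Lucene 3.x it is org.apache.lucene.queryParser.ParseException.
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    // Indexer, Searcher, TextFileFilter and LuceneConstants are helper classes from the same project (not shown).
    public class LuceneTester {
        // Backslashes must be escaped inside Java string literals.
        String indexDir = "C:\\Users\\hamda\\Documents\\NetBeansProjects\\luceneDemo\\Index";
        String dataDir = "C:\\Users\\hamda\\Documents\\NetBeansProjects\\luceneDemo\\Data";
        Indexer indexer;
        Searcher searcher;

        public static void main(String[] args) {
            LuceneTester tester;
            try {
                tester = new LuceneTester();
                tester.createIndex();
                tester.search("DataGuides");
            } catch (IOException e) {
                e.printStackTrace();
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }

        // Index every text file under dataDir and report how long it took.
        private void createIndex() throws IOException {
            indexer = new Indexer(indexDir);
            int numIndexed;
            long startTime = System.currentTimeMillis();
            numIndexed = indexer.createIndex(dataDir, new TextFileFilter());
            long endTime = System.currentTimeMillis();
            indexer.close();
            System.out.println(numIndexed + " files indexed, time taken: "
                + (endTime - startTime) + " ms");
        }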
I get the document scores at the end of the search function below:
        // Run the query and print each hit's score next to its file path.
        private void search(String searchQuery) throws IOException, ParseException {
            searcher = new Searcher(indexDir);
            long startTime = System.currentTimeMillis();
            TopDocs hits = searcher.search(searchQuery);
            long endTime = System.currentTimeMillis();
            System.out.println(hits.totalHits +
                " documents found. Time: " + (endTime - startTime) + " ms");
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = searcher.getDocument(scoreDoc);
                System.out.println(scoreDoc.score + " File: "
                    + doc.get(LuceneConstants.FILE_PATH));
            }
            searcher.close();
        }
    }
I googled it and found this:
how can I implement the tf-idf and cosine similarity in Lucene?
Any help would be greatly appreciated :)
When I looked into the details on http://lucene.apache.org/, I found that Lucene's scoring model by default uses the class DefaultSimilarity, which extends the TFIDFSimilarity class: http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
So according to that documentation, the scoring model implements tf-idf and cosine similarity by default. I may be wrong somewhere, so please correct me :)
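For reference, the TFIDFSimilarity documentation derives its scoring from the cosine similarity between query and document vectors, and then gives a practical scoring function in roughly the following form. The notation here is a paraphrase of that Javadoc rather than an exact copy, and `boost(t)` stands in for the term's query-time boost:

```latex
\[
\operatorname{cosine\text{-}similarity}(q,d) = \frac{V(q)\cdot V(d)}{\lvert V(q)\rvert\,\lvert V(d)\rvert}
\]
\[
\operatorname{score}(q,d) = \operatorname{coord}(q,d)\cdot\operatorname{queryNorm}(q)\cdot
\sum_{t \in q}\Bigl(\operatorname{tf}(t \in d)\cdot\operatorname{idf}(t)^{2}\cdot\operatorname{boost}(t)\cdot\operatorname{norm}(t,d)\Bigr)
\]
\[
\operatorname{tf}(t \in d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\operatorname{idf}(t) = 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t)+1}
\]
```

So the default in that version is a tf-idf weighting whose normalization factors approximate cosine similarity, rather than an exact cosine computation.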
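As of Lucene 6.0, the default similarity implementation is BM25Similarity, which implements BM25.
If you want to use the old default similarity implementation, use ClassicSimilarity.
To compare the two, you can look at:
- Doug Turnbull's BM25 The Next Generation of Lucene Relevance
- ElasticSearch's BM25 vs Lucene Default Similarity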
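
To make the difference between the two models concrete, here is a minimal plain-Java sketch of the per-term weights in their textbook forms. It does not use Lucene at all: the class name `SimilaritySketch`, its method names, and the collection statistics in `main` are hypothetical, and Lucene's real implementations differ in details (the classic model also applies length normalization and boosts, and newer BM25Similarity versions drop the constant `k1 + 1` factor).

```java
import java.util.Locale;

/** Textbook per-term weights for illustration only; this is not Lucene's code. */
final class SimilaritySketch {

    // Common BM25 parameter choices (assumed here; tune for your own data).
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Classic tf-idf weight: sqrt(tf) * idf, without length normalization or boosts. */
    static double classicTfIdf(int termFreq, long docFreq, long docCount) {
        double tf = Math.sqrt(termFreq);
        double idf = 1.0 + Math.log((docCount + 1.0) / (docFreq + 1.0));
        return tf * idf;
    }

    /** BM25 weight: idf times a term-frequency component that saturates as tf grows. */
    static double bm25(int termFreq, long docFreq, long docCount,
                       double docLen, double avgDocLen) {
        double idf = Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
        // Documents longer than average are penalized, shorter ones are favored.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        return idf * (termFreq * (K1 + 1.0)) / (termFreq + K1 * lengthNorm);
    }

    public static void main(String[] args) {
        long docCount = 1000; // hypothetical number of documents in the index
        long docFreq = 10;    // hypothetical number of documents containing the term
        for (int tf : new int[] {1, 2, 5, 20, 100}) {
            System.out.printf(Locale.ROOT, "tf=%3d  classic=%.3f  bm25=%.3f%n",
                tf,
                classicTfIdf(tf, docFreq, docCount),
                bm25(tf, docFreq, docCount, 100.0, 100.0));
        }
    }
}
```

Running `main` shows the classic weight growing with the square root of the term frequency while the BM25 weight levels off. In Lucene itself, the model is normally switched by passing a `ClassicSimilarity` or `BM25Similarity` instance to `IndexWriterConfig.setSimilarity(...)` at index time and `IndexSearcher.setSimilarity(...)` at search time, using the same one in both places.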