Lucene.net high CPU usage while building index

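I wrote a program that uses Lucene.net to index a 3 GB text file. While the index is being built, the process's CPU usage climbs as high as 80% and its memory usage reaches about 1 GB. Is there a way to limit the CPU and memory usage? Below is the program I use to build the index: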

public void BuildIndex(string item)
        {
            System.Diagnostics.EventLog.WriteEntry("LuceneSearch", "Indexing Started for " + item);
            string indexPath = string.Format(BaseIndexPath, "20200414", item);
            if (System.IO.Directory.Exists(indexPath))
            {
                System.IO.Directory.Delete(indexPath, true);
            }


            LuceneIndexDirectory = FSDirectory.Open(indexPath);
            Writer = new IndexWriter(LuceneIndexDirectory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);


            Writer.SetRAMBufferSizeMB(500);

            string file = @"c:\LogFile.txt"; // verbatim string: "\L" is not a valid escape sequence
            string line=string.Empty;
            int count = 0;
            StreamReader fileReader = new StreamReader(file);
            while ((line = fileReader.ReadLine()) != null)
            {
                count++;
                Document doc = new Document();

                try
                {
                    doc.Add(new Field("LineNumber", count.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("LogTime", line.Substring(6, 12), Field.Store.YES, Field.Index.NOT_ANALYZED));
                    doc.Add(new Field("LineText", line.Substring(18, line.Length -18 ), Field.Store.YES, Field.Index.NOT_ANALYZED));
                    Writer.AddDocument(doc);
                }
                catch (Exception)
                {

                    System.Diagnostics.EventLog.WriteEntry("LuceneSearch", "Exception occurred while entering a line in the index");
                }

            }
            fileReader.Close(); // release the file handle once every line has been read
            System.Diagnostics.EventLog.WriteEntry("LuceneSearch", "Indexing finished for " + item + ". Starting Optimization now.");
            Writer.Optimize();
            Writer.Commit();

            Writer.Close();


            LuceneIndexDirectory.Dispose();

            System.Diagnostics.EventLog.WriteEntry("LuceneSearch", "Optimization finished for " + item );
        }

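Writing an index is typically done out of band from searching. That is, it is usually done during deployment or at application startup. There is also near-real-time search, which keeps one IndexWriter open and uses it both to write to and to search the same index, but in that case a typical application adds a few documents at a time; it does not build the entire index in one go.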

Generally speaking, using this much RAM is not a big deal as long as you build the index at the right point in the application lifecycle.

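However, you are calling Optimize() without any parameters, which rewrites the entire index after you have just created it. If the index you wrote spans more than one segment, calling Optimize() with no parameters merges all of it into a single segment.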

From the documentation:

Requests an "optimize" operation on an index, priming the index for the fastest available search. Traditionally this has meant merging all segments into a single segment as is done in the default merge policy, but individual merge policies may implement optimize in different ways.

It is recommended that this method be called upon completion of indexing. In environments with frequent updates, optimize is best done during low volume times, if at all.

See http://www.gossamer-threads.com/lists/lucene/java-dev/47895 for more discussion.

Note that optimize requires 2X the index size free space in your Directory (3X if you're using compound file format). For example, if your index size is 10 MB then you need 20 MB free for optimize to complete (30 MB if you're using compound file format).

If some but not all readers re-open while an optimize is underway, this will cause > 2X temporary space to be consumed as those new readers will then hold open the partially optimized segments at that time. It is best not to re-open readers while optimize is running.
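
The free-space rule quoted above is simple arithmetic, and it can be checked before deciding whether an optimize is even possible. The following is only a sketch: OptimizeSpaceCheck and its methods are hypothetical helper names, not part of Lucene.Net, and it assumes the index lives in a flat folder on a local drive (a UNC path would need a different free-space lookup).

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of Lucene.Net) applying the documented
// 2X / 3X free-space rule for Optimize().
public static class OptimizeSpaceCheck
{
    // Free bytes needed: 2X the index size, or 3X with the compound file format.
    public static long RequiredFreeBytes(long indexSizeBytes, bool compoundFileFormat)
    {
        int factor = compoundFileFormat ? 3 : 2;
        return checked(indexSizeBytes * factor);
    }

    // Sums the sizes of the files directly inside the index folder.
    public static long IndexSizeBytes(string indexPath)
    {
        return new DirectoryInfo(indexPath).GetFiles().Sum(f => f.Length);
    }

    // Compares the requirement with the free space on the drive holding the index.
    // Assumption: indexPath is on a local drive, so its root is a valid drive name.
    public static bool HasRoomForOptimize(string indexPath, bool compoundFileFormat)
    {
        long needed = RequiredFreeBytes(IndexSizeBytes(indexPath), compoundFileFormat);
        var drive = new DriveInfo(Path.GetPathRoot(Path.GetFullPath(indexPath)));
        return drive.AvailableFreeSpace >= needed;
    }
}
```

With the numbers from the documentation, RequiredFreeBytes for a 10 MB index gives 20 MB, or 30 MB when the compound file format is in use.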

Note that the Optimize() method was removed in Lucene 4.x (for good reason), so I recommend that you stop using it now.
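
For illustration, here is a rough sketch of how the indexing method could look with the Optimize() call dropped. It assumes Lucene.Net 3.0.x, which the API calls in the question suggest. LogIndexer, its parameters and the returned count are hypothetical names introduced for this example, and the 48 MB RAM buffer is an arbitrary illustrative value rather than a recommendation; a smaller buffer trades memory for more frequent flushes to disk.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical rewrite of the question's BuildIndex without Optimize().
public static class LogIndexer
{
    // Returns the number of lines that were added to the index.
    public static int BuildIndex(string logFile, string indexPath,
                                 Analyzer analyzer, double ramBufferMb = 48)
    {
        int lineNumber = 0, indexed = 0;

        // create: true starts a fresh index, replacing any existing one in that folder.
        using (var directory = FSDirectory.Open(new DirectoryInfo(indexPath)))
        using (var writer = new IndexWriter(directory, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        using (var reader = new StreamReader(logFile))
        {
            // Assumed value: caps the in-memory buffer well below the question's 500 MB.
            writer.SetRAMBufferSizeMB(ramBufferMb);

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lineNumber++;
                if (line.Length < 18) continue; // too short for the fixed-width layout

                var doc = new Document();
                doc.Add(new Field("LineNumber", lineNumber.ToString(),
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("LogTime", line.Substring(6, 12),
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("LineText", line.Substring(18),
                                  Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.AddDocument(doc);
                indexed++;
            }

            // No Optimize(): commit what was written and let the using blocks
            // dispose the writer and the directory.
            writer.Commit();
        }
        return indexed;
    }
}
```

Because every field is NOT_ANALYZED, the analyzer passed in is not applied to these fields; it is a parameter only because the question does not show how its own analyzer is created.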