Fastest way to record all DocIds and FileNames from dtSearch in SQL database

Question

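I am using dtSearch in combination with a SQL database and would like to maintain a table that contains every DocId and its associated file name. From there, I will add a column with a foreign key so that I can combine text searches with database searches.

I have code that simply returns every record in the index and adds them to the database one at a time. However, this takes forever, and it does not address how to append only the new records as they are added to the index. In case it helps, here it is: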

MyDatabaseContext db = new StateScapeEntities();
IndexJob ij = new dtSearch.Engine.IndexJob();

ij.IndexPath = @"d:\myindex";

IndexInfo indexInfo = dtSearch.Engine.IndexJob.GetIndexInfo(@"d:\myindex");

bool jobDone =   ij.Execute();

SearchResults sr = new SearchResults();

uint n = indexInfo.DocCount;

for (int i = 1; i <= n; i++)
{
    sr.AddDoc(ij.IndexPath, i, null);
}

for (int i = 1; i <= n; i++)
{
    sr.GetNthDoc(i - 1);
    //IndexDocument is defined elsewhere
    IndexDocument id = new IndexDocument();
    id.DocId = sr.CurrentItem.DocId;
    id.FilePath = sr.CurrentItem.Filename;

    if (id.FilePath != null)
    {
        db.IndexDocuments.Add(id);
        db.SaveChanges();
    }
}

Answer 1

To improve the speed, you can search for the word "xfirstword", which returns every document in the index.

You can also have a look at the FAQ "How to retrieve all documents in an index".

Answer 2

To keep the DocIds in the index stable, you must set the dtsIndexKeepExistingDocIds flag in the IndexJob.

For details on when DocIds change, see also the dtSearch Text Retrieval Engine Programmer's Reference:

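When a document is added to an index, it is assigned a DocId, and DocIds are always numbered sequentially.
When a document is reindexed, the old DocId is cancelled and a new DocId is assigned.
When an index is compressed, all DocIds in the index are renumbered to remove the cancelled DocIds, unless the dtsIndexKeepExistingDocIds flag is set in the IndexJob.
When an index is merged into another index, DocIds in the target index never change. The documents merged into the target index are all assigned new, sequentially numbered DocIds, unless (a) the dtsIndexKeepExistingDocIds flag is set in the IndexJob and (b) the indexes have non-overlapping ranges of DocIds.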
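
To make the consequence for a SQL mapping table concrete, here is a toy in-memory model of the first three rules. It is only a sketch and not the dtSearch API: ToyIndex, Add, Reindex and Compress are hypothetical names, and it assumes that renumbering after a compress restarts at 1.

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy model of the DocId rules quoted above. NOT the dtSearch API:
// every name here is hypothetical and exists only for illustration.
public sealed class ToyIndex
{
    // DocId -> file name for live documents; cancelled DocIds are simply absent.
    private readonly SortedDictionary<int, string> _docs =
        new SortedDictionary<int, string>();
    private int _lastDocId = 0;

    public IReadOnlyDictionary<int, string> Docs
    {
        get { return _docs; }
    }

    // Adding a document assigns the next sequential DocId.
    public int Add(string fileName)
    {
        _lastDocId++;
        _docs[_lastDocId] = fileName;
        return _lastDocId;
    }

    // Reindexing cancels the old DocId and assigns a new one.
    public int Reindex(int docId)
    {
        string fileName = _docs[docId];
        _docs.Remove(docId);
        return Add(fileName);
    }

    // Compressing renumbers the live documents (assumption: starting at 1)
    // unless keepExistingDocIds is true.
    public void Compress(bool keepExistingDocIds)
    {
        if (keepExistingDocIds)
            return;

        List<string> live = _docs.Values.ToList(); // ordered by old DocId
        _docs.Clear();
        _lastDocId = 0;
        foreach (string fileName in live)
            Add(fileName);
    }
}
```

In this model, adding a.txt and b.txt and then reindexing a.txt leaves live DocIds 2 (b.txt) and 3 (a.txt). Compressing without the keep flag renumbers them to 1 and 2, so SQL rows keyed on the old values would point at the wrong files.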

Answer 3

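So, I used part of user2172986's response, but combined it with some additional code to get a solution to my question. I did indeed have to set the dtsIndexKeepExistingDocIds flag in my index update routine. From there, I only wanted to add the newly created DocIds to my SQL database. For that, I used the following code: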

string indexPath = @"d:\myindex"; 

        using (IndexJob ij = new dtSearch.Engine.IndexJob())
        {
            //make sure the updated index doesn't change DocIds
            ij.IndexingFlags = IndexingFlags.dtsIndexKeepExistingDocIds;
            ij.IndexPath = indexPath;
            ij.ActionAdd = true;
            ij.FoldersToIndex.Add( indexPath + "<+>");
            ij.IncludeFilters.Add( "*");
            bool jobDone = ij.Execute();
        }
        //create a DataTable to hold results
        DataTable newIndexDoc = MakeTempIndexDocTable(); //this is a custom method not included in this example; just creates a DataTable with the appropriate columns

        //connect to the DB;
        MyDataBase db = new MyDataBase(); //again, custom code not included - link to EntityFramework entity

        //get the last DocId in the DB (0 if the table is still empty)
        int lastDbDocId = db.IndexDocuments.Max(i => (int?)i.DocId) ?? 0;

        //get the last DocId in the Index
        IndexInfo indexInfo = dtSearch.Engine.IndexJob.GetIndexInfo(indexPath);

        uint latestIndexDocId = indexInfo.LastDocId;

        //create a searchFilter
        dtSearch.Engine.SearchFilter sf = new SearchFilter();

        int indexId = sf.AddIndex(indexPath);


        //only select new records (from one greater than the last DocId in the DB to the last DocId in the index itself)
        sf.SelectItems(indexId, lastDbDocId + 1, (int)latestIndexDocId, true);

        using (SearchJob sj = new dtSearch.Engine.SearchJob())
        {
           sj.SetFilter(sf);
           //return every document in the specified range (using xfirstword)
           sj.Request = "xfirstword";
           // Specify the path to the index to search here
           sj.IndexesToSearch.Add(indexPath);


          //additional flags and limits redacted for clarity

           sj.Execute();

           // Store the error message in the status
           //redacted for clarity



           SearchResults results = sj.Results;
           int startIdx = 0;
           int endIdx = results.Count;
           if (startIdx==endIdx)
               return;


           for (int i = startIdx; i < endIdx; i++)
           {
               results.GetNthDoc(i);

               IndexDocument id = new IndexDocument();
               id.DocId = results.CurrentItem.DocId;
               id.FileName= results.CurrentItem.Filename;

               if (id.FileName!= null)
               {

                   DataRow row = newIndexDoc.NewRow();

                   row["DocId"] = id.DocId;
                   row["FileName"] = id.FileName;

                   newIndexDoc.Rows.Add(row);
               }


           }

           newIndexDoc.AcceptChanges();

           //SqlBulkCopy
           using (SqlConnection connection =
                  new SqlConnection(db.Database.Connection.ConnectionString))
           {
               connection.Open();

               using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
               {
                   bulkCopy.DestinationTableName =
                       "dbo.IndexDocument";

                   try
                   {
                       // Write from the source to the destination.
                       bulkCopy.WriteToServer(newIndexDoc);
                   }
                   catch (Exception ex)
                   {
                       Console.WriteLine(ex.Message);
                   }
               }
           }

           newIndexDoc.Clear();
           db.UpdateIndexDocument();
        }
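
The range handed to the search filter above is the part most likely to go wrong on an empty table or when nothing new has been indexed. The following is a minimal sketch of that calculation on its own, without any dtSearch or Entity Framework dependency. DocIdSync, TryGetMissingRange and the firstIndexDocId parameter are hypothetical names, and the sketch assumes that DocIds start at 1 unless told otherwise.

```csharp
using System;

// Hypothetical helper, not part of dtSearch: computes which DocIds still
// have to be copied from the index into the SQL table.
public static class DocIdSync
{
    // lastDbDocId     : highest DocId already stored, or null if the table is empty.
    // lastIndexDocId  : highest DocId in the index (e.g. from IndexInfo.LastDocId).
    // firstIndexDocId : lowest DocId the index can contain (assumed to be 1).
    // Returns false when the database is already up to date.
    public static bool TryGetMissingRange(
        int? lastDbDocId,
        uint lastIndexDocId,
        out int first,
        out int last,
        int firstIndexDocId = 1)
    {
        // checked: throw instead of silently wrapping if the value exceeds int.MaxValue
        last = checked((int)lastIndexDocId);
        first = lastDbDocId.HasValue ? lastDbDocId.Value + 1 : firstIndexDocId;
        return first <= last;
    }
}
```

With an empty table and a last index DocId of 5 this yields the range 1 to 5; with 3 already stored it yields 4 to 5; with 5 already stored it returns false, so the search and the bulk copy can be skipped.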

Answer 4

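Here is my new solution, which uses the AddDoc method of the SearchResults interface:

First get StartingDocID and LastDocID from IndexInfo, then loop over that range, calling a function like this for each DocId: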

function GetFilename(paDocID: Integer): String;
var
  lCOMSearchResults:       ISearchResults;
  lSearchResults_Count:    Integer;
begin
  // Default to an empty string when the DocId cannot be resolved
  Result := '';
  if Assigned(prCOMServer) then
  begin
    // Build a SearchResults object containing only the requested document
    lCOMSearchResults := prCOMServer.NewSearchResults as ISearchResults;
    lCOMSearchResults.AddDoc(GetIndexPath(prIndexContent), paDocID, 0);
    lSearchResults_Count := lCOMSearchResults.Count;

    if lSearchResults_Count = 1 then
    begin
      lCOMSearchResults.GetNthDoc(0);
      Result := lCOMSearchResults.DocDetailItem['_Filename'];
    end;
  end;
end;

sql-server

dtsearch