Avoid duplicate documents in Solr

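While indexing database documents with SolrJ, I found duplicate documents in Solr (5.2.1). I want to avoid duplicates and overwrite documents based on the "id" field. From my searching, "dedupe" appears to be the relevant feature, so I added it to solrconfig.xml, but unfortunately it had no effect.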

If two documents have the same "id", the latest one should overwrite the earlier one. For example:
   "id" = 750000 "title" = here I am
   "id" = 750000 "title" = here you are
Hence, the final result would be "id" = 750000 "title" = here you are
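
The intended behaviour is last-write-wins per "id". A minimal sketch in plain Java, using a map as a stand-in for the index (this is not Solr's API; the class and method names are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OverwriteByKey {
    // Simplified stand-in for an index keyed by id (hypothetical helper):
    // adding a document whose id already exists replaces the earlier one.
    static Map<String, String> index(String[][] docs) {
        Map<String, String> byId = new LinkedHashMap<>();
        for (String[] doc : docs) {
            byId.put(doc[0], doc[1]); // doc[0] = id, doc[1] = title
        }
        return byId;
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"750000", "here I am"},
            {"750000", "here you are"},
        };
        // Only one entry remains for id 750000, holding the later title.
        System.out.println(index(docs));
    }
}
```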

    // relevant part of my schema.xml

    <field name="id" type="long" indexed="true" stored="true" required="true"/>
    <field name="title" type="string" indexed="true" stored="true" required="true" />
    <field name="unique_id" type="string" multiValued="false" indexed="true" required="false" stored="true" />
    <uniqueKey>unique_id</uniqueKey>

    // relevant part of my solrconfig.xml

    <updateRequestProcessorChain name="dedupe">

      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">id</str>
        <bool name="overwriteDupes">true</bool>
        <str name="fields">id</str>
        <str name="signatureClass">solr.processor.TextProfileSignature</str>
      </processor>

      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

I would appreciate your advice.

Below is the core part of my indexing program with SolrJ (edited on 2015.05.08):

    SolrClient solr = new HttpSolrClient(urlArray[i]); // localhost:8983/solr/#/core_name[i]
    String id;
    SolrInputDocument doc = new SolrInputDocument();
    UpdateResponse response;
    String[] array;

    // Iterate over the DB rows (id, title, description, ...)
    for (Map.Entry<String,Object> entry : list.get(i).entrySet()) {

        array = String.valueOf(entry.getValue()).split(","); // split the DB values on ","
        id = entry.getKey();
        doc.addField("id", entry.getKey()); // unique id
        doc.addField("title", array[1]);

        doc.addField("link", array[2]);
        doc.addField("description", array[3]);
        response = solr.add(doc);

        doc.clear();

    }

    solr.commit();
    solr.close();

Make sure you change your update handlers (the ones you use from SolrJ) to use the chain you defined (in your case, "dedupe"):

<requestHandler name="/update" class="solr.UpdateRequestHandler" >
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
...
</requestHandler>
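
The chain can also be selected per request by passing `update.chain` as a request parameter, rather than relying only on the handler default. A minimal sketch in plain Java that only builds such a URL; the class and method names are made up for illustration, and the host and core name are placeholders taken from the comment in the question:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UpdateChainUrl {
    // Hypothetical helper: builds an update URL that names the
    // processor chain explicitly through the update.chain parameter.
    static String updateUrl(String baseUrl, String core, String chain) {
        try {
            return baseUrl + "/" + core + "/update?update.chain="
                    + URLEncoder.encode(chain, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder host and core name; adjust to your installation.
        System.out.println(updateUrl("http://localhost:8983/solr", "core_name", "dedupe"));
    }
}
```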

See https://cwiki.apache.org/confluence/display/solr/De-Duplication for details.