How does SOLR Cell add document content?

Question

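SOLR has a module called Cell. It uses Tika to extract content from documents and indexes it with SOLR.

From the source at https://github.com/apache/lucene-solr/tree/master/solr/contrib/extraction, I concluded that Cell puts the raw extracted document text into a field named "content". The field is indexed by SOLR but not stored. When you query for documents, "content" does not show up.

My SOLR instance has no explicit schema (I left the default schema in place).

I'm trying to get similar behavior with the default UpdateRequestHandler (POST to /solr/corename/update). The POST request: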

<add commitWithin="60000">
    <doc>
        <field name="content">lorem ipsum</field>
        <field name="id">123456</field>
        <field name="someotherfield_i">17</field>
    </doc>
</add>

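When a document is added this way, the content field is both indexed and stored. It shows up in query results. I don't want that; it's a waste of space.

What am I missing about the way Cell adds documents?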

Answer 1

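If you don't want the field's content to be stored, you have to define the field with stored="false".

Since you're using schemaless mode (there is still a schema; it is just generated dynamically as new fields are added), you'll have to use the Schema API to change the field.

You can do this by issuing a replace-field command: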

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field":{
  "name":"content",
  "type":"text",
  "stored":false }
}' http://localhost:8983/solr/collection/schema

You can see the defined fields by issuing a request to /collection/schema/fields.
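For anyone scripting this rather than using curl, here is a minimal Python sketch of the same two Schema API calls. The helper names (build_replace_field_request, list_fields) are hypothetical, and the host, collection name and the "text" field type are simply carried over from the curl example above; adjust them to your setup.

```python
import json
import urllib.request


def build_replace_field_request(base_url, collection, field_name, field_type, stored):
    """Build (but do not send) a Schema API replace-field request.

    Hypothetical helper: it only mirrors the curl command above.
    """
    payload = {
        "replace-field": {
            "name": field_name,
            "type": field_type,
            "stored": stored,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/solr/{collection}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


def list_fields(base_url, collection):
    """Fetch the field definitions from /schema/fields (needs a running Solr)."""
    url = f"{base_url}/solr/{collection}/schema/fields"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["fields"]
```

To apply the change, pass the built request to urllib.request.urlopen(), then call list_fields("http://localhost:8983", "collection") and check that the "content" entry now reports stored as false.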

Answer 2

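The Cell code does add the content to the document as content, but there is a built-in field mapping rule that replaces content with _text_. In schemaless SOLR, _text_ is marked as not stored.

The rule is invoked by the following line in SolrContentHandler.addField():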

String name = findMappedName(fname);

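In the params object, there is a rule, fmap.content, saying that content should be treated as _text_. It comes from corename\conf\solrconfig.xml, which by default contains the following fragment: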

<requestHandler name="/update/extract"
              startup="lazy"
              class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str> <!-- This one! -->
  </lst>
</requestHandler>
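The effect of the fmap.* defaults can be modeled in a few lines of Python. This is a simplified illustration of the lookup, not the actual Solr source (which is Java); the function name find_mapped_name and the plain dict standing in for the params object are assumptions made for the sketch.

```python
def find_mapped_name(params, field_name):
    """Return the target field name, honoring an fmap.<name> rule if present.

    Simplified model: if params holds "fmap.<field_name>", use its value;
    otherwise the field keeps its original name.
    """
    return params.get("fmap." + field_name, field_name)


# The defaults from the solrconfig.xml fragment above, as a plain dict.
defaults = {
    "lowernames": "true",
    "fmap.meta": "ignored_",
    "fmap.content": "_text_",
}

mapped_content = find_mapped_name(defaults, "content")  # mapped by fmap.content
mapped_title = find_mapped_name(defaults, "title")      # no rule, name unchanged
```

With these defaults, content is renamed to _text_, while a field with no fmap rule (such as the hypothetical title here) keeps its name.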

Meanwhile, corename\conf\managed_schema contains this line:

<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>

That's the whole story.
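One practical consequence, sketched under the assumption that your managed_schema still has the default _text_ definition shown above: posting the text under _text_ instead of content through the plain update handler should give the same indexed-but-not-stored behavior. The helper build_add_xml is hypothetical; it only builds the payload, and the field values are the ones from the question.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields, commit_within_ms=60000):
    """Build an <add> update message like the one in the question."""
    add = ET.Element("add", {"commitWithin": str(commit_within_ms)})
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Same document as in the question, but the text goes into _text_
# (assumed to be indexed and not stored) rather than content.
payload = build_add_xml({
    "_text_": "lorem ipsum",
    "id": "123456",
    "someotherfield_i": 17,
})
```

The resulting payload can be POSTed to /solr/corename/update with Content-type text/xml, exactly like the original request.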

solr

solr-cell