Indexing rich documents using Solr Cell and Tika

Question

I am new to Solr search and am currently trying to get Solr Cell working with Tika. Consider the following text file:

Name:                    Popeye
Nationality:             American

I would like Solr to return two fields named 'name' and 'nationality', with the values popeye and american. To that end, I defined the two fields in schema.xml as

   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="nationality" type="text_general" indexed="true" stored="true"/>

The text_general field type is defined as

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- in this example, we will only use synonyms at query time
                 <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

In solrconfig.xml, I defined the /update/extract request handler

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
    </lst>
</requestHandler>

Finally, I indexed the document with the command

curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update/extract?literal.id=doc1&commit=true' -F "myfile=@/tmp/popeye_bio.txt"

The document is indexed correctly. When I query it with

curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/select?q=*%3A*&wt=json&indent=true'

the output I get is

    {
    "responseHeader":{
    "status":0,
    "QTime":3,
    "params":{
      "indent":"true",
      "q":"*:*",
      "wt":"json"}},
      "response":{"numFound":1,"start":0,"docs":[
      {
        "attr_meta":["stream_source_info",
          "myfile",
          "stream_content_type",
          "text/plain",
          "stream_size",
          "206",
          "Content-Encoding",
          "windows-1252",
          "stream_name",
          "popeye_bio.txt",
          "Content-Type",
          "text/plain; charset=windows-1252"],
        "id":"doc1",
        "attr_stream_source_info":["myfile"],
        "attr_stream_content_type":["text/plain"],
        "attr_stream_size":["206"],
        "attr_content_encoding":["windows-1252"],
        "attr_stream_name":["popeye_bio.txt"],
        "attr_content_type":["text/plain; charset=windows-1252"],
        "attr_content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n  Name:                    Popeye\r\nNationality:             American\r\n \n  "],
        "_version_":1567726521681969152}]
  }}

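As you can see, popeye and american were not indexed into the fields I defined in schema.xml. What am I doing wrong here? I tried changing the tokenizer in the text_general field type to <tokenizer class="solr.PatternTokenizerFactory" pattern=": "/>, but it made no difference. Any help with this would be greatly appreciated!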

Answer 1

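When you define a tokenizer, you are only telling Solr that all data sent to that field should be tokenized/processed according to your configuration; in the end, you are still sending all of the information to one field.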

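Solr assumes your data is already structured (one document containing fields), so an analyzer/tokenizer cannot create additional fields. All an analyzer/tokenizer does is tokenize and transform the text that goes into the inverted index so that it can be searched.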
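
To make the distinction concrete, here is a rough sketch of what an analyzer chain does to a single field value. It is not Solr code: `analyze` is a hypothetical stand-in that only mimics splitting on a pattern and lowercasing, and the plain `doc` object stands in for an indexed document.

```javascript
// Hypothetical stand-in for an analyzer chain applied to ONE field value.
// It only mimics "split on a pattern, trim, lowercase"; it is not Solr code.
function analyze(fieldValue, pattern) {
    var parts = fieldValue.split(pattern);
    var tokens = [];
    for (var i = 0; i < parts.length; i++) {
        var token = parts[i].replace(/^\s+|\s+$/g, "").toLowerCase();
        if (token.length > 0) {
            tokens.push(token);
        }
    }
    return tokens;
}

// Plain object standing in for the indexed document
var doc = { attr_content: "Name:                    Popeye" };

// tokens is ["name", "popeye"]: two search terms inside attr_content
var tokens = analyze(doc.attr_content, ": ");

// The document itself still has exactly one field, attr_content
var fieldNames = Object.keys(doc);
```

However the pattern is chosen, the result is a list of terms for the same field. This is why switching to a PatternTokenizerFactory did not produce 'name' and 'nationality' fields.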

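What you can do is use a ScriptUpdateProcessor and define a pipeline that modifies the document before the text reaches the tokenizer (splitting one field into several). Something like: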

<processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">splitField.js</str>
</processor>

and the splitField.js file could contain something like:

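function processAdd(cmd) {
    var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
    // attr_content may be multi-valued, so read its first value as a JS string
    var content = String(doc.getFirstValue("attr_content"));

    // split the attr_content text into two variables, name and nationality
    // (assumes "Name: ..." and "Nationality: ..." lines as in the question)
    var name = /^[ \t]*Name:[ \t]*(.+?)[ \t]*$/m.exec(content);
    var nationality = /^[ \t]*Nationality:[ \t]*(.+?)[ \t]*$/m.exec(content);

    // only set a field when its line was actually found
    if (name) doc.setField("name", name[1]);
    if (nationality) doc.setField("nationality", nationality[1]);
}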

Ideally this would be handled outside Solr, but with the ScriptUpdateProcessor you can achieve what you want.
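
For the "outside Solr" route, a minimal sketch might look like the following. It assumes the file really is a list of "Key: value" lines, as in the question. `buildSolrDoc` is a hypothetical client-side helper, not part of Solr or Tika, and the sample text is copied from the question.

```javascript
// Hypothetical client-side helper (not part of Solr): parse "Key: value"
// lines into a plain object that can be sent to Solr as a JSON document.
function buildSolrDoc(id, text) {
    var doc = { id: id };
    var lines = String(text).split(/\r\n|\r|\n/);
    for (var i = 0; i < lines.length; i++) {
        // key = leading word (lowercased), value = rest of the line, trimmed
        var match = /^\s*(\w+)\s*:\s*(.*\S)\s*$/.exec(lines[i]);
        if (match) {
            doc[match[1].toLowerCase()] = match[2];
        }
    }
    return doc;
}

// Sample content taken from the question's text file
var sample = "Name:                    Popeye\r\nNationality:             American\r\n";

// Solr's JSON update format accepts an array of documents
var payload = JSON.stringify([buildSolrDoc("doc1", sample)]);
```

The resulting `payload` could then be POSTed to the collection's /update handler with Content-Type application/json, so the extraction handler is not involved at all. The values keep their original case here; with text_general they are lowercased at index time for matching, while the stored value is returned unchanged.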

Answer 2

My current approach is to define an 'update.chain' in the /update/extract handler

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
    <str name="update.chain">mychain</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
    </lst>
</requestHandler>

where mychain is

<updateRequestProcessorChain name="mychain">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">splitField.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

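I included it in the /update/extract handler so that the processor gets called. If I understand correctly, the processor should run after /update/extract and before the text is sent to the tokenizer. If so, how does the processor get invoked?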

I also tried removing the <str name="update.chain">mychain</str> line from /update/extract and then calling

curl 'http://localhost:8983/solr/popeye_bio_collection_shard1_replica1/update/extract?literal.id=doc1&update.chain=mychain&commit=true' -F "myfile=@/tmp/popeye_bio.txt"

I got the same error. splitField.js is defined as

function processAdd(cmd) {
    doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
    field = doc.getFieldValue("attr_content");
    // split your attr_content text into two variables:
    // name and nationality, then
    doc.setField("name", name);
    doc.setField("nationality", nationality);
}

function processDelete(cmd) {
}

function processMergeIndexes(cmd) {
}

function processCommit(cmd) {
}

function processRollback(cmd) {
}

function finish() {
}

The error occurs on the setField line. Is there any way to print "field" to the console? Perhaps something like a "console.log" method?
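
On the printing question, a hedged sketch: scripts run by StatelessScriptUpdateProcessorFactory are given a `logger` variable (an SLF4J logger) rather than a browser-style `console`, so the snippet below assumes `logger` is in scope the way Solr provides it. The message prefixes are made up for illustration. Separately, the error on the setField line is most likely because `name` and `nationality` are never assigned in the script as posted, so JavaScript has no value to pass.

```javascript
// Debugging sketch for splitField.js. `logger` is assumed to be the SLF4J
// logger that Solr exposes to update-processor scripts.
function processAdd(cmd) {
    var doc = cmd.solrDoc; // org.apache.solr.common.SolrInputDocument
    var field = doc.getFieldValue("attr_content");

    // Goes to the Solr log file rather than to the HTTP response
    logger.info("splitField.js: attr_content = " + field);

    // Log which field names reached the script; useful when uprefix or
    // fmap.* settings rename things on the way in
    logger.info("splitField.js: fields = " + doc.getFieldNames());
}
```

The messages appear in the Solr log, not in the curl output, so the log has to be checked after re-running the extract request.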
