How to index a pdf file using Elasticsearch ingest-attachment plugin?

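I have to implement full-text search over PDF documents using the Elasticsearch ingest-attachment plugin. When I search for the word someword in the PDF document, I get an empty hits array.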

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for indexing the document

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data": {
                "query": "someword"
            }
        }
    }
}

When you index your document with the second command, passing the Base64-encoded content, the stored document looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

So your query needs to search the attachment.content field rather than the data field (which only serves to send the raw content at indexing time).
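As a quick sanity check of what the data field actually carries, the Base64 payload can be decoded locally. This is only an illustrative sketch using the Python standard library; the variable names are made up for the example. It shows that the payload is a small RTF snippet containing "Lorem ipsum dolor sit amet", so the word someword would not be found in it in any case.

```python
import base64

# Base64 payload copied from the "data" field of the indexed document.
DATA = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode to the raw bytes that the attachment processor receives.
raw = base64.b64decode(DATA)

# The payload starts with an RTF header, which matches the
# "application/rtf" content type reported in the stored document.
print(raw.startswith(b"{\\rtf1"))   # True

# The extracted text is present, the searched word is not.
print(b"Lorem ipsum dolor sit amet" in raw)  # True
print(b"someword" in raw)                    # False
```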

Change your query to this and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {
            "query": "lorem"
         }
      }
   }
}

PS: Use POST rather than GET when the request carries a body, since some HTTP clients do not send a body with GET.
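Note that the sample payload is RTF, not the PDF named in filename. To index the real PDF, its bytes have to be Base64-encoded into the data field. Below is a minimal sketch of that step, assuming Python and only the standard library; build_attachment_doc and to_request_body are hypothetical helper names, not part of any Elasticsearch client, and the field names simply mirror the request shown in the question.

```python
import base64
import json
from pathlib import Path


def build_attachment_doc(file_path, title):
    """Build a document body whose "data" field is the Base64-encoded file.

    Hypothetical helper for illustration; field names mirror the question.
    """
    raw = Path(file_path).read_bytes()
    return {
        "filename": str(file_path),
        "title": title,
        # The attachment processor expects Base64 text, not raw bytes.
        "data": base64.b64encode(raw).decode("ascii"),
    }


def to_request_body(doc):
    """Serialize the document to JSON.

    json.dumps escapes backslashes, so Windows paths stay valid JSON.
    """
    return json.dumps(doc)
```

The resulting string can be sent as the body of the PUT my_index/my_type/my_id?pipeline=attachment request with whatever HTTP client you already use.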