Unable to see whole pdf content after indexing pdf file in ES

Question

Below is my code for indexing a PDF from a URL in Elasticsearch:

import base64

import requests
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Ingest pipeline that runs the attachment processor on the "data" field
body = {
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "data"
            }
        }
    ]
}
# Register the pipeline under the id "attachment"
es.index(index='_ingest', doc_type='pipeline', id='attachment', body=body)

# Download the PDF and base64-encode it for the attachment processor
url = 'https://pubs.vmware.com/nsx-63/topic/com.vmware.ICbase/PDF/nsx_63_cross_vc_install.pdf'
response = requests.get(url)
data = base64.b64encode(response.content).decode('ascii')

# Index the document through the pipeline
result2 = es.index(index='my_index', doc_type='my_type', pipeline='attachment',
                   body={'data': data})

# Fetch the document back without the raw base64 payload
doc = es.get(index='my_index', doc_type='my_type', id=result2['_id'],
             _source_exclude=['data'])
print(doc['_source']['attachment']['content'])

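The last line prints the content of the PDF file, but only 63 of its 126 pages. Do I need to change a setting somewhere? I have already tried increasing the console output limit.

Please advise.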

Answer 1

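The attachment processor extracts at most 100000 characters by default. You can change this limit by setting indexed_chars in the pipeline definition; a value of -1 removes the limit.

See https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html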
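
As a rough sketch of what that change might look like for the pipeline in the question: the helper names build_attachment_pipeline and register_pipeline are made up for illustration, and the es.ingest.put_pipeline call assumes a 5.x to 7.x elasticsearch-py client (matching the doc_type usage in the question). Only the indexed_chars option comes from the plugin documentation.

```python
def build_attachment_pipeline(indexed_chars=-1):
    """Return an ingest pipeline body for the attachment processor.

    indexed_chars caps how many characters are extracted per document;
    -1 means no limit (the plugin default is 100000).
    """
    return {
        "description": "Extract attachment information",
        "processors": [
            {
                "attachment": {
                    "field": "data",
                    "indexed_chars": indexed_chars,
                }
            }
        ],
    }


def register_pipeline(es, pipeline_id="attachment"):
    # Hypothetical helper: es is expected to be an elasticsearch.Elasticsearch
    # client (assumed 5.x-7.x). This replaces any existing pipeline with the
    # same id.
    return es.ingest.put_pipeline(id=pipeline_id,
                                  body=build_attachment_pipeline())
```

After re-registering the pipeline, the PDF would need to be indexed again, since documents already stored keep the truncated content. Removing the limit entirely can produce very large fields, so a finite value such as build_attachment_pipeline(1000000) may be a safer choice for big files.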

python

elasticsearch

elasticsearch-plugin