How to index a pdf file using Elasticsearch ingest-attachment plugin?
I have to implement full-text search over a PDF document using the Elasticsearch ingest plugin. When I search for the word someword in the PDF document, I get an empty hits array.
//Code for creating pipeline
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
      }
    }
  ]
}
//Code for indexing the document
PUT my_index/my_type/my_id?pipeline=attachment
{
  "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
  "title" : "Quick",
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
//Code for searching the word in pdf
GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "data" : {
        "query" : "someword"
      }
    }
  }
}
When you index your document with the second command by passing the Base64-encoded content, the stored document looks like this:
{
  "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "attachment": {
    "content_type": "application/rtf",
    "language": "ro",
    "content": "Lorem ipsum dolor sit amet",
    "content_length": 28
  },
  "title": "Quick"
}
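To see why the extracted content is "Lorem ipsum dolor sit amet", the `data` value from the request can be decoded locally. The following is a minimal sketch using only the Python standard library; it is not part of the original commands, and the variable names are illustrative.

```python
import base64

# The "data" value copied from the indexing request above.
data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode the Base64 payload back to the raw bytes that were sent.
raw = base64.b64decode(data)
text = raw.decode("ascii")

# repr() makes the RTF control words and line breaks visible.
print(repr(text))

# The payload starts with an RTF header, not the "%PDF" marker of a PDF file.
is_rtf = raw.startswith(b"{\\rtf")
is_pdf = raw.startswith(b"%PDF")

# The sample text contains "lorem" but never "someword".
contains_lorem = "lorem" in text.lower()
contains_someword = "someword" in text.lower()
```

The decoded payload is a small RTF sample rather than the PDF named in `filename`, which matches the `application/rtf` content type reported above, and the word someword does not occur in it.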
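So your query needs to look at the attachment.content field rather than the data field (which is only used to send the raw content at indexing time).
Modify your query to this and it will work: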
//Changed: query attachment.content instead of data
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "lorem"
      }
    }
  }
}
PS: Use POST rather than GET when sending a payload.
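If the goal is to index an actual PDF, the `data` field has to carry the Base64 encoding of the file's bytes. A minimal sketch of building such a request body in Python follows; the helper name `build_attachment_doc` and the stand-in bytes are hypothetical, and sending the body to Elasticsearch is left out.

```python
import base64
import json


def build_attachment_doc(file_bytes: bytes, filename: str, title: str) -> str:
    """Return a JSON body whose "data" field holds the Base64-encoded file."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    doc = {"filename": filename, "title": title, "data": encoded}
    # json.dumps escapes the backslashes of a Windows path, so the body is valid JSON.
    return json.dumps(doc, indent=2)


# Hypothetical stand-in bytes; for a real file, read it in binary mode instead,
# e.g. with open(path, "rb") as f: file_bytes = f.read()
sample_bytes = b"%PDF-1.4 example bytes"
body = build_attachment_doc(
    sample_bytes, "C:\\Users\\myname\\Desktop\\bh1.pdf", "Quick"
)
print(body)
```

The resulting string can be used as the body of the PUT request that goes through the attachment pipeline.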