ES dense_vector field: 'dims' must be specified

I have an ElasticSearch (v7.5.1) index with a dense_vector field called lda that has 150 dimensions. The mapping, as shown by http://localhost:9200/documents/_mapping, looks like this:

"documents": {
  "mappings": {
    [...]
    "lda": {
      "type":"dense_vector",
      "dims":150
    }
  }
}

When I try to index a document through the Elasticsearch Client for Python (v7.1.0), ES throws this error message:

{"type": "server", "timestamp": "2020-01-03T08:40:04,962Z", "level": "DEBUG", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "docker-cluster", "node.name": "8d468383f2cf", "message": "[documents][0] failed to execute bulk item
 (create) index {[documents][document][S_uPam8BUsDzizMKxpRR], source[{\"id\":42129,[...],\
"lda\":[0.031139032915234566,0.02878846414387226,0.026767859235405922,0.025012295693159103,0.02347283624112606,0.022111890837550163,0.02090011164546013,0.019814245402812958,0.0188356414437294,0.01794915273785591,0.01714235544204712,0.01640496961772442,0.015728404745459557,0.
015105433762073517,0.014529934152960777,0.013996675610542297,0.013501172885298729,0.013039554469287395,0.012608458288013935,0.012204954400658607,0.011826476082205772,0.011470765806734562,0.011135827749967575,0.010819895192980766,0.01052139326930046,0.010238921269774437,0.0,0
.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]}]}", "cluster.uuid": "7irLdTC_S7eXwYcVFolppQ", "node.id":
"M_fMZ3KxQnWP3AiguV1_jA" , 
"stacktrace": ["org.elasticsearch.index.mapper.MapperParsingException: The [dims] property must be specified for field [lda].",
"at org.elasticsearch.xpack.vectors.mapper.DenseVectorFieldMapper$TypeParser.parse(DenseVectorFieldMapper.java:104) ~[?:?]",                                                                                                                        
"at org.elasticsearch.index.mapper.DocumentParser.createBuilderFromFieldType(DocumentParser.java:680) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                            
"at org.elasticsearch.index.mapper.DocumentParser.parseDynamicValue(DocumentParser.java:826) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                     
"at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:619) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                            
"at org.elasticsearch.index.mapper.DocumentParser.parseNonDynamicArray(DocumentParser.java:601) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                  
"at org.elasticsearch.index.mapper.DocumentParser.parseArray(DocumentParser.java:560) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                            
"at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:420) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                      
"at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:395) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                   
"at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:112) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                 
"at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:71) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                          
"at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:267) ~[elasticsearch-7.5.1.jar:7.5.1]",                                                                                                                                 
"at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:791) ~[elasticsearch-7.5.1.jar:7.5.1]",
"at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:768) ~[elasticsearch-7.5.1.jar:7.5.1]",
"at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:740) ~[elasticsearch-7.5.1.jar:7.5.1]",
[...]

This is how the document is added to the index programmatically:

es = Elasticsearch(hosts="localhost:9200")
es.index(index=self.index, doc_type=doc_type, body=document_data)

where document_data is a dict holding the data shown in the error log above, including:

{
  [...]
  "lda": [0.031139032915234566, ...]
}

The index was created just beforehand, so it does not contain any documents yet. I noticed this output when I created the index:

{"type": "server", "timestamp": "2020-01-03T08:40:03,280Z", "level": "INFO", "component": "o.e.c.m.MetaDataCreateIndexService", "cluster.name": "docker-cluster", "node.name": "8d468383f2cf", "message": "[documents] creating index, cause [api], templates [], shards [1]/[1], mappings [_doc]", "cluster.uuid": "7irLdTC_S7eXwYcVFolppQ", "node.id": "M_fMZ3KxQnWP3AiguV1_jA"  }
{"type": "deprecation", "timestamp": "2020-01-03T08:40:04,940Z", "level": "WARN", "component": "o.e.d.r.a.d.RestDeleteAction", "cluster.name": "docker-cluster", "node.name": "8d468383f2cf", "message": "[types removal] Specifying types in document index requests is deprecated, use the typeless endpoints instead (/{index}/_doc/{id}, /{index}/_doc, or /{index}/_create/{id}).", "cluster.uuid": "7irLdTC_S7eXwYcVFolppQ", "node.id": "M_fMZ3KxQnWP3AiguV1_jA"  }

The index is created like this:

    es = Elasticsearch(hosts="localhost:9200", serializer=BSONEncoder())
    es.indices.create(index="documents", body=mapping)

where mapping is the dict defining the mapping, as shown in the output above:

mapping = {
  "mappings": {
    "properties": {
      [...],
      "lda": {
          "type": "dense_vector",
          "dims": 150
      },
    }
  }
}

Update: I suspect the mappings are indeed the problem. Indexing a document without the lda field fails as well:

RequestError: RequestError(400, 'illegal_argument_exception', 'Rejecting mapping update to [documents] as the final mapping would have mo

So I edited the mapping to include the index name:

  "mappings": {
    "document": {    
      [...]
      "lda": {
        "type":"dense_vector",
        "dims":150
      }
    }
  }
} 

This results in an empty mapping, although the field types are then inferred when a document is indexed.

--- End of update ---

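I don't know where to start debugging. The deprecation warning logged when the index is created looks like it could be related, but I'm not sure how to resolve it. Besides, the error message doesn't really suggest that this is the problem.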

The documentation for the dense_vector type doesn't reveal many details. However, the example shown there does work (using cURL requests).

Is there a functional difference between creating the index via Python and via cURL?

How can I find out what the real error message is? The dimensions are clearly defined through the dims property.

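You are using a doc_type. Mapping types are deprecated in ES 7.x and custom type names are no longer supported (see the Elasticsearch documentation on the removal of mapping types). The deprecation warning logged right after you created the index says the same: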

[types removal] Specifying types in document index requests is deprecated, use the typeless endpoints

But you are still setting a doc_type, here in the index call:

es.index(index=self.index, doc_type=doc_type, body=document_data)

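Starting with version 7, the only doc_type you can set is _doc, but you are trying to set your own: document. This produces an error, and Elasticsearch rejects the mapping update: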

RequestError: RequestError(400, 'illegal_argument_exception', 'Rejecting mapping update to [documents] as the final mapping would have more ......

(The message is cut off in your post; my reading of the rest is "... than one doc_type: _doc, document".)
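This also bears on the Python versus cURL question: the two requests do not hit the same endpoint. As a rough illustration, the sketch below builds the URL path a document index request ends up with. `endpoint_for` is a hypothetical helper written for this answer, not part of the Elasticsearch client, and it only approximates how the 7.x client assembles the path.

```python
def endpoint_for(index, doc_type=None, doc_id=None):
    """Illustrative only: approximate URL path of a document index request.

    Hypothetical helper, not a function of the Elasticsearch client.
    """
    # The typeless endpoints in ES 7.x use the fixed segment "_doc".
    type_segment = doc_type if doc_type is not None else "_doc"
    parts = [index, type_segment]
    if doc_id is not None:
        parts.append(str(doc_id))
    return "/" + "/".join(parts)


# Typeless request, as in the cURL example from the documentation.
print(endpoint_for("documents"))                       # /documents/_doc
# Request with a custom type, as in es.index(..., doc_type="document").
print(endpoint_for("documents", doc_type="document"))  # /documents/document
```

With the second path, Elasticsearch sees a write to a type named document, for which no lda mapping exists, so it tries to map the field dynamically. That would explain the parseDynamicValue frame in your stack trace and the complaint about a missing dims property.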

To solve your problem, simply drop the doc_type: remove the doc_type argument when indexing documents, and do not put a type name into the mapping when creating the index.
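A minimal sketch of what that could look like, assuming the 7.x Python client and an index with only the lda field (your other fields are omitted). The function names `build_mapping` and `create_and_index` are made up for this example; `es` is expected to be an `Elasticsearch` client instance.

```python
def build_mapping(dims=150):
    """Typeless mapping body for ES 7.x: 'properties' sits directly under 'mappings'."""
    return {
        "mappings": {
            "properties": {
                # Only the vector field is shown; add the other fields here.
                "lda": {"type": "dense_vector", "dims": dims},
            }
        }
    }


def create_and_index(es, index_name, document):
    """Create the index with an explicit mapping, then index one document."""
    es.indices.create(index=index_name, body=build_mapping())
    # No doc_type argument, so no custom type name ends up in the request.
    return es.index(index=index_name, body=document)
```

Against a running cluster you would call it roughly as `create_and_index(Elasticsearch(hosts="localhost:9200"), "documents", document_data)`, after deleting the existing documents index so that the creation does not fail.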