Elasticsearch: How to store term vectors
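I'm working on a project that makes heavy use of Elasticsearch and relies on the moreLikeThis query for some of its features. The "How it works" section of the official MLT query documentation states:
In order to speed up analysis, it could help to store term vectors at
index time, but at the expense of disk usage.
So the idea is to adjust the mapping so that precomputed term vectors are stored. The problem is that the documentation doesn't make clear how this should be done. On the one hand, the MLT documentation gives an example mapping like this: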
curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields" : {
            "raw": {
              "type" : "string",
              "index" : "not_analyzed",
              "term_vector" : "yes"
            }
          }
        }
      }
    }
  }
}'
On the other hand, the Term Vectors documentation gives a mapping in its Example 1 section that looks like this:
curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer" : "fulltext_analyzer"
        }
      }
    }
....
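This is supposed to "create an index that stores term vectors, payloads etc."
So the question is: which mapping should be used? Is this a flaw in the documentation, or am I missing something?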
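You're right, the current version of the documentation doesn't seem to spell this out, but the documentation for the upcoming 2.0 release explains it in more detail: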
Term vectors contain information about the terms produced by the
analysis process, including:
- a list of terms.
- the position (or order) of each term.
- the start and end character offsets mapping the term to its origin in the original string.
These term vectors can be stored so that they can be retrieved for a
particular document.
The term_vector setting accepts:
- no: No term vectors are stored. (default)
- yes: Just the terms in the field are stored.
- with_positions: Terms and positions are stored.
- with_offsets: Terms and character offsets are stored.
- with_positions_offsets: Terms, positions, and character offsets are stored.
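To make the options above concrete, here is a minimal sketch of a mapping that sets term_vector per field, in the same curl style as the examples in the question. It is not taken from the documentation: the index name imdb_tv_demo is hypothetical, the node at localhost:9200 is an assumption, and the "string" field type assumes the same older Elasticsearch versions as the mappings quoted in the question. The title field follows the MLT documentation's example and stores only the terms, while description stores positions and offsets as well, at a higher disk cost.

```shell
#!/bin/sh
# Sketch only: assumes an Elasticsearch node on localhost:9200 that still
# supports the "string" field type, as in the mappings quoted in the question.
ES_URL='http://localhost:9200'   # assumed local node
INDEX='imdb_tv_demo'             # hypothetical index name

# "yes" stores only the terms of the field.
# "with_positions_offsets" additionally stores positions and character offsets.
MAPPING='{
  "mappings": {
    "movies": {
      "properties": {
        "title":       { "type": "string", "term_vector": "yes" },
        "description": { "type": "string", "term_vector": "with_positions_offsets" }
      }
    }
  }
}'

# Create the index with the mapping.
curl -s --max-time 5 -XPUT "$ES_URL/$INDEX/" -d "$MAPPING" \
  || echo "PUT failed (is Elasticsearch running?)"

# Read the mapping back to check which term_vector settings were accepted.
curl -s --max-time 5 -XGET "$ES_URL/$INDEX/_mapping?pretty" \
  || echo "GET failed (is Elasticsearch running?)"
```

If the second request echoes the term_vector values back for both fields, the index is storing term vectors at index time. Changing the option on a field is then a one-word edit in MAPPING, applied to a fresh index.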