Elasticsearch Nested More Like This Query
Is it possible to perform a More Like This query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html) on text inside a nested datatype (https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html)?
The document I want to query (I have no control over its format, since the data is owned by another party) looks like this:
{
  "communicationType": "Email",
  "timestamp": 1497633308917,
  "textFields": [
    {
      "field": "Subject",
      "text": "This is the subject of the email"
    },
    {
      "field": "To",
      "text": "to-email@domain.com"
    },
    {
      "field": "Body",
      "text": "This is the body of the email"
    }
  ]
}
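For a `nested` query to work against this shape, `textFields` has to be mapped as a nested type. The real mapping belongs to the data owner and is not shown in this post, so the following is only a sketch of what it might look like, written as a Python dict that could be serialized and sent when creating an index. The `keyword` type for `field`, the `text` type for `text`, and the `epoch_millis` date format are all assumptions inferred from the sample document. The typeless layout is the one accepted by Elasticsearch 7 and later; older versions expect an extra mapping type level.

```python
import json

# Hypothetical mapping for the document shape above; the actual mapping is
# controlled by the data owner and may differ.
NESTED_MAPPING = {
    "mappings": {
        "properties": {
            "communicationType": {"type": "keyword"},
            # Assumed: the sample timestamp looks like milliseconds since epoch.
            "timestamp": {"type": "date", "format": "epoch_millis"},
            "textFields": {
                # "nested" keeps each array entry as its own hidden document,
                # so "field" and "text" of one entry stay paired.
                "type": "nested",
                "properties": {
                    "field": {"type": "keyword"},
                    "text": {"type": "text"},
                },
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(NESTED_MAPPING, indent=2))
```

If `field` were instead mapped as analyzed `text`, a `term` filter on "Body" would likely not match, because the standard analyzer lowercases tokens at index time.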
I want to perform a More Like This query on the email body. Previously, the documents looked like this:
{
  "communicationType": "Email",
  "timestamp": 1497633308917,
  "textFields": {
    "subject": "This is the subject of the email",
    "to": "to-email@domain.com",
    "body": "This is the body of the email"
  }
}
and I was able to perform a More Like This query on the email body like this:
{
  "query": {
    "bool": {
      "must": {
        "more_like_this": {
          "fields": ["textFields.body"],
          "like": "This is a similar body of an email",
          "min_term_freq": 1
        }
      },
      "filter": [
        { "term": { "communicationType": "Email" } },
        { "range": { "timestamp": { "gte": 1497633300000 } } }
      ]
    }
  }
}
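But that data source has now been deprecated, and I need to run an equivalent query against the new data source, which stores the email body inside a nested datatype. I only want to compare the text against the "text" value of the entry whose "field" is "Body".
Is this possible? If so, what would the query look like? And is there a significant performance penalty for querying a nested datatype compared with the earlier non-nested documents? Even after applying the timestamp and communication type filters, there will still be tens of millions of documents that each query has to compare the like text against, so performance matters.
Actually, it turns out that using a More Like This query inside a nested query is quite straightforward: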
{
  "query": {
    "bool": {
      "must": {
        "nested": {
          "path": "textFields",
          "query": {
            "bool": {
              "must": {
                "more_like_this": {
                  "fields": ["textFields.text"],
                  "like": "This is a similar body of an email",
                  "min_term_freq": 1
                }
              },
              "filter": {
                "term": { "textFields.field": "Body" }
              }
            }
          }
        }
      },
      "filter": [
        {
          "term": {
            "communicationType": "Email"
          }
        },
        {
          "range": {
            "timestamp": {
              "gte": 1497633300000
            }
          }
        }
      ]
    }
  },
  "min_score": 2
}
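To reuse this query with different input text, the request body can be built programmatically. The sketch below is one possible way to do that in Python. The function name `nested_mlt_query` and its parameters are made up for illustration, and it uses the `like` parameter, as in the earlier non-nested query. It only constructs the JSON body and does not talk to a cluster.

```python
import json


def nested_mlt_query(like_text, since_millis, min_score=2):
    """Build the nested More Like This request body as a plain dict.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Inner query: evaluated per nested object, so both clauses must hold
    # for the same textFields entry (field == "Body" AND similar text).
    inner = {
        "bool": {
            "must": {
                "more_like_this": {
                    "fields": ["textFields.text"],
                    "like": like_text,
                    "min_term_freq": 1,
                }
            },
            "filter": {"term": {"textFields.field": "Body"}},
        }
    }
    return {
        "query": {
            "bool": {
                "must": {"nested": {"path": "textFields", "query": inner}},
                # Filters on the parent document; they do not affect scoring.
                "filter": [
                    {"term": {"communicationType": "Email"}},
                    {"range": {"timestamp": {"gte": since_millis}}},
                ],
            }
        },
        "min_score": min_score,
    }


if __name__ == "__main__":
    body = nested_mlt_query("This is a similar body of an email", 1497633300000)
    print(json.dumps(body, indent=2))
```

The resulting dict serializes to the same structure as the query above and could be sent as the body of a `_search` request against the relevant index.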