How do I implement a PatternAnalyzer in elastic4s and Elasticsearch to exclude results with a certain field?
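I am trying to query my index and get all reviews whose reviewer does not have a Gravatar picture. To do that, I implemented a PatternAnalyzerDefinition with a host pattern:
"^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)"
which should match and extract the URL host, so that for example:
https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm
becomes:
www.gravatar.com
Mapping: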
clientProvider.getClient.execute {
  create.index(_index).analysis(
    phraseAnalyzer,
    // Triple-quoted string so the regex backslashes are not read as Scala escapes
    PatternAnalyzerDefinition("host_pattern", regex = """^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)""")
  ).mappings(
    "reviews" as (
      .... Cool mappings
      "review" inner (
        "grade" typed LongType,
        "text" typed StringType index "not_analyzed",
        "reviewer" inner (
          "screenName" typed StringType index "not_analyzed",
          "profilePicture" typed StringType analyzer "host_pattern",
          "thumbPicture" typed StringType index "not_analyzed",
          "points" typed LongType index "not_analyzed"
        ),
        .... Other cool mappings
      )
    ) all(false)
  )
} map { response =>
  Logger.info("Create index response: {}", response)
} recover {
  case t: Throwable => play.Logger.error("Error creating index: ", t)
}
Query:
val reviewQuery = (search in path)
  .query(
    bool(
      must(
        not(
          termQuery("review.reviewer.profilePicture", "www.gravatar.com")
        )
      )
    )
  )
  .postFilter(
    bool(
      must(
        rangeFilter("review.grade") from 3
      )
    )
  )
  .size(size)
  .sort(by field "review.created" order SortOrder.DESC)
clientProvider.getClient.execute {
  reviewQuery
}.map(_.getHits.jsonToList[ReviewData])
Checking the index mapping:
reviewer: {
  properties: {
    id: {
      type: "long"
    },
    points: {
      type: "long"
    },
    profilePicture: {
      type: "string",
      analyzer: "host_pattern"
    },
    screenName: {
      type: "string",
      index: "not_analyzed"
    },
    state: {
      type: "string"
    },
    thumbPicture: {
      type: "string",
      index: "not_analyzed"
    }
  }
}
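When I run the query, the pattern matching does not seem to work: I still get reviews from reviewers who have a Gravatar picture.
What am I doing wrong? Have I perhaps misunderstood the PatternAnalyzer?
I am using
"com.sksamuel.elastic4s" %% "elastic4s" % "1.5.9",
I guess RTFM applies once again.
The docs state:
IMPORTANT: The regular expression should match the token separators, not the tokens themselves.
This means that in my case the matched text, which contains www.gravatar.com, is consumed as a separator and will not be part of the tokens after the field has been analyzed.
Instead, use the Pattern Capture Token Filter.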
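To make the separator behaviour concrete, here is a minimal sketch that uses only java.util.regex, not Elasticsearch itself. It assumes the pattern analyzer behaves roughly like splitting the input on every match of the pattern; the object and method names are made up for illustration.

```scala
import java.util.regex.Pattern

object SeparatorDemo {
  // Same host pattern as in the question, as a triple-quoted string.
  val hostPattern: Pattern =
    Pattern.compile("""^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)""")

  // Approximation of a separator-based analyzer: each match is discarded
  // and only the text between matches is kept as tokens.
  def tokensBySplitting(input: String): List[String] =
    hostPattern.split(input).toList.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val url = "https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm"
    // The host is part of the match, so it is dropped rather than emitted.
    println(tokensBySplitting(url)) // prints List(avatar/blablalbla?s=200&r=pg&d=mm)
  }
}
```

The host never shows up among the tokens, which is why the term query on www.gravatar.com cannot match.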
First declare a new CustomAnalyzerDefinition:
val hostAnalyzer = CustomAnalyzerDefinition(
  "host_analyzer",
  StandardTokenizer,
  PatternCaptureTokenFilter(
    name = "hostFilter",
    // Triple-quoted string so the regex backslashes are not read as Scala escapes
    patterns = List("""^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)"""),
    preserveOriginal = false
  )
)
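For comparison, the sketch below shows what capture group 1 of the same pattern yields when it is applied to a complete URL. It is plain java.util.regex and only mimics the capture step on a token the filter receives in full; how the tokenizer splits the field beforehand is not modelled, and the names are hypothetical.

```scala
import java.util.regex.Pattern

object CaptureDemo {
  val hostPattern: Pattern =
    Pattern.compile("""^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)""")

  // Returns capture group 1 (the host) if the pattern matches the token.
  def capturedHost(token: String): Option[String] = {
    val m = hostPattern.matcher(token)
    if (m.find()) Some(m.group(1)) else None
  }

  def main(args: Array[String]): Unit = {
    val url = "https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm"
    println(capturedHost(url))         // prints Some(www.gravatar.com)
    println(capturedHost("not a url")) // prints None
  }
}
```

Here the captured group, not the text around the match, is what gets kept.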
Then add the analyzer to the field:
"review" inner (
  "reviewer" inner (
    "screenName" typed StringType index "not_analyzed",
    // The name must match the one passed to CustomAnalyzerDefinition
    "profilePicture" typed StringType analyzer "host_analyzer",
    "thumbPicture" typed StringType index "not_analyzed",
    "points" typed LongType index "not_analyzed"
  )
)
and register the analyzer when creating the index:
create.index(_index).analysis(
  someAnalyzer,
  phraseAnalyzer,
  hostAnalyzer
).mappings(
Voilà, it works. A very handy way to inspect the tokens stored in the index is to call:
/[index]/[collection]/[id]/_termvector?fields=review.reviewer.profilePicture&pretty=true