How do I implement a PatternAnalyzer in elastic4s and Elasticsearch to exclude results with a certain field value?

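I am trying to run a query against my index and fetch all reviews whose reviewer does not have a Gravatar profile picture. To do this, I implemented a PatternAnalyzerDefinition with a host pattern: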

"^https?\:\/\/([^\/?#]+)(?:[\/?#]|$)"

This should match and extract the URL host, so that for example:

https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm

becomes:

www.gravatar.com

Mapping:

clientProvider.getClient.execute {
  create.index(_index).analysis(
    phraseAnalyzer,
    // ':' and '/' need no escaping in a Java regex, and "\:" or "\/" are not valid escapes in a Scala string literal
    PatternAnalyzerDefinition("host_pattern", regex = "^https?://([^/?#]+)(?:[/?#]|$)")
  ).mappings(
    "reviews" as (
      .... Cool mappings
      "review" inner (
        "grade" typed LongType,
        "text" typed StringType index "not_analyzed",
        "reviewer" inner (
          "screenName" typed StringType index "not_analyzed",
          "profilePicture" typed StringType analyzer "host_pattern",
          "thumbPicture" typed StringType index "not_analyzed",
          "points" typed LongType index "not_analyzed"
        ),
        .... Other cool mappings
      )
    ) all(false)
  )
} map { response =>
  Logger.info("Create index response: {}", response)
} recover {
  case t: Throwable => play.Logger.error("Error creating index: ", t)
}

Query:

val reviewQuery = (search in path)
      .query(
        bool(
          must(
            not(
              termQuery("review.reviewer.profilePicture", "www.gravatar.com")
            )
          )
        )
      )
      .postFilter(
        bool(
          must(
            rangeFilter("review.grade") from 3
          )
        )
      )
      .size(size)
      .sort(by field "review.created" order SortOrder.DESC)

    clientProvider.getClient.execute {      
      reviewQuery
    }.map(_.getHits.jsonToList[ReviewData])

Inspecting the mapping of the index:

reviewer: {
    properties: {
        id: {
            type: "long"
        },
        points: {
            type: "long"
        },
        profilePicture: {
            type: "string",
            analyzer: "host_pattern"
        },
        screenName: {
            type: "string",
            index: "not_analyzed"
        },
        state: {
            type: "string"
        },
        thumbPicture: {
            type: "string",
            index: "not_analyzed"
        }
    }
}

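When I run the query, the pattern matching does not seem to work: I still get reviews from reviewers with a Gravatar picture. What am I doing wrong? Maybe I have misunderstood the PatternAnalyzer?

I am using "com.sksamuel.elastic4s" %% "elastic4s" % "1.5.9".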

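I guess RTFM applies once again.

The docs state:

IMPORTANT: The regular expression should match the token separators, not the tokens themselves.

This means that in my case the matched text, www.gravatar.com, is treated as a separator and will not be among the tokens produced when the field is analyzed.

Instead, use the Pattern Capture Token Filter.

First declare a new CustomAnalyzerDefinition: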
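To see the difference concretely, here is a minimal sketch in plain Scala that uses java.util.regex rather than Elasticsearch itself. The object and method names (HostPatternDemo, splitTokens, capturedTokens) are made up for illustration, and the code only approximates the two behaviours: the real pattern analyzer also lowercases tokens by default, for instance.

```scala
import java.util.regex.Pattern

object HostPatternDemo {
  // Same host pattern as in the question, without the unnecessary escapes.
  val hostPattern: Pattern = Pattern.compile("^https?://([^/?#]+)(?:[/?#]|$)")

  // Pattern-analyzer style: the regex marks separators, so the matched text is thrown away.
  def splitTokens(input: String): List[String] =
    hostPattern.split(input).toList.filter(_.nonEmpty)

  // Pattern-capture style: every capture group of every match becomes a token.
  def capturedTokens(input: String): List[String] = {
    val m = hostPattern.matcher(input)
    val out = List.newBuilder[String]
    while (m.find()) {
      for (g <- 1 to m.groupCount) Option(m.group(g)).foreach(out += _)
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val url = "https://www.gravatar.com/avatar/blablalbla?s=200&r=pg&d=mm"
    println(splitTokens(url))    // List(avatar/blablalbla?s=200&r=pg&d=mm)
    println(capturedTokens(url)) // List(www.gravatar.com)
  }
}
```

Splitting on the pattern discards the host and keeps the rest of the URL, which is why a term query for www.gravatar.com never matched; capturing group 1 yields the host itself.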

val hostAnalyzer = CustomAnalyzerDefinition(
    "host_analyzer",
    StandardTokenizer,
    PatternCaptureTokenFilter(
      name = "hostFilter",
      patterns = List[String]("^https?://([^/?#]+)(?:[/?#]|$)"),
      preserveOriginal = false
    )
  )

Then add the analyzer to the field, referring to it by the name it was defined with ("host_analyzer"):

"review" inner (
  "reviewer" inner (
    "screenName" typed StringType index "not_analyzed",
    "profilePicture" typed StringType analyzer "host_analyzer",
    "thumbPicture" typed StringType index "not_analyzed",
    "points" typed LongType index "not_analyzed"
  )
)

create.index(_index).analysis(
            someAnalyzer,
            phraseAnalyzer,
            hostAnalyzer
          ).mappings(

Voila, it works. A very handy way to inspect the tokens that actually ended up in the index is to call:

/[index]/[collection]/[id]/_termvector?fields=review.reviewer.profilePicture&pretty=true