How to search ElasticSearch via NEST and filter results for distinct entries on a custom property

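I'm using NEST in a .NET application that, among other things, tracks locations and stores them in ElasticSearch. These TrackedLocations (see the simplified model below) all have a UserId, and each UserId has many of these TrackedLocations indexed.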

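What I now want to find and query are all TrackedLocations near a given Lat/Lon and radius combination, but I only want the most recent one per user... so basically a 'distinct' filter on UserId, ordered by LocatedAtUtc.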

I could of course fetch all documents and post-process/filter them via Linq etc., but if NEST/ES can do this natively I would much prefer that.

A variation of that query would be just the count of these distinct instances, e.g. how many (distinct per user) are there at any given lat/lon/radius?

The model looks similar to this:

public class TrackedLocation
{
    public Guid Id { get; set; }
    public Guid UserId { get; set; }
    public MyLocation Location { get; set; }
    public DateTime LocatedAtUtc { get; set; }
}

public class MyLocation
{
    public double Lat { get; set; }
    public double Lon { get; set; }
}

.. the MyLocation type is just for illustration.

Is this possible with an ES/NEST query, and if so, how?

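So, to answer my own question: after digging a bit deeper into ES aggregations, I found the following solution (via NEST) to be the most practical and leanest version that gives me what I wanted above: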

var userIdsAggregationForLast24HoursAndLocation = elasticClient.Search<TrackedLocation>(search => search
                .Index(indexName)
                .MatchAll()
                .Source(false)
                .TrackScores(false)
                .Size(0)
                .Aggregations(aggregationDescriptor => aggregationDescriptor
                    .Filter("trackedLocationsFromThePast24HoursAtGivenLocation", descriptor => descriptor
                        .Filter(filterDescriptor => filterDescriptor
                            .And(
                                combinedFilter => combinedFilter
                                    .Range(dateRangeFilter => dateRangeFilter
                                        .GreaterOrEquals(DateTime.UtcNow.Subtract(TimeSpan.FromDays(1))) // filter on instances created/indexed in the past 24 hours
                                        .OnField(trackedLocation => trackedLocation.CreatedAtUtc)),
                                combinedFilter => combinedFilter // and the second filter here is the GeoDistance one.. 1km away from a given .Location(...,...)
                                    .GeoDistance(trackedLocation => trackedLocation.Location, geoDistanceFilterDescriptor => geoDistanceFilterDescriptor
                                        .Distance(1, GeoUnit.Kilometers)
                                        .Location(37.809860, -122.476995)
                                        .Optimize(GeoOptimizeBBox.Indexed))))
                        .Aggregations(userIdAggregate => userIdAggregate.Terms("userIds", userIdTermsFilter => userIdTermsFilter
                            .Field(trackedLocation => trackedLocation.UserId)
                            .Size(100)))))); // get X distinct .UserIds

What really matters are the nested aggregations:

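  • The first is a .Filter() aggregation that combines two filters via .And(): the first part is a DateTime Range filter (in this case only the past 24 hours are relevant), and the second is a GeoDistance filter (I only want instances near a given location).
  • The second, nested inside that .Filter(), is the .Terms() aggregation, which provides the actual UserIds.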

A raw sample request looks like this:

{
  "size": 0,
  "track_scores": false,
  "_source": {
    "exclude": [
      "*"
    ]
  },
  "aggs": {
    "trackedLocationsFromThePast24HoursAtGivenLocation": {
      "filter": {
        "and": {
          "filters": [
            {
              "range": {
                "createdAtUtc": {
                  "gte": "2015-07-18T07:25:05.992"
                }
              }
            },
            {
              "geo_distance": {
                "location": "37.80986, -122.476995",
                "distance": 1.0,
                "unit": "km",
                "optimize_bbox": "indexed"
              }
            }
          ]
        }
      },
      "aggs": {
        "userIds": {
          "terms": {
            "field": "userId",
            "size": 100
          }
        }
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

.. and the raw response from ES looks, for example, like this:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 100,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "trackedLocationsFromThePast24HoursAtGivenLocation": {
      "doc_count": 12,
      "userIds": {
        "buckets": [
          {
            "key": "0a50c2b4-17f0-41bc-b380-f8fca8ca117c",
            "doc_count": 5
          },
          {
            "key": "6b59efd8-a1f9-43c4-86a1-8560b908705f",
            "doc_count": 5
          },
          {
            "key": "667fb1c9-4c9c-4570-8bc1-f61d72e4385f",
            "doc_count": 1
          },
          {
            "key": "73e93ec8-622b-42e3-8a1c-96a0a2b3b2b2",
            "doc_count": 1
          }
        ]
      }
    }
  }
}

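As you can see, in this example there are 100 tracked locations in total. 12 of them were created and indexed in the past day near the given location, by a total of 4 distinct users (Ids): two of them created 5 each, and the other two created one location each.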
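For the count-only variant from the question, a cardinality sub-aggregation in place of the terms aggregation may be enough. The request below is a sketch, not part of the tested code above. It assumes an ES version that supports the cardinality aggregation (1.1 or later) and reuses the field names and filters from the raw request shown earlier. The name distinctUserCount is made up for illustration.

```json
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "trackedLocationsFromThePast24HoursAtGivenLocation": {
      "filter": {
        "and": {
          "filters": [
            {
              "range": {
                "createdAtUtc": {
                  "gte": "2015-07-18T07:25:05.992"
                }
              }
            },
            {
              "geo_distance": {
                "location": "37.80986, -122.476995",
                "distance": 1.0,
                "unit": "km",
                "optimize_bbox": "indexed"
              }
            }
          ]
        }
      },
      "aggs": {
        "distinctUserCount": {
          "cardinality": {
            "field": "userId"
          }
        }
      }
    }
  }
}
```

The response then carries a single value under distinctUserCount instead of a bucket list. Keep in mind that cardinality returns an approximate count, so it is best suited to cases where a small error on large user sets is acceptable.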

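This is/was exactly what I expected. I don't really care about scores or the sources/documents themselves; as said above, I only care about the TrackedLocations that fall within the filter, and of those I want the distinct list of UserIds.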
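If the most recent location per user is needed as well, as the question originally asked, a top_hits sub-aggregation under the terms buckets is one option. This is a sketch under assumptions: it requires ES 1.3 or later, it assumes the timestamp is indexed as locatedAtUtc (the simplified model's LocatedAtUtc property in camel case), and latestLocation is a hypothetical name. Only the inner "aggs" object of the raw request above would change, to something like:

```json
{
  "userIds": {
    "terms": {
      "field": "userId",
      "size": 100
    },
    "aggs": {
      "latestLocation": {
        "top_hits": {
          "size": 1,
          "sort": [
            {
              "locatedAtUtc": {
                "order": "desc"
              }
            }
          ]
        }
      }
    }
  }
}
```

Each userIds bucket would then contain a latestLocation.hits.hits array holding that user's newest matching document, at the price of shipping one document per bucket back to the client.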

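Following @MartijnLaarman's initial suggestion, and after reading a bit more about aggregations' memory consumption, I decided to try the Parent/Child approach he suggested, which gives the same result I was after.. not using aggregations, but just filtering on the Parent/Child relationship.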

The model setup now looks similar to this:

elasticClient.CreateIndex(indexName, descriptor => descriptor
    .NumberOfReplicas(0)
    .NumberOfShards(1)
    .AddMapping<User>(new RootObjectMapping // I use TTL for testing/dev purposes to auto-cleanup after me
    {
        AllFieldMapping = new AllFieldMapping { Enabled = false },
        TtlFieldMappingDescriptor = new TtlFieldMapping { Enabled = true, Default = "5m" }
    },
    userDescriptor => userDescriptor.MapFromAttributes())
    .AddMapping<TrackedLocation>(new RootObjectMapping // I use TTL for testing/dev purposes to auto-cleanup after me
    {
        AllFieldMapping = new AllFieldMapping { Enabled = false },
        TtlFieldMappingDescriptor = new TtlFieldMapping { Enabled = true, Default = "5m" }
    },
    trackedLocationDescriptor => trackedLocationDescriptor
            .MapFromAttributes()
            .Properties(propertiesDescriptor => propertiesDescriptor
                .GeoPoint(geoPointMappingDescriptor => geoPointMappingDescriptor.Name(post => post.Location).IndexLatLon()))
                .SetParent<User>())); // < that's the essential part right here to allow the filtered query below

When indexing new TrackedLocation instances, I set the parent (User) like this:

elasticClient.Index(trackedLocation, descriptor => descriptor
                    .Index(indexName)
                    .Parent(parent.Id.ToString()));

The actual filtered query looks like this:

    var userIdsFilteredQueryForLast24HoursAndLocation = elasticClient.Search<User>(search => search
        .Index(indexName)
        .MatchAll()
        .Source(false)
        .TrackScores(false)
        .Filter(outerFilter => outerFilter.HasChild<TrackedLocation>(childFilterDescriptor => childFilterDescriptor
            .Filter(filterDescriptor => filterDescriptor
                .And(
                    andCombinedFilter1 => andCombinedFilter1
                        .Range(dateRangeFilter => dateRangeFilter
                            .GreaterOrEquals(DateTime.UtcNow.Subtract(TimeSpan.FromDays(1))) // filter on instances created/indexed in the past 24 hours
                            .OnField(trackedLocation => trackedLocation.CreatedAtUtc)),
                    andCombinedFilter2 => andCombinedFilter2 // and the second filter here is the GeoDistance one.. 1km away from a given .Location(...,...)
                        .GeoDistance(trackedLocation => trackedLocation.Location, geoDistanceFilterDescriptor => geoDistanceFilterDescriptor
                            .Distance(1, GeoUnit.Kilometers)
                            .Location(37.809860, -122.476995)
                            .Optimize(GeoOptimizeBBox.Indexed)))))));

So the raw request looks like this:

{
  "track_scores": false,
  "_source": {
    "exclude": [
      "*"
    ]
  },
  "query": {
    "match_all": {}
  },
  "filter": {
    "has_child": {
      "type": "trackedlocation",
      "filter": {
        "and": {
          "filters": [
            {
              "range": {
                "createdAtUtc": {
                  "gte": "2015-07-18T08:58:02.664"
                }
              }
            },
            {
              "geo_distance": {
                "location": "37.80986, -122.476995",
                "distance": 1.0,
                "unit": "km",
                "optimize_bbox": "indexed"
              }
            }
          ]
        }
      }
    }
  }
}

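The search itself targets User instances, combined with a .HasChild filter. That filter applies the same logic as the aggregation (by date and location).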

As an example, the raw response looks like this:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "sampleindex",
        "_type": "user",
        "_id": "54ccbccd-eb2a-4a93-9be3-363b83cd3b21",
        "_score": 1.0,
        "_source": {}
      },
      {
        "_index": "locationtracking____sampleindex",
        "_type": "user",
        "_id": "42482b3b-d4c7-4a92-bf59-a4c25d707835",
        "_score": 1.0,
        "_source": {}
      }
    ]
  }
}

.. which returns the (correct) set of user (Id) hits for users that have TrackedLocations at the given location within the past day. Perfect!
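For the count-only variant, the same parent/child request can be sent with "size": 0, in which case no hits are returned and hits.total alone carries the number of matching users. A sketch that reuses the raw request above unchanged apart from the added size:

```json
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "filter": {
    "has_child": {
      "type": "trackedlocation",
      "filter": {
        "and": {
          "filters": [
            {
              "range": {
                "createdAtUtc": {
                  "gte": "2015-07-18T08:58:02.664"
                }
              }
            },
            {
              "geo_distance": {
                "location": "37.80986, -122.476995",
                "distance": 1.0,
                "unit": "km",
                "optimize_bbox": "indexed"
              }
            }
          ]
        }
      }
    }
  }
}
```

Because each user is a single parent document, hits.total here is an exact distinct-user count rather than an estimate.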

I'll stick with this solution over the aggregation one for now. It comes at the "cost" of a parent/child relationship in ES, but overall it seems more logical.