How to return the latest records by timestamp, grouped by key?

I have a dataset similar to this:

{"user":333,"product":943, "rating":2.025743791177902, "timestamp":1481675659}
{"user":333,"product":3074,"rating":2.1070657532324493,"timestamp":1481675178}
{"user":333,"product":3074,"rating":2.108323259636257, "timestamp":1481673546}
{"user":333,"product":943, "rating":2.0211849667268353,"timestamp":1481675178}
{"user":333,"product":943, "rating":2.041045323231024, "timestamp":1481673546}
{"user":333,"product":119, "rating":2.1832303461543163,"timestamp":1481675659}
{"user":333,"product":119, "rating":2.1937538029700203,"timestamp":1481673546}
{"user":111,"product":123, ...

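I want to query all records for one user (e.g. 333), but return only the record with the latest timestamp for each product. For example, given the data above, the query would return: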

{"user":333,"product":119, "rating":2.1832303461543163,"timestamp":1481675659}     
{"user":333,"product":3074,"rating":2.1070657532324493,"timestamp":1481675178}
{"user":333,"product":943, "rating":2.025743791177902, "timestamp":1481675659}

The equivalent SQL query would look something like this:

SELECT L.*
FROM recommendations L
LEFT JOIN recommendations R ON
          L.user = R.user AND
          L.product = R.product AND
          L.timestamp < R.timestamp
WHERE R.user IS NULL AND R.product IS NULL

Is this feasible with a map/reduce index? If so, how? If not, is there an alternative, such as a Lucene index?

Ideally I would also like to sort by rating value.

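Cloudant/CouchDB MapReduce can produce aggregated counts/sums/stats for composite keys, e.g.

  • the number of entries grouped by user and product
  • the average rating grouped by user and product

But it cannot return "the latest rating" grouped by user and product.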
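
As a rough illustration of the grouped aggregates described above, the following plain-Python sketch simulates a view keyed on (user, product) with a count/average reduce. It is not Cloudant or CouchDB code: the function name `grouped_stats` is made up for this example, and the rows are the seven complete ones from the question.

```python
from collections import defaultdict

# The seven complete sample rows from the question.
ROWS = [
    {"user": 333, "product": 943, "rating": 2.025743791177902, "timestamp": 1481675659},
    {"user": 333, "product": 3074, "rating": 2.1070657532324493, "timestamp": 1481675178},
    {"user": 333, "product": 3074, "rating": 2.108323259636257, "timestamp": 1481673546},
    {"user": 333, "product": 943, "rating": 2.0211849667268353, "timestamp": 1481675178},
    {"user": 333, "product": 943, "rating": 2.041045323231024, "timestamp": 1481673546},
    {"user": 333, "product": 119, "rating": 2.1832303461543163, "timestamp": 1481675659},
    {"user": 333, "product": 119, "rating": 2.1937538029700203, "timestamp": 1481673546},
]


def grouped_stats(rows):
    """Simulate a grouped reduce: one summary per (user, product) key."""
    groups = defaultdict(list)
    for row in rows:
        # "Map" step: emit the rating under the composite key.
        groups[(row["user"], row["product"])].append(row["rating"])
    # "Reduce" step: collapse each group to a count and an average.
    return {
        key: {"count": len(ratings), "average": sum(ratings) / len(ratings)}
        for key, ratings in groups.items()
    }


stats = grouped_stats(ROWS)
for key, summary in sorted(stats.items()):
    print(key, summary)
```

Each group collapses to summary numbers, so the individual record carrying the newest timestamp is not part of the reduced result.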

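A Lucene-based index won't help either. It lets you select data within a time window, e.g. "get my user ratings between timestamp X and timestamp Y that belong to user Z", but since Lucene-based indexes have no aggregation functions, there is still work to do in your application.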
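
The remaining application-side work could look roughly like the sketch below. It assumes the rows for one user have already been fetched (by whatever query you use) into a list of dicts shaped like the documents in the question. The function name `latest_per_product` is hypothetical, and sorting by rating in descending order is an assumption, since the question does not state a direction.

```python
# Rows for user 333, copied from the question; in practice these would be
# the documents returned by your query for that user.
ROWS = [
    {"user": 333, "product": 943, "rating": 2.025743791177902, "timestamp": 1481675659},
    {"user": 333, "product": 3074, "rating": 2.1070657532324493, "timestamp": 1481675178},
    {"user": 333, "product": 3074, "rating": 2.108323259636257, "timestamp": 1481673546},
    {"user": 333, "product": 943, "rating": 2.0211849667268353, "timestamp": 1481675178},
    {"user": 333, "product": 943, "rating": 2.041045323231024, "timestamp": 1481673546},
    {"user": 333, "product": 119, "rating": 2.1832303461543163, "timestamp": 1481675659},
    {"user": 333, "product": 119, "rating": 2.1937538029700203, "timestamp": 1481673546},
]


def latest_per_product(rows):
    """Keep the newest row per product, then sort by rating (descending)."""
    latest = {}
    for row in rows:
        current = latest.get(row["product"])
        # Replace the stored row only if this one has a newer timestamp.
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["product"]] = row
    return sorted(latest.values(), key=lambda r: r["rating"], reverse=True)


result = latest_per_product(ROWS)
for row in result:
    print(row)
```

This is a single pass over the fetched rows plus one sort, so the cost is dominated by how many rows the query returns for the user.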

Another solution is to export the data to a data warehouse such as DashDB and run the aggregating SQL query there.
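
To try out the SQL approach without a warehouse, here is a minimal sketch that uses Python's built-in `sqlite3` module as a local stand-in. It is not DashDB-specific. The table name `recommendations` comes from the question; the function name `latest_ratings`, the column types, the per-user filter, and the descending sort on rating are assumptions added for this example.

```python
import sqlite3

# (user, product, rating, timestamp) tuples copied from the question.
ROWS = [
    (333, 943, 2.025743791177902, 1481675659),
    (333, 3074, 2.1070657532324493, 1481675178),
    (333, 3074, 2.108323259636257, 1481673546),
    (333, 943, 2.0211849667268353, 1481675178),
    (333, 943, 2.041045323231024, 1481673546),
    (333, 119, 2.1832303461543163, 1481675659),
    (333, 119, 2.1937538029700203, 1481673546),
]

# Anti-join: keep a row only if no newer row exists for the same user/product.
QUERY = """
SELECT L.user, L.product, L.rating, L.timestamp
FROM recommendations L
LEFT JOIN recommendations R
       ON L.user = R.user
      AND L.product = R.product
      AND L.timestamp < R.timestamp
WHERE R.user IS NULL
  AND L.user = ?
ORDER BY L.rating DESC
"""


def latest_ratings(user):
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recommendations "
        "(user INTEGER, product INTEGER, rating REAL, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO recommendations VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY, (user,)).fetchall()
    conn.close()
    return rows


for row in latest_ratings(333):
    print(row)
```

The self-join is quadratic per (user, product) group in the worst case; on a real warehouse, an index on (user, product, timestamp) or a window function may be preferable, depending on what the engine supports.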