Using Cassandra to count a big list of data

Question

We are using Cassandra to compute various analytics metrics, broken down by account and date, and this seems to work well:

SELECT COUNT(page_impressions) FROM analytics WHERE account='abc' and MINUTE > '2015-01-01 00:00:00';

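We would like to break this data down further by domain, and this is where the problems start. Over a month or so, the number of possible domains for some accounts runs into the millions, and we are most interested in the 'top' domains, meaning we want to sort by the page_impressions field.

Could someone advise me on how to count by domain and sort by total page impressions?

Thanks!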

Answer 1

Cassandra supports counters, which are useful for building a list of top domains in a separate table.
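As a rough sketch of what such a counter table might look like: the table name `domain_impressions`, its columns, and the choice of one partition per account and day are all assumptions made for illustration, not part of the original schema. The CQL is held in Scala strings so it can be handed to whatever driver session is in use.

```scala
object DomainCounterCql {
  // Hypothetical table: one partition per (account, day), one counter row per domain.
  // In a counter table, every column outside the primary key must be a counter.
  val createTable: String =
    """CREATE TABLE IF NOT EXISTS domain_impressions (
      |  account text,
      |  day text,
      |  domain text,
      |  impressions counter,
      |  PRIMARY KEY ((account, day), domain)
      |)""".stripMargin

  // Counter columns are changed only by incrementing or decrementing in an UPDATE.
  val increment: String =
    """UPDATE domain_impressions
      |SET impressions = impressions + 1
      |WHERE account = ? AND day = ? AND domain = ?""".stripMargin

  // Reads every domain counter for one account and day.
  val readDay: String =
    """SELECT domain, impressions
      |FROM domain_impressions
      |WHERE account = ? AND day = ?""".stripMargin

  def main(args: Array[String]): Unit =
    Seq(createTable, increment, readDay).foreach(s => println(s + ";\n"))
}
```

Note that the counter value is not a clustering column, so CQL cannot `ORDER BY` it. With this layout, ranking the domains still has to happen on the client or in an analytics engine after the counters are read.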

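You may also be interested in using an analytics engine such as Presto or Spark alongside Cassandra, since bending your data model to fit different analytics use cases is often not very practical.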

Answer 2

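As Stefan said, I would definitely recommend Spark for this kind of analysis. Also, if at all possible, make sure you do not actually run a sort for Top N queries. These can usually be done without the shuffle that a sort requires, using functions such as: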

http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.rdd.RDD

takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
Returns the first k (smallest) elements from this RDD as defined by the specified implicit Ordering[T] and maintains the ordering. This does the opposite of top. For example:

sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
// returns Array(2)

sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
// returns Array(2, 3)
num: k, the number of elements to return
ord: the implicit ordering for T
returns: an array of top elements

and

top(num: Int)(implicit ord: Ordering[T]): Array[T]
Returns the top k (largest) elements from this RDD as defined by the specified implicit Ordering[T].
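To illustrate why a Top N query does not need a full sort, here is a minimal plain-Scala sketch of the general idea: each partition keeps only its own N largest entries in a bounded heap, and the small partial results are then merged. This is an illustration under assumptions, not Spark's actual implementation; the `TopDomains` object, the `(domain, total)` record shape, and the sample data are all hypothetical.

```scala
import scala.collection.mutable

object TopDomains {
  // Hypothetical record shape: (domain, total page impressions).
  type DomainTotal = (String, Long)

  // Keep only the n largest totals seen so far in a bounded min-heap,
  // so the full input is never sorted.
  def topN(totals: Iterator[DomainTotal], n: Int): List[DomainTotal] = {
    // Reversing the ordering makes the queue hand back the smallest total first.
    val byTotal = Ordering.by[DomainTotal, Long](_._2).reverse
    val heap = mutable.PriorityQueue.empty[DomainTotal](byTotal)
    totals.foreach { t =>
      heap.enqueue(t)
      if (heap.size > n) heap.dequeue() // evict the current smallest
    }
    // Only the n survivors are sorted, largest first.
    heap.toList.sortBy(t => -t._2)
  }

  // Each partition computes its own top n; only those partial results are merged.
  def topNAcrossPartitions(partitions: Seq[Seq[DomainTotal]], n: Int): List[DomainTotal] =
    topN(partitions.iterator.flatMap(p => topN(p.iterator, n)), n)

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq("a.example" -> 120L, "b.example" -> 45L, "c.example" -> 300L),
      Seq("d.example" -> 80L, "e.example" -> 500L)
    )
    println(topNAcrossPartitions(partitions, 2))
    // prints List((e.example,500), (c.example,300))
  }
}
```

In Spark itself, assuming an `RDD[(String, Long)]` of per-domain totals, the equivalent call would be along the lines of `totals.top(10)(Ordering.by[(String, Long), Long](_._2))`.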


Tags: sorting, counter, cql, cassandra, nosql