Spark simpler value_counts

Question

Is there something that would let me emulate the functionality of Pandas' df.series.value_counts() in Spark:

The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)

I am curious whether this can be achieved in a nicer/simpler way for DataFrames in Spark.

Answer 1

It is just a basic aggregation, isn't it?

df.groupBy($"value").count.orderBy($"count".desc)

Pandas:

import pandas as pd

pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()

2    3
3    2
4    1
1    1
dtype: int64

Spark SQL:

// assumes import spark.implicits._ is in scope (as it is in spark-shell)
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
  .groupBy($"value").count.orderBy($"count".desc)
  .show()

+-----+-----+
|value|count|
+-----+-----+
|    2|    3|
|    3|    2|
|    1|    1|
|    4|    1|
+-----+-----+
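To spell out what this aggregation computes, here is a minimal plain-Python sketch (no Spark or Pandas involved) of the same group, count, and sort-descending steps. The helper name value_counts and its dropna flag are made up for illustration, with None standing in for NA. One difference worth keeping in mind: Pandas drops NA values by default, whereas Spark's groupBy keeps null as a group of its own, so to match the Pandas default you would probably filter first, e.g. with df.filter($"value".isNotNull).

```python
from collections import Counter


def value_counts(values, dropna=True):
    """Hypothetical helper: count each distinct value, most frequent first."""
    if dropna:
        # Mirror the Pandas default of excluding NA; None stands in for NA here.
        values = [v for v in values if v is not None]
    # most_common() sorts by count in descending order;
    # values with equal counts keep their first-seen order.
    return Counter(values).most_common()


print(value_counts([1, 2, 2, 2, 3, 3, 4, None]))
# prints [(2, 3), (3, 2), (1, 1), (4, 1)]
```

Passing dropna=False keeps the None entry as its own group, which is closer to what the Spark aggregation does without a filter.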

If you want to include additional grouping columns (like "key"), just put them in the groupBy:

df.groupBy($"key", $"value").count.orderBy($"count".desc)

apache-spark

apache-spark-sql

apache-spark-dataset