Spark simpler value_counts
I would like something that lets me emulate in Spark the df.series.value_counts() functionality of Pandas:
The resulting object will be in descending order so that the first
element is the most frequently-occurring element. Excludes NA values
by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
I am curious whether this can be done better or more simply with Spark DataFrames.
It's just a basic aggregation, isn't it?
df.groupBy($"value").count.orderBy($"count".desc)
Pandas:
import pandas as pd
pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2 3
3 2
4 1
1 1
dtype: int64
Spark SQL:
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
.groupBy($"value").count.orderBy($"count".desc)
+-----+-----+
|value|count|
+-----+-----+
| 2| 3|
| 3| 2|
| 1| 1|
| 4| 1|
+-----+-----+
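One difference worth noting: value_counts in Pandas excludes NA values by default (as the quoted docs say), while groupBy in Spark produces a group for null as well. A minimal sketch to mirror the Pandas behavior, assuming the same single-column DataFrame df that may contain nulls, is to filter them out first:
df.where($"value".isNotNull)   // drop nulls to match Pandas' default NA handling
  .groupBy($"value")
  .count
  .orderBy($"count".desc)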
If you want to include additional grouping columns (such as "key"), just add them to the groupBy:
df.groupBy($"key", $"value").count.orderBy($"count".desc)