User defined aggregation function for Spark DataFrame (Python)
I have the Spark DataFrame below, where id is an int and attributes is a list of strings:
id | attributes
1 | ['a','c', 'd']
2 | ['a', 'e']
1 | ['e', 'f']
1 | ['g']
3 | ['a', 'b']
2 | ['e', 'g']
I need to perform an aggregation that concatenates the attribute lists for each id. The aggregated result would be:
id | concat(attributes)
1 | ['a', 'c', 'd', 'e', 'f', 'g']
2 | ['a', 'e', 'e', 'g']
3 | ['a', 'b']
Is there a way to achieve this using Python?
Thanks.
One way is to create a new DataFrame using reduceByKey:
>>> df.show()
+---+----------+
| id|attributes|
+---+----------+
| 1| [a, c, d]|
| 2| [a, e]|
| 1| [e, f]|
| 1| [g]|
| 3| [a, b]|
| 2| [e, g]|
+---+----------+
>>> custom_list = df.rdd.reduceByKey(lambda x,y:x+y).collect()
>>> new_df = sqlCtx.createDataFrame(custom_list, ["id", "attributes"])
>>> new_df.show()
+---+------------------+
| id| attributes|
+---+------------------+
| 1|[a, c, d, e, f, g]|
| 2| [a, e, e, g]|
| 3| [a, b]|
+---+------------------+
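This works because df.rdd yields Row objects that behave like (id, attributes) tuples, so reduceByKey groups by id and the lambda concatenates each id's lists. As a sanity check, here is a plain-Python sketch of the same per-key reduction (the helper `reduce_by_key` is a local stand-in for illustration, not a Spark API):

```python
# Sample rows as (id, attributes) pairs, mirroring what df.rdd yields
rows = [
    (1, ['a', 'c', 'd']),
    (2, ['a', 'e']),
    (1, ['e', 'f']),
    (1, ['g']),
    (3, ['a', 'b']),
    (2, ['e', 'g']),
]

def reduce_by_key(pairs, func):
    """Minimal local stand-in for RDD.reduceByKey: fold each key's
    values together with func (here, list concatenation)."""
    out = {}
    for k, v in pairs:
        out[k] = func(out[k], v) if k in out else v
    return sorted(out.items())

result = reduce_by_key(rows, lambda x, y: x + y)
# [(1, ['a','c','d','e','f','g']), (2, ['a','e','e','g']), (3, ['a','b'])]
```

Note that the transcript above assumes `sqlCtx` is an existing SQLContext (or SparkSession) in scope.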
reduceByKey(func, [numTasks]):
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
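The (V,V) => V requirement just means func takes two values and returns a value of the same type, so it can be applied pairwise across all values for a key. For id 1 in the example, the fold looks like this in plain Python (using functools.reduce locally, not Spark):

```python
from functools import reduce

# The func passed to reduceByKey must be (V, V) => V.
# For id 1 the values are three lists; folding them with
# lambda x, y: x + y concatenates them pairwise:
values_for_id_1 = [['a', 'c', 'd'], ['e', 'f'], ['g']]
merged = reduce(lambda x, y: x + y, values_for_id_1)
# ['a', 'c', 'd', 'e', 'f', 'g']
```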