在 pyspark 数据帧上减少和 Lambda
Reduce and Lambda on pyspark dataframe
下面是来自 https://graphframes.github.io/graphframes/docs/_site/user-guide.html
的示例
我唯一感到困惑的是条件函数中“lit(0)”的目的
如果这个“lit(0)”意味着输入“cnt”?如果是,为什么它在 ["ab","bc","cd"] 之后?
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import IntegerType
from graphframes.examples import Graphs
from functools import reduce
chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
chain4.show()
sumFriends = lambda cnt,relationship: when(relationship == "friend", cnt+1).otherwise(cnt)
condition = reduce(lambda cnt,e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))
chainWith2Friends2 = chain4.where(condition >= 2)
chainWith2Friends2.show()
lit(0)
是 reduce
语句的 initializer。您需要用 cnt = 0
初始化 sumFriends
计数器才能开始计数。
condition = reduce(lambda cnt,e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))
# should be equivalent to
condition = sumFriends(lit(0), col("ab").relationship)
condition = sumFriends(condition, col("bc").relationship)
condition = sumFriends(condition, col("cd").relationship)
下面是来自 https://graphframes.github.io/graphframes/docs/_site/user-guide.html
的示例我唯一感到困惑的是条件函数中“lit(0)”的目的 如果这个“lit(0)”意味着输入“cnt”?如果是,为什么它在 ["ab","bc","cd"] 之后?
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import IntegerType
from graphframes.examples import Graphs
from functools import reduce
chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
chain4.show()
sumFriends = lambda cnt,relationship: when(relationship == "friend", cnt+1).otherwise(cnt)
condition = reduce(lambda cnt,e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))
chainWith2Friends2 = chain4.where(condition >= 2)
chainWith2Friends2.show()
lit(0)
是 reduce
语句的 initializer。您需要用 cnt = 0
初始化 sumFriends
计数器才能开始计数。
condition = reduce(lambda cnt,e: sumFriends(cnt, col(e).relationship), ["ab", "bc", "cd"], lit(0))
# should be equivalent to
condition = sumFriends(lit(0), col("ab").relationship)
condition = sumFriends(condition, col("bc").relationship)
condition = sumFriends(condition, col("cd").relationship)