Summing multiple columns in Spark
How do you sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of two columns of df, I get an error.
# Create SparkDataFrame
df <- createDataFrame(faithful)
# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))
##This works
# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))
##This doesn't work
Either SparkR or PySpark code is fine.
You can do something like the following in pyspark:
>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([("a",1,10), ("b",2,20), ("c",3,30), ("d",4,40)], ["col1", "col2", "col3"])
>>> df.groupBy("col1").agg(F.sum(df.col2+df.col3)).show()
+----+------------------+
|col1|sum((col2 + col3))|
+----+------------------+
| d| 44|
| c| 33|
| b| 22|
| a| 11|
+----+------------------+
org.apache.spark.sql.functions.sum(Column e)
Aggregate function: returns the sum of all values in the expression.
As you can see, sum takes just one column as input, so sum(df$waiting, df$eruptions) won't work. Since you want to sum over the numeric fields, you can do sum(df("waiting") + df("eruptions")) (Scala syntax). If instead you want to sum each column individually, you can do df.agg(sum(df$waiting), sum(df$eruptions)).show
SparkR code:
library(SparkR)
df <- createDataFrame(sqlContext, faithful)
# Collect the two single-column aggregates into a list
w <- list(agg(df, sum(df$waiting)), agg(df, sum(df$eruptions)))
head(w[[1]])
head(w[[2]])
For PySpark, if you don't want to type the columns out explicitly:
from operator import add
from functools import reduce
from pyspark.sql import functions as F

# numeric_col_list is a list of the names of the columns to sum
new_df = df.withColumn('total', reduce(add, [F.col(x) for x in numeric_col_list]))
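The `reduce(add, ...)` call above simply folds a list of Column objects into a single `c1 + c2 + ...` expression. A minimal sketch of the same fold applied to plain Python values (using made-up row data mirroring the example DataFrame earlier, no Spark required) shows the arithmetic it performs per row:

```python
from functools import reduce
from operator import add

# reduce(add, [a, b, c]) builds a + b + c; Spark overloads + on
# Column objects, so the same fold builds one column expression.
rows = [{"col2": 1, "col3": 10}, {"col2": 2, "col3": 20}]
numeric_col_list = ["col2", "col3"]  # hypothetical column-name list

totals = [reduce(add, [row[c] for c in numeric_col_list]) for row in rows]
print(totals)  # [11, 22]
```

Because `operator.add` works on any type supporting `+`, the identical expression works row-wise here and column-wise in Spark.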