Calculating Standard Deviation of a DataFrame results in an Error
I am trying to calculate the standard deviation of a column in a DataFrame, but when I do, I get a failure like the following:
[info] - should return the standard deviation for all columns in a DataFrame *** FAILED *** (51 milliseconds)
[info] org.apache.spark.sql.AnalysisException: cannot resolve '`value_6`' given input columns: [stddev_samp(value_6)];
[info] 'Project ['value_6]
[info] +- Aggregate [stddev_samp(value_6#131) AS stddev_samp(value_6)#151]
[info] +- Project [coalesce(nanvl(value_6#60, cast(null as double)), cast(0 as double)) AS value_6#131]
[info] +- Project [value_6#60]
[info] +- Project [_1#39 AS id#54, _2#40 AS value_1#55, _3#41 AS value_2#56, _4#42 AS value_3#57, _5#43 AS value_4#58, _6#44 AS value_5#59, _7#45 AS value_6#60]
[info] +- LocalRelation [_1#39, _2#40, _3#41, _4#42, _5#43, _6#44, _7#45]
[info] at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis.applyOrElse(CheckAnalysis.scala:155)
[info] at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis.applyOrElse(CheckAnalysis.scala:152)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp(TreeNode.scala:341)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:341)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp(QueryPlan.scala:104)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions(QueryPlan.scala:116)
[info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression(QueryPlan.scala:116)
Here is my code:
def standardDeviationForColumn(df: DataFrame, columnName: String): DataFrame =
  df.select(columnName).na.fill(0).agg(stddev(columnName))
This is how I call it:
assert(DataFrameUtils.standardDeviationForColumn(randomNumericTestData, "value_6").select("value_6").first().getDouble(0) === 1, "TODO")
What am I doing wrong? Here is my DataFrame:
val randomNumericTestData: DataFrame = Seq(
  (1, 1, 10.0, 10.0, 10.0, 10.0, 10.0),
  (2, 0, 12.0, 12.0, 12.0, 12.0, 12.0),
  (3, 1, 13.0, 13.0, 13.0, 13.0, 13.0),
  (4, 1, 14.0, 14.0, 14.0, 14.0, 14.0),
  (5, 0, 12.5, 12.5, 12.5, 12.5, 12.5),
  (6, 1, 11.5, 11.5, 11.5, 11.5, 11.5),
  (7, 0, 17.5, 17.5, 17.5, 17.5, 17.5),
  (8, 0, 13.6, 13.6, 13.6, 13.6, 13.6),
  (9, 1, 14.2, 14.2, 14.2, 14.2, 14.2)
).toDF("id", "value_1", "value_2", "value_3", "value_4", "value_5", "value_6")
The clue is in the error message: org.apache.spark.sql.AnalysisException: cannot resolve 'value_6' given input columns: [stddev_samp(value_6)]. Spark cannot find a column named value_6. When you call ...agg(stddev(columnName)), the output DataFrame gets a new column with a Spark-generated name (here stddev_samp(value_6), as the error message shows), not value_6. You need to rename the aggregated column:
def standardDeviationForColumn(df: DataFrame, columnName: String): DataFrame =
  df.select(columnName).na.fill(0).agg(stddev(columnName) as columnName)
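With the alias in place, the select("value_6") in your test resolves. As a minimal sanity check (a sketch, assuming the same SparkSession, implicits, and DataFrameUtils setup as in your test):
val sd = DataFrameUtils
  .standardDeviationForColumn(randomNumericTestData, "value_6")
  .select("value_6") // resolves now that the aggregate column is aliased
  .first()
  .getDouble(0)
println(sd) // about 2.1 for this data, so the === 1 assertion would still fail on the value itself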
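Since the test is named "should return the standard deviation for all columns in a DataFrame", you may eventually want a single aggregation over every numeric column. A sketch of how that could look with the same aliasing trick (standardDeviationForAllColumns is a hypothetical helper, not something from your DataFrameUtils):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.stddev
import org.apache.spark.sql.types.DoubleType

// Sample standard deviation of every double column, keeping the original
// column names via aliasing. Assumes the DataFrame has at least one
// double-typed column.
def standardDeviationForAllColumns(df: DataFrame): DataFrame = {
  val doubleCols = df.schema.fields.collect {
    case f if f.dataType == DoubleType => f.name
  }
  df.na.fill(0).agg(
    stddev(doubleCols.head) as doubleCols.head,
    doubleCols.tail.map(c => stddev(c) as c): _*
  )
}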