spark join causing column id ambiguity error
I have the following dataframes:
accumulated_results_df
|-- company_id: string (nullable = true)
|-- max_dd: string (nullable = true)
|-- min_dd: string (nullable = true)
|-- count: string (nullable = true)
|-- mean: string (nullable = true)
computed_df
|-- company_id: string (nullable = true)
|-- min_dd: date (nullable = true)
|-- max_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
I am trying to join them using spark-sql as shown below:
val resultDf = accumulated_results_df.as("a").join(computed_df.as("c"),
( $"a.company_id" === $"c.company_id" ) && ( $"c.min_dd" > $"a.max_dd" ), "left")
It fails with the error:
org.apache.spark.sql.AnalysisException: Reference 'company_id' is ambiguous, could be: a.company_id, c.company_id.;
What am I doing wrong here, and how can I fix it?
I have fixed it as follows.
val resultDf = accumulated_results_df.join(
  computed_df.withColumnRenamed("company_id", "right_company_id").as("c"),
  accumulated_results_df("company_id") === $"c.right_company_id" && $"c.min_dd" > accumulated_results_df("max_dd"),
  "left")
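Renaming works, but it leaves two copies of the key under different names. When the join key has the same name on both sides, another option (not shown in the question) is to pass the key as a `Seq` of column names; Spark then emits a single `company_id` column and the ambiguity never arises. A minimal sketch, assuming a local `SparkSession` and tiny stand-in dataframes (the data here is illustrative, not from the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("seq-join-sketch")
  .getOrCreate()
import spark.implicits._

// Stand-ins for accumulated_results_df and computed_df (illustrative data).
val accumulated = Seq(("c1", "2020-01-10")).toDF("company_id", "max_dd")
val computed    = Seq(("c1", "2020-02-01")).toDF("company_id", "min_dd")

// Joining on Seq("company_id") keeps a single company_id column,
// so later references to it are unambiguous.
val joined = accumulated.join(computed, Seq("company_id"), "left")
val rowCount = joined.count()
joined.show()

spark.stop()
```

Note that this form only expresses plain equality on the shared columns; the extra `min_dd > max_dd` predicate would have to be applied as a `filter` afterwards, which changes the left-join semantics (non-matching rows get dropped), so the aliasing approach below is more general for this question.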
You should be able to reference the aliased dataframes and their columns correctly using the col function:
val resultDf = accumulated_results_df.as("a")
  .join(
    computed_df.as("c"),
    (col("a.company_id") === col("c.company_id")) && (col("c.min_dd") > col("a.max_dd")),
    "left"
  )
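As a quick sanity check (the session setup and data below are illustrative stand-ins, not from the question), the aliased join resolves without the `AnalysisException`, because `col("a.company_id")` and `col("c.company_id")` now name distinct attributes:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("alias-join-check")
  .getOrCreate()
import spark.implicits._

// Tiny stand-ins with the same column names that caused the ambiguity.
val accumulated_results_df = Seq(("c1", "2020-01-10")).toDF("company_id", "max_dd")
val computed_df            = Seq(("c1", "2020-02-01")).toDF("company_id", "min_dd")

// With explicit aliases, each side's company_id is a distinct attribute,
// so the join condition is no longer ambiguous.
val resultDf = accumulated_results_df.as("a")
  .join(
    computed_df.as("c"),
    (col("a.company_id") === col("c.company_id")) && (col("c.min_dd") > col("a.max_dd")),
    "left"
  )
val rowCount = resultDf.count()
resultDf.show()

spark.stop()
```

Both `company_id` columns survive in the output here (one per alias); select the one you need, or drop the other, after the join.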