Spark cross join: two similar pieces of code, one works, one doesn't
I have the following code:
import org.apache.spark.sql.functions.lit
import spark.implicits._

val ori0 = Seq(
  (0L, "1")
).toDF("id", "col1")
val date0 = Seq(
  (0L, "1")
).toDF("id", "date")

val joinExpression = $"col1" === $"date"
ori0.join(date0, joinExpression).show()

val ori = spark.range(1).withColumn("col1", lit("1"))
val date = spark.range(1).withColumn("date", lit("1"))
ori.join(date, joinExpression).show()
The first join works, but the second one fails with:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Range (0, 1, step=1, splits=Some(4))
and
Project [_1#11L AS id#14L, _2#12 AS date#15]
+- Filter (isnotnull(_2#12) && (1 = _2#12))
+- LocalRelation [_1#11L, _2#12]
Join condition is missing or trivial.
I have looked at this many times and I can't see why the second one is a cross join. What is the difference between them?
If you expand the second join, you will see that it is effectively equivalent to:
SELECT *
FROM ori JOIN date
WHERE 1 = 1
Clearly the WHERE 1 = 1 join condition is trivial, and a trivial condition is one of the things Spark checks for when detecting a Cartesian product. It becomes trivial here because both col1 and date are built with lit("1"), so the optimizer constant-folds $"col1" === $"date" into an always-true predicate.
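As a quick sanity check (a minimal sketch, continuing the session above and assuming Spark 2.x defaults), you can reproduce the same error by passing an explicitly trivial condition, which confirms that it is the folded-to-constant predicate that trips the check:

// A condition that is literally `true` references neither side of the
// join, so Spark rejects it the same way as the folded predicate:
ori.join(date, lit(true)).show()
// org.apache.spark.sql.AnalysisException: Detected implicit cartesian product ...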
That is not the case for the first join: at that point the optimizer cannot infer that the join columns contain only a single value, so it keeps the equality predicate and will try to apply a hash or sort-merge join.
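If the Cartesian product is actually what you want, you can request it explicitly instead of relying on a join condition. A minimal sketch, assuming Spark 2.x (in Spark 3.x the implicit cross-join check is relaxed by default):

// Option 1: ask for the cross join explicitly.
ori.crossJoin(date).show()

// Option 2: allow implicit cross joins for the whole session.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
ori.join(date, joinExpression).show()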