When are Keys Not Sortable in Sort Merge Join in Spark?

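When I was reading about Sort Merge Join, the article said it is the most preferred join strategy in Spark after broadcast join, but only if the join keys are sortable. My question is: when can a join key be non-sortable? As far as I can tell, any data type can be sorted. Can you help me understand a scenario in which a key might not be sortable?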

See https://www.waitingforcode.com/apache-spark-sql/sort-merge-join-spark-sql/read (a great site).

Not all types are sortable. CalendarIntervalType is one example.

Quote:

"for not sortable keys the sort merge join" should "not be used" in {
import sparkSession.implicits._
// Here we explicitly define the schema. Thanks to that we can show
// the case when sort-merge join won't be used, i.e. when the key is not sortable
// (there are other cases - when broadcast or shuffle joins can be chosen over sort-merge
//  but it's not shown here).
// Globally, a "sortable" data type is:
// - NullType, one of AtomicType
// - StructType having all fields sortable
// - ArrayType typed to sortable field
// - User Defined DataType backed by a sortable field
// The method checking sortability is org.apache.spark.sql.catalyst.expressions.RowOrdering.isOrderable
// As you see, CalendarIntervalType is not included in any of above points,
// so even if the data structure is the same (id + login for customers, id + customer id + amount for orders)
// with exactly the same number of rows, the sort-merge join won't be applied here.
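
To make the rule in those comments concrete, here is a minimal, self-contained sketch of the recursive check they describe. It does not use Spark: `SqlType`, `IntT`, `StringT`, `CalendarIntervalT`, `MapT` and the other names are hypothetical stand-ins invented for illustration, and `Sortability.isOrderable` only imitates the behaviour the quote attributes to `RowOrdering.isOrderable`. Treating a map type as non-sortable is an assumption of this model, made because maps do not appear in the quoted list.

```scala
// Hypothetical, simplified model of Spark SQL data types (NOT the real Spark classes).
sealed trait SqlType
case object NullT extends SqlType
case object IntT extends SqlType              // stands in for any atomic type
case object StringT extends SqlType           // stands in for any atomic type
case object CalendarIntervalT extends SqlType // not in the "sortable" list
final case class StructT(fields: List[SqlType]) extends SqlType
final case class ArrayT(element: SqlType) extends SqlType
final case class MapT(key: SqlType, value: SqlType) extends SqlType // assumed non-sortable
final case class UserDefinedT(underlying: SqlType) extends SqlType

object Sortability {
  // Mirrors the four bullet points from the quoted comment, applied recursively.
  def isOrderable(t: SqlType): Boolean = t match {
    case NullT | IntT | StringT   => true                        // null or atomic
    case StructT(fields)          => fields.forall(isOrderable)  // every field sortable
    case ArrayT(element)          => isOrderable(element)        // element type sortable
    case UserDefinedT(underlying) => isOrderable(underlying)     // backing type sortable
    case _                        => false                       // everything else
  }
}
```

In this model a key of type `StructT(List(IntT, StringT))` is sortable, while a single `CalendarIntervalT` field anywhere inside the key makes the whole key non-sortable, which is the situation the quoted test sets up.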

Note that this is an old post; as of Spark v3, comparison is possible: https://spark.apache.org/docs/3.0.0/api/scala/org/apache/spark/sql/types/CalendarIntervalType.html

But it proves the point.

Also, what about non-equi joins?