Scala - Spark SQL row pattern matching on struct
I am trying to do pattern matching inside a DataFrame map function, matching each Row against a Row pattern built from nested case classes. This DataFrame is the result of a join and has the schema shown below: a few primitive-typed columns and two composite (struct) columns:
case class MyList(values: Seq[Integer])
case class MyItem(key1: String, key2: String, field1: Integer, group1: MyList, group2: MyList, field2: Integer)

// run in spark-shell (or with spark.implicits._ in scope, needed for toDF and the encoders below)
val myLine1 = MyItem("MyKey01", "MyKey02", 1, MyList(Seq(1)), MyList(Seq(2)), 2)
val myLine2 = MyItem("YourKey01", "YourKey02", 2, MyList(Seq(2, 3)), MyList(Seq(4, 5)), 20)
val dfRaw = Seq(myLine1, myLine2).toDF
dfRaw.printSchema
dfRaw.show
val df2 = dfRaw.map(r => r match {
case Row(key1: String, key2: String, field1: Integer, group1: MyList, group2: MyList, field2: Integer) => "Matched"
case _ => "Un matched"
})
df2.show
My problem is that after that map function, all I get is "Un matched":
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- field1: integer (nullable = true)
|-- group1: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- group2: struct (nullable = true)
| |-- values: array (nullable = true)
| | |-- element: integer (containsNull = true)
|-- field2: integer (nullable = true)
+---------+---------+------+--------------------+--------------------+------+
| key1| key2|field1| group1| group2|field2|
+---------+---------+------+--------------------+--------------------+------+
| MyKey01| MyKey02| 1| [WrappedArray(1)]| [WrappedArray(2)]| 2|
|YourKey01|YourKey02| 2|[WrappedArray(2, 3)]|[WrappedArray(4, 5)]| 20|
+---------+---------+------+--------------------+--------------------+------+
df2: org.apache.spark.sql.Dataset[String] = [value: string]
+----------+
| value|
+----------+
|Un matched|
|Un matched|
+----------+
If I ignore those two struct columns in the case branch (replacing group1: MyList, group2: MyList with _, _), it works:
case Row(key1: String, key2: String, field1: Integer, _, _, field2: Integer) => "Matched"
Could you please help with how to pattern match on that case class? Thanks!
struct columns are treated in Spark as org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema, so you have to define the match case as:
import org.apache.spark.sql.catalyst.expressions._
val df2 = dfRaw.map(r => r match {
case Row(key1: String, key2: String, field1: Integer, group1: GenericRowWithSchema, group2: GenericRowWithSchema, field2: Integer) => "Matched"
case _ => "Un matched"
})
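To go one step further than just labelling the row, here is a minimal sketch of extracting the nested values once the structs have matched. Since GenericRowWithSchema implements the org.apache.spark.sql.Row trait, the structs can also be matched as plain Rows and their "values" array read back with getAs (the field name comes from the schema printed above; df3, g1 and g2 are just illustrative names):

import org.apache.spark.sql.Row

// Sketch, same spark-shell session as above: match the structs as Rows,
// then read each struct's single "values" field as a sequence.
val df3 = dfRaw.map {
  case Row(key1: String, _, field1: Integer, group1: Row, group2: Row, _) =>
    val g1 = group1.getAs[Seq[Integer]]("values") // the array column inside the struct
    val g2 = group2.getAs[Seq[Integer]]("values")
    s"$key1: field1=$field1, group1=${g1.mkString(",")}, group2=${g2.mkString(",")}"
  case _ => "Un matched"
}
df3.show(false)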
And the match case defined with wildcards (_) works because the Scala compiler then makes no assumption about the type, so the org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema values bind just as well.
Defining the case as follows should also work, since the untyped group1 and group2 are variable patterns that, like wildcards, match anything:
case Row(key1: String, key2: String, field1: Integer, group1, group2, field2: Integer) => "Matched"
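As an alternative worth noting (a sketch, not part of the answer above): since dfRaw was built from MyItem in the first place, converting back with as[MyItem] gives a typed Dataset and lets you match on the case classes directly, with no Row or GenericRowWithSchema involved (df4 is an illustrative name, and this assumes the same spark-shell session):

// Sketch: recover the original case classes, then pattern match on them directly.
val df4 = dfRaw.as[MyItem].map {
  case MyItem(key1, _, field1, MyList(v1), MyList(v2), _) =>
    s"$key1: field1=$field1, group1=${v1.mkString(",")}, group2=${v2.mkString(",")}"
}
df4.show(false)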