Significance of nullable in Spark DataFrame StructField
What is the significance of nullable?
case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean = true,
    metadata: Metadata = Metadata.empty) {
From the docs:
StructField(name, dataType, nullable): Represents a field in a
StructType. The name of a field is indicated by name. The data type of
a field is indicated by dataType. nullable is used to indicate if
values of this fields can have null values.
Is it purely informational? I don't see it enforcing non-null values (or am I missing something?).
Program:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Column spec in the form name:type:nullable
val cols = "firstName:String:false,middlename:String:true,lastName:String:false,zipCode:String:false,sex:String:false,salary:Int:true"

// Turn one "name:type:nullable" entry into a StructField
def inferType(field: String): StructField = {
  val splits = field.split(":")
  val colName = splits(0)
  val nullable = splits(2).toBoolean
  val dataType = splits(1).toUpperCase() match {
    case "INT" => IntegerType
    case "DOUBLE" => DoubleType
    case "STRING" => StringType
    case _ => StringType // fall back to string for unknown types
  }
  StructField(colName, dataType, nullable)
}

val schema: StructType = StructType(cols
  .split(",")
  .map(col => inferType(col)))

// Note: the blank values below are empty strings, not nulls
val simpleData = Seq(
  Row("Soumya", "", "Kole", "36636", "M", -1),
  Row("Foo", "Bar", "", "", "", 9000)
)

// `spark` is the SparkSession provided by spark-shell
val rdd = spark.sparkContext.parallelize(simpleData)
val df = spark.createDataFrame(rdd, schema)
df.printSchema()
df.show()
Output:
root
|-- firstName: string (nullable = false)
|-- middlename: string (nullable = true)
|-- lastName: string (nullable = false)
|-- zipCode: string (nullable = false)
|-- sex: string (nullable = false)
|-- salary: integer (nullable = true)
+---------+----------+--------+-------+---+------+
|firstName|middlename|lastName|zipCode|sex|salary|
+---------+----------+--------+-------+---+------+
|   Soumya|          |    Kole|  36636|  M|    -1|
|      Foo|       Bar|        |       |   |  9000|
+---------+----------+--------+-------+---+------+
The blanks are empty strings, not NULLs. They are different.
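To make the difference concrete, here is a minimal sketch that builds a one-row DataFrame against a non-nullable column, once with an empty string and once with an actual null. The object name `NullableCheck`, the helper `tryBuild`, and the local `SparkSession` settings are assumptions made for illustration; they are not part of the question. In the Spark versions I am aware of, the null case tends to fail when the rows are evaluated, while the empty string passes, but this check applies to the `createDataFrame(rdd, schema)` path and may not hold for other ways of attaching a schema.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import scala.util.Try

object NullableCheck {
  // Local session for illustration only; in spark-shell, `spark` already exists.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("nullable-check")
    .getOrCreate()

  // A single non-nullable string column, mirroring firstName in the question.
  val schema: StructType = StructType(Seq(
    StructField("firstName", StringType, nullable = false)))

  // Hypothetical helper: builds a one-row DataFrame and forces evaluation
  // with collect(), capturing any exception in a Try.
  def tryBuild(firstName: String): Try[Int] = Try {
    val rdd = spark.sparkContext.parallelize(Seq(Row(firstName)))
    spark.createDataFrame(rdd, schema).collect().length
  }

  def main(args: Array[String]): Unit = {
    // An empty string is a real value, so the non-nullable column accepts it.
    println(s"empty string: ${tryBuild("")}")
    // A null in a non-nullable column is expected to be rejected at evaluation time.
    println(s"null value:   ${tryBuild(null)}")
    spark.stop()
  }
}
```

If null-like blanks should be treated as missing data, one option is to convert empty strings to null before building the rows, in which case the affected columns would need `nullable = true`.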