Spark DataFrame: extract value from array with where
I have a DataFrame with the following schema:
root
 |-- id: long (nullable = true)
 |-- raw_data: struct (nullable = true)
 |    |-- address_components: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- long_name: string (nullable = true)
 |    |    |    |-- short_name: string (nullable = true)
 |    |    |    |-- types: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
Example of address_components:
{
  "address_components": [
    {
      "long_name": "Portugal",
      "short_name": "PT",
      "types": [
        "country",
        "political"
      ]
    },
    {
      "long_name": "8200-591",
      "short_name": "8200-591",
      "types": [
        "postal_code"
      ]
    }
  ]
}
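I want to create a new root-level attribute, Country: string, which should contain PT.
However, the selection should be based on array_contains(col("types"), "country").
This is what I have come up with so far: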
df = (
    df.withColumn("country", expr("filter(raw_data.address_components, c -> array_contains(c.types, 'country'))"))
    .withColumn("country", col("country").getItem(0).getItem("short_name"))
)
Is there a smarter/shorter way to do this?
I solved it by using a single expression inside withColumn:
df = df.withColumn("country", expr("filter(raw_data.address_components, c -> array_contains(c.types, 'country'))[0].short_name"))
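To make the selection logic explicit, here is a plain-Python sketch (not Spark code) of what that expression computes for a single row, using the sample record from the question. The helper name extract_short_name is hypothetical. Returning None when nothing matches is an assumption meant to mirror how Spark yields NULL for an out-of-range array index when ANSI mode is off.

```python
# Plain-Python sketch of the per-row logic behind
#   filter(raw_data.address_components, c -> array_contains(c.types, 'country'))[0].short_name
# extract_short_name is a hypothetical helper, not part of any Spark API.

def extract_short_name(address_components, wanted_type):
    """Return short_name of the first component whose types include wanted_type."""
    # Counterpart of filter(..., c -> array_contains(c.types, wanted_type));
    # a missing or null types list is treated as "no match".
    matches = [
        c for c in address_components
        if wanted_type in (c.get("types") or [])
    ]
    # Counterpart of [0].short_name; None stands in for SQL NULL
    # (assumption: non-ANSI behaviour for an out-of-range index).
    return matches[0].get("short_name") if matches else None


# Sample record from the question.
address_components = [
    {"long_name": "Portugal", "short_name": "PT", "types": ["country", "political"]},
    {"long_name": "8200-591", "short_name": "8200-591", "types": ["postal_code"]},
]

print(extract_short_name(address_components, "country"))      # PT
print(extract_short_name(address_components, "postal_code"))  # 8200-591
print(extract_short_name(address_components, "locality"))     # None
```

If a row can contain several components tagged 'country', both the Spark expression and this sketch keep only the first one in array order.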