Spark DataFrame extract value from array with where

Question

I have a DataFrame with the following schema:

root
 |-- id: long (nullable = true)
 |-- raw_data: struct (nullable = true)
 |    |-- address_components: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- long_name: string (nullable = true)
 |    |    |    |-- short_name: string (nullable = true)
 |    |    |    |-- types: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)

Example of address_components:

{
   "address_components":[
      {
         "long_name":"Portugal",
         "short_name":"PT",
         "types":[
            "country",
            "political"
         ]
      },
      {
         "long_name":"8200-591",
         "short_name":"8200-591",
         "types":[
            "postal_code"
         ]
      }
   ]
}

I want to create a new root-level attribute, country: string, which should contain PT. However, the selection should be based on array_contains(col("types"), "country").

I got part of the way there with this:

import org.apache.spark.sql.functions.{col, expr}
val result = df.withColumn("country", expr("filter(raw_data.address_components, c -> array_contains(c.types, 'country'))"))
  .withColumn("country", col("country").getItem(0).getItem("short_name")) // short_name holds "PT"

Is there a smarter/shorter way to do this?

Answer 1

I solved it with a single SQL expression inside withColumn:

import org.apache.spark.sql.functions.expr
val result = df.withColumn("country", expr("filter(raw_data.address_components, c -> array_contains(c.types, 'country'))[0].short_name"))
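
For reference, here is a self-contained sketch of that expression applied to the sample record from the question. The case classes, the CountryExample object, the addCountry helper, and the local SparkSession settings are assumptions made for illustration and are not part of the original post; the filter higher-order function needs Spark 2.4 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical case classes mirroring the schema shown in the question.
case class AddressComponent(long_name: String, short_name: String, types: Seq[String])
case class RawData(address_components: Seq[AddressComponent])
case class Record(id: Long, raw_data: RawData)

object CountryExample {
  // Keep only the components tagged 'country', take the first match,
  // and read its short_name field.
  val CountryExpr: String =
    "filter(raw_data.address_components, c -> array_contains(c.types, 'country'))[0].short_name"

  // Hypothetical helper: adds a root-level "country" column.
  def addCountry(df: DataFrame): DataFrame =
    df.withColumn("country", expr(CountryExpr))

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("country-example")
      .getOrCreate()
    import spark.implicits._

    // The sample record from the question.
    val df = Seq(
      Record(1L, RawData(Seq(
        AddressComponent("Portugal", "PT", Seq("country", "political")),
        AddressComponent("8200-591", "8200-591", Seq("postal_code"))
      )))
    ).toDF()

    addCountry(df).select("id", "country").show()
    spark.stop()
  }
}
```

If no component carries the country type, filter returns an empty array. With ANSI mode disabled, indexing that array with [0] should yield null rather than an error, so the new column is simply null for such rows; behavior under ANSI mode may differ.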

Tags: scala, apache-spark, apache-spark-sql