How to explode a subset of data in Spark?

I have been trying to reproduce the Zeppelin notebook Magellan example, but with assets that carry geolocation information (the assets DF), which I am trying to map to zip codes (the zipcodes DF). I obtained the zip code shapefile from the USGS and loaded it into Spark.

Here is what the assets DF looks like. It consists of an asset ID and a point on the map.

+---------+--------------------+
|    asset|               point|
+---------+--------------------+
|       10|Point(-96.7595319...|
|       11|Point(4.7115951, ...|
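
For reference, here is a minimal sketch of how such an assets DF might be built with Magellan's expression DSL; the (asset, longitude, latitude) triples are made up, and `toDF` assumes `sqlContext.implicits._` is in scope:

import org.apache.spark.sql.magellan.dsl.expressions._

// Hypothetical source data: (asset id, longitude, latitude) triples.
val assets = sc.parallelize(Seq(
    (10, -96.7595319, 33.0),  // latitudes here are made up
    (11, 4.7115951, 52.0)
  ))
  .toDF("asset", "x", "y")
  // point() builds a Magellan Point column from x/y coordinates.
  .select($"asset", point($"x", $"y").as("point"))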

The zipcodes DF is built from the USGS shapefile of US zip codes. Here is how I read it in with Magellan:

val zipcodes = magellanContext.read
  .format("magellan")
  .load("magellan_us_states")
  .select($"polygon", $"metadata")
  .cache()

The resulting zipcodes DF looks like this:

+--------------------+--------------------+
|             polygon|            metadata|
+--------------------+--------------------+
|Polygon(5, Wrappe...|Map(YEAR_ADM ->  ...|
|Polygon(5, Wrappe...|Map(YEAR_ADM ->  ...|
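
If it is unclear which key in the metadata map holds the state name, one hypothetical inspection step is to list the distinct keys (the key names depend entirely on the shapefile's attributes):

import org.apache.spark.sql.functions.explode

// Explode the map into (key, value) rows, then list the distinct keys
// so you can spot the one that holds the state name.
zipcodes
  .select(explode($"metadata").as(Seq("key", "value")))
  .select($"key")
  .distinct()
  .show()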

I then join the two DFs and run a query against them:

val joined = zipcodes
  .join(assets)
  .where($"point" within $"polygon")
  .select($"asset", explode($"metadata").as(Seq("k", "v")))
  .withColumnRenamed("v", "state")
  .drop("k")
  .cache()

The result looks like this:

+--------+--------------------+
|  asset#|               state|
+--------+--------------------+
|10      |Arizona             |
|10      |                  48|
|10      |                1903|
|10      |                  04|
|10      |              23.753|
|10      |  February          |
|10      |                1912|
|10      |              28.931|
|10      |                  14|
|11      |North Carolina      |
...

The problem is that exploding the metadata map yields one row per key/value pair, when all I want is the state. How do I explode that data so that I end up with a table that looks like this:

+--------+--------------------+
|  asset#|               state|
+--------+--------------------+
|10      |Arizona             |
|11      |North Carolina      |
|12      |Arizona             |
...

Simply don't use explode at all. Instead, just select the field you are interested in:

df.select($"asset", $"metadata".getItem("state").alias("state"))
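
For completeness, a sketch of the whole join with that projection applied; the map key "state" is taken from the snippet above and is an assumption, so adjust it to whatever key your shapefile's metadata actually uses:

val joined = zipcodes
  .join(assets)
  .where($"point" within $"polygon")
  // Pull just the state entry out of the metadata map instead of
  // exploding every key/value pair into its own row.
  .select($"asset", $"metadata".getItem("state").alias("state"))
  .cache()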