Scala DataFrame column: split URL parameters into new columns

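One column of my DataFrame holds URLs, and I need to parse the parameters out of the query string and create a new column for each of them.

The parameters are sometimes present and sometimes not, and they come in no guaranteed order, so I need to be able to find them by name. I'm writing this in Scala but can't get the syntax right, and would appreciate some help.

My code: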


val df = Seq(
  (1, "https://www.mywebsite.com/dummyurl/single?originlatitude=35.0133612060147&originlongitude=-116.156211232302&origincountrycode=us&originstateprovincecode=ca&origincity=boston&originradiusmiles=250&datestart=2021-12-23t00%3a00%3a00"),
  (2, "https://www.mywebsite.com/dummyurl/single?originlatitude=19.9141319141121&originlongitude=-56.1241881401291&origincountrycode=us&originstateprovincecode=pa&origincity=york&originradiusmiles=100&destinationlatitude=40.7811012268066&destinationlon")
).toDF("key", "URL")

val result = df

// .withColumn("param_name", $"URL")
.withColumn("parsed_url", explode(split(expr("parse_url(URL, 'QUERY')"), "&")))
.withColumn("parsed_url2", split($"parsed_url", "="))
// .withColumn("exampletest",$"URL".map(kv: String => (kv.split("=")(0), kv.split("=")(1))) )
.withColumn("Search_OriginLongitude", split($"URL","\?"))
.withColumn("Search_OriginLongitude2", split($"Search_OriginLongitude"(1),"&"))

//   .map(kv: Any => (kv.split("=")(0), kv.split("=")(1)))
//   .toMap
//   .get("originlongitude"))

display(result)

Desired result:

+---+--------------------+--------------------+--------------------+
|KEY|                 URL|     originlatitude |    originlongitude |
+---+--------------------+--------------------+--------------------+
|  1|https://www.myweb...| 35.0133612060147   | -116.156211232302  |
|  2|https://www.myweb...| 19.9141319141121   | -56.1241881401291  |
+---+--------------------+--------------------+--------------------+

The parse_url function can take a third argument, key, naming the query parameter you want to extract, like this:

import org.apache.spark.sql.functions.expr

// Each call extracts one named parameter; a parameter missing from the URL yields null.
val result = df
  .withColumn("Search_OriginLatitude", expr("parse_url(URL, 'QUERY', 'originlatitude')"))
  .withColumn("Search_OriginLongitude", expr("parse_url(URL, 'QUERY', 'originlongitude')"))

result.show
//+---+--------------------+---------------------+----------------------+
//|key|                 URL|Search_OriginLatitude|Search_OriginLongitude|
//+---+--------------------+---------------------+----------------------+
//|  1|https://www.myweb...|     35.0133612060147|     -116.156211232302|
//|  2|https://www.myweb...|     19.9141319141121|     -56.1241881401291|
//+---+--------------------+---------------------+----------------------+
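
If many parameters are needed, the same parse_url call can be folded over a list of names instead of being written out once per column. The following is only a sketch under assumptions: the object name ParseUrlParams, the helper withQueryParams, the local SparkSession settings and the shortened sample URLs are all made up for illustration, and the parameter names are assumed to be plain identifiers with no quotes or regex metacharacters.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object ParseUrlParams {
  // Hypothetical helper: adds one column per name in `params`, each extracted
  // with parse_url(<urlCol>, 'QUERY', <name>). A missing parameter yields null.
  def withQueryParams(df: DataFrame, urlCol: String, params: Seq[String]): DataFrame =
    params.foldLeft(df) { (acc, p) =>
      acc.withColumn(p, expr(s"parse_url($urlCol, 'QUERY', '$p')"))
    }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; in a notebook `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("parse-url-params").getOrCreate()
    import spark.implicits._

    // Shortened sample URLs (assumed), in the shape of the ones in the question.
    val df = Seq(
      (1, "https://www.mywebsite.com/dummyurl/single?originlatitude=35.0133612060147&originlongitude=-116.156211232302"),
      (2, "https://www.mywebsite.com/dummyurl/single?origincity=york&originradiusmiles=100")
    ).toDF("key", "URL")

    withQueryParams(df, "URL", Seq("originlatitude", "originlongitude", "origincity")).show(false)
    spark.stop()
  }
}
```

Because each column is looked up by name, the order of the parameters in the URL does not matter, and rows that lack a parameter simply get null in that column.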

Alternatively, you can use the str_to_map function to build a parameter -> value map, as follows:

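import org.apache.spark.sql.functions.{col, expr}

// Note: this overwrites the URL column with a map of parameter name -> value.
val result = df
  .withColumn("URL", expr("str_to_map(split(URL,'[?]')[1],'&','=')"))
  .withColumn("Search_OriginLatitude", col("URL").getItem("originlatitude"))
  .withColumn("Search_OriginLongitude", col("URL").getItem("originlongitude"))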
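
To see what such a map holds for a single URL, the same split-on-`?`, split-on-`&`, split-on-`=` logic can be sketched in plain Scala, with no Spark involved. QueryParams.parse is a hypothetical helper written for this illustration, not a library function. It also percent-decodes the values with java.net.URLDecoder, which str_to_map does not do; that step is an addition here, and note that URLDecoder turns `+` into a space.

```scala
import java.net.URLDecoder

object QueryParams {
  // Hypothetical helper: takes the text after the first '?', splits it on '&',
  // then splits each pair on the first '='. A pair with no '=' (for example a
  // truncated trailing parameter) maps to None.
  def parse(url: String): Map[String, Option[String]] = {
    val query = url.split("[?]", 2) match {
      case Array(_, q) => q
      case _           => "" // no query string at all
    }
    query.split("&").filter(_.nonEmpty).map { pair =>
      pair.split("=", 2) match {
        case Array(k, v) => k -> Some(URLDecoder.decode(v, "UTF-8"))
        case Array(k)    => k -> None
      }
    }.toMap
  }
}
```

For example, with a URL shaped like the ones in the question:

```scala
val m = QueryParams.parse(
  "https://www.mywebsite.com/dummyurl/single?originlatitude=35.0133612060147&datestart=2021-12-23t00%3a00%3a00&destinationlon")
m("originlatitude")   // Some("35.0133612060147")
m("datestart")        // Some("2021-12-23t00:00:00")
m("destinationlon")   // None
m.get("origincity")   // None, the key is absent
```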