Scala DataFrame column: split URL parameters into new columns
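One column of my DataFrame contains URL data, and I need to parse the parameters out of the query string and create new columns for them.
The parameters are sometimes present and sometimes not, and they come in no guaranteed order, so I need to be able to find them by name. I am writing this in Scala but cannot get the syntax right, and would appreciate some help.
My code: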
val df = Seq(
(1, "https://www.mywebsite.com/dummyurl/single?originlatitude=35.0133612060147&originlongitude=-116.156211232302&origincountrycode=us&originstateprovincecode=ca&origincity=boston&originradiusmiles=250&datestart=2021-12-23t00%3a00%3a00"),
(2, "https://www.mywebsite.com/dummyurl/single?originlatitude=19.9141319141121&originlongitude=-56.1241881401291&origincountrycode=us&originstateprovincecode=pa&origincity=york&originradiusmiles=100&destinationlatitude=40.7811012268066&destinationlon")
).toDF("key", "URL")
val result = df
  // .withColumn("param_name", $"URL")
  .withColumn("parsed_url", explode(split(expr("parse_url(URL, 'QUERY')"), "&")))
  .withColumn("parsed_url2", split($"parsed_url", "="))
  // .withColumn("exampletest",$"URL".map(kv: String => (kv.split("=")(0), kv.split("=")(1))) )
  .withColumn("Search_OriginLongitude", split($"URL", "\\?"))
  .withColumn("Search_OriginLongitude2", split($"Search_OriginLongitude"(1), "&"))
  // .map(kv: Any => (kv.split("=")(0), kv.split("=")(1)))
  // .toMap
  // .get("originlongitude"))
display(result)
Desired result:
+---+--------------------+--------------------+--------------------+
|KEY| URL| originlatitude | originlongitude |
+---+--------------------+--------------------+--------------------+
| 1|https://www.myweb...| 35.0133612060147 | -116.156211232302 |
| 2|https://www.myweb...| 19.9141319141121 | -56.1241881401291 |
+---+--------------------+--------------------+--------------------+
The parse_url function can take a third argument, key, which names the query parameter you want to extract, like this:
import org.apache.spark.sql.functions.expr

// Name each new column after the parameter it holds, matching the desired result.
val result = df
  .withColumn("originlatitude", expr("parse_url(URL, 'QUERY', 'originlatitude')"))
  .withColumn("originlongitude", expr("parse_url(URL, 'QUERY', 'originlongitude')"))
result.show
//+---+--------------------+----------------+-----------------+
//|key|                 URL|  originlatitude|  originlongitude|
//+---+--------------------+----------------+-----------------+
//|  1|https://www.myweb...|35.0133612060147|-116.156211232302|
//|  2|https://www.myweb...|19.9141319141121|-56.1241881401291|
//+---+--------------------+----------------+-----------------+
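Since the parameters are only sometimes present, note that parse_url returns null when the requested key is not in the query string. If you need more than a couple of parameters, the same call can be folded over a list of names. The following is a minimal sketch, not a definitive implementation: the object ParseUrlParams, the helper addParamColumns, and the shortened sample URLs are made up for illustration, and it assumes a local Spark session and parameter names that come from trusted code (they are interpolated into the SQL expression).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object ParseUrlParams {
  // Hypothetical helper: adds one column per requested query parameter.
  // Assumes the input has a string column named "URL", as in the question.
  def addParamColumns(df: DataFrame, names: Seq[String]): DataFrame =
    names.foldLeft(df) { (acc, name) =>
      // parse_url yields null when `name` is absent from the query string.
      acc.withColumn(name, expr(s"parse_url(URL, 'QUERY', '$name')"))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parse-url-params")
      .getOrCreate()
    import spark.implicits._

    // URLs shortened from the question's sample data.
    val df = Seq(
      (1, "https://www.mywebsite.com/dummyurl/single?originlatitude=35.0133612060147&originlongitude=-116.156211232302"),
      (2, "https://www.mywebsite.com/dummyurl/single?originlatitude=19.9141319141121&originlongitude=-56.1241881401291&destinationlatitude=40.7811012268066")
    ).toDF("key", "URL")

    // Row 1 has no destinationlatitude, so that column is null for it.
    addParamColumns(df, Seq("originlatitude", "originlongitude", "destinationlatitude"))
      .drop("URL")
      .show()

    spark.stop()
  }
}
```

Folding over the names keeps the parameter list in one place, so adding another column means adding one string.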
Alternatively, you can use the str_to_map function to build a parameter -> value map, as shown below:
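import org.apache.spark.sql.functions.{col, expr}

val result = df
  // Build the map in a helper column so the original URL column is preserved.
  .withColumn("params", expr("str_to_map(split(URL,'[?]')[1],'&','=')"))
  .withColumn("originlatitude", col("params").getItem("originlatitude"))
  .withColumn("originlongitude", col("params").getItem("originlongitude"))
  .drop("params")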
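To make the map idea concrete outside of Spark, here is a rough plain-Scala sketch of the same split-on-'&'-then-'=' logic. It is only an illustration, not how str_to_map is implemented: the object QueryParams and its parse method are hypothetical names, and mapping a parameter that has no '=' to an empty string is a choice made for this sketch.

```scala
object QueryParams {
  /** Splits the query string of `url` into a name -> value map. */
  def parse(url: String): Map[String, String] = {
    // Everything after the first '?' is the query string; none means no parameters.
    val query = url.split("[?]", 2) match {
      case Array(_, q) => q
      case _           => ""
    }
    query.split("&").iterator
      .filter(_.nonEmpty)
      .map { pair =>
        // Limit of 2 keeps any further '=' characters inside the value.
        pair.split("=", 2) match {
          case Array(k, v) => k -> v
          case Array(k)    => k -> "" // parameter without '=' (sketch-specific choice)
        }
      }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val params = parse(
      "https://www.mywebsite.com/dummyurl/single?originlatitude=19.9141319141121&originlongitude=-56.1241881401291")
    println(params.get("originlongitude"))     // Some(-56.1241881401291)
    println(params.get("destinationlatitude")) // None
  }
}
```

Looking values up with get returns an Option, which handles parameters that are absent or appear in a different order. Inside a DataFrame, the built-in functions above are usually the simpler choice.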