Extract HOST from URL containing braces or pipes using spark.sql parse_url()
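I need to extract the host from millions of URLs. Some of the URLs are malformed, and parse_url returns NULL for them. In many cases the problem is caused by curly braces ({}) or pipes (|); in other cases it is caused by multiple hash (#) characters.
Here is my code, including the URLs I need to parse: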
val b = Seq(
("https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}"),
("https://example.com/test.aspx?names=John|Peter"),
("https://example.com/#/test.aspx?help=John#top"),
("https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12"),
).toDF("url_col")
b.createOrReplaceTempView("temp")
spark.sql("SELECT parse_url(`url_col`, 'HOST') as HOST, url_col from temp").show(false)
Expected output:
+-----------+------------------------------------------------------------------------+
|HOST |url_col |
+-----------+------------------------------------------------------------------------+
|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|example.com|https://example.com/test.aspx?names=John|Peter |
|example.com|https://example.com/#/test.aspx?help=John#top |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12 |
+-----------+------------------------------------------------------------------------+
Current output:
+-----------+------------------------------------------------------------------------+
|HOST |url_col |
+-----------+------------------------------------------------------------------------+
|null |https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
|null |https://example.com/test.aspx?names=John|Peter |
|null |https://example.com/#/test.aspx?help=John#top |
|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12 |
+-----------+------------------------------------------------------------------------+
Is there a way to force parse_url to return the host when the URL contains invalid characters or is malformed? Or is there a better approach?
You can use the regexp_extract function to extract the domain (the regex below is an example):
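// Changes to the original pattern: the userinfo group stops at "/", the dot after
// "www" is matched literally, and the host stops at "#" as well as ":", "/" and "?".
spark.sql("""
SELECT regexp_extract(url_col, "^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www[.])?([^:\/\n?#]+)", 1) as HOST,
url_col
FROM temp
""").show(false)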
//+-----------+------------------------------------------------------------------------+
//|HOST |url_col |
//+-----------+------------------------------------------------------------------------+
//|example.com|https://example.com/test.aspx?doc={1A23B4C5-67D8-9012-E3F4-A5B67890CD12}|
//|example.com|https://example.com/test.aspx?names=John|Peter |
//|example.com|https://example.com/#/test.aspx?help=John#top |
//|example.com|https://example.com/test.aspx?doc=1A23B4C5-67D8-9012-E3F4-A5B67890CD12 |
//+-----------+------------------------------------------------------------------------+
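As far as I can tell, parse_url relies on java.net.URI, which rejects characters such as {, } and | as well as a second # in the fragment, and that would explain the NULLs. The sketch below reproduces this in plain Scala and falls back to a regex only when the strict parse fails. The names HostExtractor, strictHost, regexHost and host are made up for this illustration, and unlike the regex above this one keeps a leading "www." so that both paths return the same kind of value.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper, not part of Spark.
object HostExtractor {
  // Optional http(s) scheme, optional userinfo, then the host up to the first
  // ":", "/", "?", "#" or newline.
  private val HostPattern = """^(?:https?://)?(?:[^@/\n]+@)?([^:/\n?#]+)""".r

  /** Host according to java.net.URI; None if the string is rejected or has no host. */
  def strictHost(url: String): Option[String] =
    Try(new URI(url)).toOption.flatMap(u => Option(u.getHost))

  /** Lenient host taken from the text of the URL; None if nothing matches. */
  def regexHost(url: String): Option[String] =
    HostPattern.findFirstMatchIn(url).map(_.group(1))

  /** Prefer the strict parse and use the regex only for URLs it cannot handle. */
  def host(url: String): Option[String] =
    Option(url).flatMap(u => strictHost(u).orElse(regexHost(u)))
}
```

Assuming a SparkSession named spark, this could be exposed to SQL with spark.udf.register("safe_host", (u: String) => HostExtractor.host(u).orNull). A UDF is usually slower than the built-in regexp_extract, so for millions of rows the pure SQL version is probably the better default.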