What is the best practice to remove URL parameters from web request logs in PySpark?

Question

I am new to PySpark and I want to exclude/remove the URL parameters from the raw weblogs held in a Spark DataFrame. The data looks like this:

+----------------------------------------------------------------------------------------+
|weblog                                                                                  |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000]                                                            |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q==               |     
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g==              |
|"HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms                         |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"                                    |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                        |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000]                                                            |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp                   | 
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?                 |
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03-                   |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |                                                                               
+----------------------------------------------------------------------------------------+

So I want to remove everything after the ? in the request, as shown below:

+----------------------------------------------------------------------------------------+
|this part should be removed from weblog                                                 |                      
+----------------------------------------------------------------------------------------+
|xxxx-client=005                                                                         |
+----------------------------------------------------------------------------------------+
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03-                   |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |                                       
+----------------------------------------------------------------------------------------+

My expected output is:

+----------------------------------------------------------------------------------------+
|weblog                                                                                  |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000]                                                            |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q==               |     
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g==              |
|"HEAD /xxxx/pub/ping? HTTP/1.1" 200 "-b" 53b 7ms                                        |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"                                    |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                        |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000]                                                            |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp                   | 
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?                 |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |                                                                               
+----------------------------------------------------------------------------------------+

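So I tried to find a fast and safe approach inspired by the snippet below, but I could not adapt it to my case, as you can see from my attempt in the Colab notebook at the end of this question: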

from urllib.parse import urlsplit, urlunsplit

def remove_query_params_and_fragment(url):
    return urlunsplit(urlsplit(url)._replace(query="", fragment=""))

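I tried the following without success; unfortunately it does not separate the part I want to remove from the rest of the line so that I can clean it up: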

import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
from urllib.parse import urlsplit

schema2 = StructType(
    [
        StructField("path", StringType(), False),
        StructField("query", ArrayType(StringType(), False), True),
        StructField("fragment", StringType(), True),
    ]
)


def _parse_url(s):
    data = urlsplit(s)
    if data[3]:
        query_params = list()
        query_params.append(data[3])
    else:
        query_params = None
    return {
        "path": "{}://{}/{}".format(data[0], data[1].rstrip("/"), data[2]),
        "query": query_params,
        "fragment": data[4],
    }


url_parse_udf = f.udf(_parse_url, schema2)

parsed = sdf.select("*", url_parse_udf(sdf["weblog"]).alias("data"))

#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|col                                                                                                                                                                                                                                                 #|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    #|
#|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

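The problem with my attempt is that it returns everything after the ? in the raw weblog. I have provided a Colab notebook for quick debugging. I was also wondering whether there is a mechanism to parse the weblog, extract the URL parameters, and then subtract one column from the other, for example: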

sdf1 = sdf.withColumn('Result', ( sdf['weblog'] - sdf['url_parameters'] ))

weblog	url_parameters	Result (weblog - url_parameters)
03/Oct/2021:09:26:37 +0000...xxxx-client=005...	xxxx-client=005	...
03/Oct/2021:00:19:24 +0000...$format=json&$...	$format=json&$...	...
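
The behaviour described above can be reproduced without Spark. The following is a minimal sketch, assuming a shortened copy of the first sample record (the names `line` and `target` are illustrative). `urlsplit` expects a bare URL, so when it is handed a whole log line it appears to treat everything after the first `?` as the query string, including the rest of the record.

```python
from urllib.parse import urlsplit

# Shortened copy of the first sample record: a whole log line, not a bare URL.
line = '[03/Oct/2021:09:26:37 +0000] "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms'

parts = urlsplit(line)

# urlsplit cuts at the first "?" and has no notion of where the URL ends,
# so the "query" also contains the protocol version, status code, and so on.
print(parts.query)

# Applied to the request target alone, the same function separates path and query.
target = line.split('"')[1].split(" ")[1]  # second token inside the first quoted field
print(urlsplit(target).path, urlsplit(target).query)
```

This suggests the request target has to be isolated from the line before any URL parsing is applied to it.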

Answer 1

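Avoid UDFs whenever possible. A Python UDF is a black box to PySpark: Spark cannot see inside it to apply its optimizations, and rows have to be serialized between the JVM and the Python workers.

Instead of UDFs, you can use PySpark's built-in SQL functions directly.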


from pyspark.sql.functions import split

# Split each line on the literal "?": [0] is the text before it,
# [1] is the text after it (up to a second "?", if any).
split_with_question_mark = split(sdf.weblog, r'\?')
param_separated_df = (
    sdf.withColumn("before_param", split_with_question_mark[0])
       .withColumn("after_param", split_with_question_mark[1])
)
param_separated_df.show(truncate=False)

Result:


+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|weblog                                                                                                                                                                                                                                                                                                                                                                                                                        |before_param                                                                                                                                                                                          |after_param                                                                                                                                                                                                                                         |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                       |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    |
|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection                             |$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

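Once you have separated the part of the line before the query string, you can split the part after it on the protocol version, i.e. HTTP/1.1, to obtain the query parameters.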

import pyspark.sql.functions as func

# Everything in after_param that precedes the protocol version "HTTP/1.1" is the query string.
with_query_param = param_separated_df.withColumn(
    "query_param", func.split(param_separated_df["after_param"], "HTTP/1.1")[0]
)
with_query_param.show(truncate=False)

Result:

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|weblog                                                                                                                                                                                                                                                                                                                                                                                                                        |before_param                                                                                                                                                                                          |after_param                                                                                                                                                                                                                                         |query_param                                                                                                                                                  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                       |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - -                                                                                                    |xxxx-client=005                                                                                                                                              |
|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection                             |$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+

All of the above changes were made in the Colab notebook you shared.
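
If only the cleaned `weblog` column from the question is needed, the intermediate columns may not be necessary. The following is a minimal sketch that is not taken from the notebook. It assumes the query string contains no literal spaces and that the request target is always followed by ` HTTP/<digit>`; the names `QUERY_PATTERN`, `strip_query`, and `strip_query_column` are illustrative. The pattern uses only constructs supported by both Python's `re` and the Java regex engine behind `regexp_replace`, so the pure-Python helper can be used to check it without a Spark session.

```python
import re

# Match "?" plus the following run of non-space characters, but only when that
# run is directly followed by " HTTP/<digit>", i.e. it sits in the request line.
QUERY_PATTERN = r"\?\S*(?= HTTP/\d)"


def strip_query(line: str) -> str:
    """Pure-Python check of the pattern: keep the '?', drop the parameters."""
    return re.sub(QUERY_PATTERN, "?", line)


def strip_query_column(df, column: str = "weblog"):
    """Apply the same pattern to a Spark DataFrame column without a UDF."""
    # Imported here so the pattern can be tested without PySpark installed.
    from pyspark.sql import functions as F

    return df.withColumn(column, F.regexp_replace(F.col(column), QUERY_PATTERN, "?"))


sample = '"HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms'
print(strip_query(sample))
```

With the DataFrame from the question, the corresponding call would be `strip_query_column(sdf).show(truncate=False)`. Lines whose request has no `?` are left unchanged, and a `?` elsewhere in the line (for example in a user agent) is not touched because it is not followed by the protocol version.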

url-parsing

url-parameters

pyspark