What is the best practice for removing URL parameters from web request logs in PySpark?
I am new to PySpark, and I want to exclude/remove the URL parameters from raw web logs stored in a Spark DataFrame. The data looks like this:
+----------------------------------------------------------------------------------------+
|weblog |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== |
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== |
|"HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000] |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp |
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection? |
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03- |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |
+----------------------------------------------------------------------------------------+
So I want to remove everything after the ? (the query string), as shown below:
+----------------------------------------------------------------------------------------+
|this part should be removed from weblog |
+----------------------------------------------------------------------------------------+
|xxxx-client=005 |
+----------------------------------------------------------------------------------------+
|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime'2021-03- |
|24T00:15:05'%20and%20substringof('dude',SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
+----------------------------------------------------------------------------------------+
My expected output looks like this:
+----------------------------------------------------------------------------------------+
|weblog |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] |
|SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== |
|user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== |
|"HEAD /xxxx/pub/ping? HTTP/1.1" 200 "-b" 53b 7ms |
|"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" |
|WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - |
+----------------------------------------------------------------------------------------+
|[03/Oct/2021:00:19:24 +0000] |
|W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp |
|"GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection? |
|HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - - |
+----------------------------------------------------------------------------------------+
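So I tried to find a fast and safe approach inspired by the snippet below, but I could not adapt it to my data, as you can see in the colab notebook at the end of this question: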
from urllib.parse import urlsplit, urlunsplit

def remove_query_params_and_fragment(url):
    # Blank out both the query string and the fragment, as the name promises.
    return urlunsplit(urlsplit(url)._replace(query="", fragment=""))
I tried the following without success; unfortunately it cannot isolate the part I want to drop from the rest of the line and clean it up:
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType, StructField, StructType
from urllib.parse import urlsplit

# Return type of the UDF: a struct with the path, the query (as a list) and the fragment.
schema2 = StructType(
    [
        StructField("path", StringType(), False),
        StructField("query", ArrayType(StringType(), False), True),
        StructField("fragment", StringType(), True),
    ]
)

def _parse_url(s):
    data = urlsplit(s)
    if data[3]:
        query_params = list()
        query_params.append(data[3])
    else:
        query_params = None
    return {
        "path": "{}://{}/{}".format(data[0], data[1].rstrip("/"), data[2]),
        "query": query_params,
        "fragment": data[4],
    }

url_parse_udf = f.udf(_parse_url, schema2)
parsed = sdf.select("*", url_parse_udf(sdf["weblog"]).alias("data"))
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|col #|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
#|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - #|
#|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
#+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
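A small illustration of why the output looks like this, using a shortened, hypothetical log line in the same shape as the sample data. urlsplit expects a bare URL, so when it is handed a whole log line it simply cuts at the first ? and treats the remainder of the line as the query:

```python
from urllib.parse import urlsplit

# Hypothetical, shortened log line (not taken from the real data set).
line = 'user_abc== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms'

parts = urlsplit(line)

# Everything before the first "?" ends up in the path component...
print(parts.path)
# ...and everything after it, including the rest of the log line, in the query.
print(parts.query)
```

This suggests the URL would have to be extracted from the line before urlsplit can be applied to it.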
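The problem with my attempt is that it returns everything after the ? in the original weblog. I have provided a colab notebook for quick debugging. I was also wondering whether there is a mechanism to parse the weblog, extract the URL parameters, and then subtract the two columns, for example: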
sdf1 = sdf.withColumn('Result', ( sdf['weblog'] - sdf['url_parameters'] ))
weblog | url_parameters | Results (weblog - url_parameters) |
---|---|---|
03/Oct/2021:09:26:37 +0000...xxxx-client=005... | xxxx-client=005 | ... |
03/Oct/2021:00:19:24 +0000...$format=json&$... | $format=json&$... | ... |
Avoid UDFs whenever possible. A UDF is a black box to PySpark, so Spark cannot optimize it effectively.
Instead of UDFs, you can use PySpark's built-in SQL functions directly.
from pyspark.sql.functions import split

# split() takes a regex, so the literal "?" must be escaped; a raw string avoids
# Python's invalid-escape warning for '\?'.
split_with_question_mark = split(sdf.weblog, r'\?')
param_separated_df = (
    sdf.withColumn("before_param", split_with_question_mark[0])
    .withColumn("after_param", split_with_question_mark[1])
)
param_separated_df.show(truncate=False)
Result:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|weblog |before_param |after_param |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - |
|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection |$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
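Once you have separated the part before the query, you can split the part after it on the HTTP protocol version, i.e. HTTP/1.1, to get the query parameters.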
import pyspark.sql.functions as func

# Everything in after_param that precedes the protocol token is the query string.
separated_by_comma = param_separated_df.withColumn(
    "query_param", func.split(param_separated_df["after_param"], 'HTTP/1.1')[0]
)
separated_by_comma.show(truncate=False)
Result:
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|weblog |before_param |after_param |query_param |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - |[03/Oct/2021:09:26:37 +0000] SsAzIiWuV1Bw9CtthtxTtav8VdmP3N2jkJ/ZTsx6u8ATOC8HFwxKYmWwMrwl6t7heGKU7+Q== user_ZwfikI/2BdNcrhkwWai/bh+zX66co70YwGKAigzuLTW4khCvc1LLmFN1aBH7K0Loq8g== "HEAD /xxxx/pub/ping|xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" WepX20WkyvTydOpOuk/IDIVsxN+4zOZbRzng== 50000 - - |xxxx-client=005 |
|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection?$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|[03/Oct/2021:00:19:24 +0000] W+APDZiRZIOjc/gmklDpL95WFxwkMRGthMXLnLDxbNZ6qZA== xxxxx.xxx.xxxx.corp "GET /xxxx/d5d/data/v10/notification_events/NotifcationEventCollection |$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 HTTP/1.1" 200 "-b" 7273b 391ms "python-requests/2.25.1" soso80-emea.xxxx.corp 50001 - -|$format=json&$filter=%20%20%2%20%20StartDate%20eq%20datetime"2021-03-24T00:15:05"%20and%20substringof("dude",SystemRoles)&$expand=MailLog&$skiptoken=3701%20 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
All of the above changes were made in the colab notebook you shared.
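The steps above isolate the query string but do not yet rebuild the cleaned weblog column from the expected output. One possible way to get there in a single pass, still without a UDF, is regexp_replace. The sketch below is not part of the shared notebook: the names QUERY_PATTERN, strip_query and strip_query_column are hypothetical, and the pattern assumes that the query string contains no literal spaces (they are URL-encoded as %20) and that the request line continues with " HTTP/". The pure-Python helper only serves to check the pattern on a single line without a Spark session.

```python
import re

# Assumption: the query string has no literal spaces and is followed by " HTTP/<version>".
# "\?" matches the literal question mark, "\S*" the query string, and the lookahead
# keeps the protocol token in place.
QUERY_PATTERN = r"\?\S*(?= HTTP/)"


def strip_query(line: str) -> str:
    """Check the pattern on one log line with Python's re module."""
    return re.sub(QUERY_PATTERN, "?", line)


def strip_query_column(df, column: str = "weblog"):
    """Apply the same pattern with Spark's built-in regexp_replace (no UDF)."""
    # Imported lazily so that strip_query can be tried without Spark installed.
    from pyspark.sql import functions as F

    return df.withColumn(column, F.regexp_replace(F.col(column), QUERY_PATTERN, "?"))


sample = '"HEAD /xxxx/pub/ping?xxxx-client=005 HTTP/1.1" 200 "-b" 53b 7ms'
print(strip_query(sample))
# "HEAD /xxxx/pub/ping? HTTP/1.1" 200 "-b" 53b 7ms
```

With the DataFrame from the question, strip_query_column(sdf) would replace the weblog column in place. The ? is kept because the expected output keeps it; replacing with an empty string instead would drop it.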