How to remove the string after .com and "https://" from a URL in Python
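I need to merge two dataframes using the url as the primary key. However, the urls contain some extra strings: in df1 I have https://www.mcdonalds.com/us/en-us.html, whereas in df2 I have https://www.mcdonalds.com
I need to remove the /us/en-us.html that follows .com, as well as the https://, from the url so that I can perform the merge using url between the 2 dfs. Below is a simplified example. What would be the solution?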
df1={'url': ['https://www.mcdonalds.com/us/en-us.html','https://www.cemexusa.com/find-your-location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}
df1['url']==df2['url']
Out[7]: False
Thanks.
Use urlparse and isolate the hostname:
from urllib.parse import urlparse
urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'
URLs are not simple to parse. Have a look at the urllib module in the standard library.
Here is how to remove the path after the domain:
import urllib.parse

def remove_path(url):
    # Parse the URL, blank out its path component, and reassemble it
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

df1['url'] = df1['url'].apply(remove_path)
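Note that replacing only the path leaves any query string or fragment in place. If those should go as well, one possible variant is to rebuild the URL from just the scheme and the network location. This is a minimal sketch; the name strip_to_origin and the sample URL with ?x=1#top are made up for illustration and are not part of the original answer.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_to_origin(url):
    """Keep only scheme://host; drop the path, query and fragment.

    Hypothetical helper name, assumes the URL already has a scheme.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "", "", ""))

print(strip_to_origin("https://www.mcdonalds.com/us/en-us.html?x=1#top"))
```

Like remove_path, this assumes the scheme is present; a bare www.cemexusa.com would be treated as a path rather than a host.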
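You can use urlparse as suggested by others, or you could also use urlsplit. However, neither will handle www.cemexusa.com, because it has no scheme. So if you don't need the scheme in the key, you can use something like this: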
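from urllib.parse import urlsplit

def to_key(url):
    # Add a scheme when it is missing so that urlsplit can find the hostname
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)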
Here is a complete working example:
import pandas as pd
import io
from urllib.parse import urlsplit
df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")
df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")
df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)
def to_key(url):
    # Add a scheme when it is missing so that urlsplit can find the hostname
    if "://" not in url:  # or: not re.match(r"(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname
df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)
joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))
# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)
print(joined)
The output will be:
   Description                Key  Last Update
0    Junk Food  www.mcdonalds.com         2021
1        Cemex   www.cemexusa.com         2020
There may be other special cases that this answer does not handle. Depending on your data, you may also need to handle an omitted www:
urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com
urlsplit("https://www.realpython.com").hostname # also a valid URL
# www.realpython.com
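One way to deal with that case is to strip a leading "www." from the hostname before using it as the key. The following is only a sketch under that assumption; the name to_key_no_www is hypothetical, and treating www.example.com and example.com as the same site may not be correct for every dataset.

```python
from urllib.parse import urlsplit

def to_key_no_www(url):
    """Return the hostname without a leading 'www.' (hypothetical helper)."""
    # Add a scheme when it is missing so that urlsplit can find the hostname
    if "://" not in url:
        url = f"https://{url}"
    host = urlsplit(url).hostname or ""
    # Drop the 4-character "www." prefix if present
    return host[4:] if host.startswith("www.") else host

print(to_key_no_www("https://www.realpython.com"))
print(to_key_no_www("https://realpython.com/pandas-merge-join-and-concat"))
```

Both calls yield the same key, so rows that differ only by the www prefix would line up in the merge.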
What is the difference between urlparse and urlsplit?
It depends on your use case and on the information you want to extract. Since you don't need the URL's params, I suggest using urlsplit.
"[urlsplit()] is similar to urlparse(), but does not split the params from the URL." https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
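To make the difference concrete, here is a small comparison on a URL that carries params (the ;type=a part of the last path segment). The URL is an invented example, not one from the question.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a ";params" section on the last path segment
url = "https://www.example.com/path;type=a?q=1"

parsed = urlparse(url)
split = urlsplit(url)

# urlparse separates the params from the path
print(parsed.path, parsed.params)

# urlsplit leaves the params inside the path and has no params field
print(split.path)

# The hostname is the same either way
print(parsed.hostname, split.hostname)
```

For building a merge key from the hostname, the two functions are interchangeable; the choice only matters if you also look at the path.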