How to remove the string after .com and "https://" from a URL in Python

Question

I need to merge two DataFrames using the url as the key. However, the urls contain some extra strings: in df1 I have https://www.mcdonalds.com/us/en-us.html, whereas in df2 I have https://www.mcdonalds.com

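I need to remove the /us/en-us.html after the .com, and the https://, from the url so that I can perform the merge between the 2 dfs. Below is a simplified example. What would be the solution for this?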

df1={'url': ['https://www.mcdonalds.com/us/en-us.html','https://www.cemexusa.com/find-your-location']}
df2={'url':['https://www.mcdonalds.com','www.cemexusa.com']}

df1['url']==df2['url']
Out[7]: False

Thanks.

Answer 1

Use urlparse and isolate the hostname:

from urllib.parse import urlparse

urlparse('https://www.mcdonalds.com/us/en-us.html').hostname
# 'www.mcdonalds.com'
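
For a merge key it may also matter that hostname is normalized while netloc is not. The sketch below compares the two; the mixed-case URL with an explicit port is a made-up input for illustration, not a value from the question.

```python
from urllib.parse import urlparse

# Hypothetical input: mixed case and an explicit port, for illustration only.
parsed = urlparse('HTTPS://WWW.McDonalds.com:443/us/en-us.html')

netloc = parsed.netloc      # keeps the original case and the port
hostname = parsed.hostname  # lowercased, with the port stripped

print(netloc)    # WWW.McDonalds.com:443
print(hostname)  # www.mcdonalds.com
```

If the two DataFrames spell the same host with different casing, hostname is probably the safer field to join on.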

Answer 2

URLs are not trivial to parse. Have a look at the urllib module in the standard library.

Here is how you could remove the path after the domain:

import urllib.parse

def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

df1['url'] = df1['url'].apply(remove_path)
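
One caveat, shown as a minimal sketch: remove_path relies on the URL having a scheme. Without a leading "//", urlparse treats the whole string as the path, so the scheme-less value from df2 would come back empty. The function is repeated here only so the snippet runs on its own.

```python
import urllib.parse

def remove_path(url):
    parsed = urllib.parse.urlparse(url)
    parsed = parsed._replace(path='')
    return urllib.parse.urlunparse(parsed)

# With a scheme, only the path is dropped.
with_scheme = remove_path('https://www.mcdonalds.com/us/en-us.html')

# Without a scheme, 'www.cemexusa.com' is parsed as a path, so nothing is left.
without_scheme = remove_path('www.cemexusa.com')

print(repr(with_scheme))     # 'https://www.mcdonalds.com'
print(repr(without_scheme))  # ''
```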

Answer 3

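You can use urlparse as suggested by others, or you could also use urlsplit. However, neither will handle www.cemexusa.com, because it has no scheme. So if you do not need the scheme in your key, you could use something like this: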

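from urllib.parse import urlsplit

def to_key(url):
    # Add a scheme when it is missing so that urlsplit can find the hostname.
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname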

df1["Key"] = df1["URL"].apply(to_key)

Here is a complete working example:

import pandas as pd
import io

from urllib.parse import urlsplit

df1_data = io.StringIO("""
URL,Description
https://www.mcdonalds.com/us/en-us.html,Junk Food
https://www.cemexusa.com/find-your-location,Cemex
""")

df2_data = io.StringIO("""
URL,Last Update
https://www.mcdonalds.com,2021
www.cemexusa.com,2020
""")

df1 = pd.read_csv(df1_data)
df2 = pd.read_csv(df2_data)

def to_key(url):
    if "://" not in url:  # or: not re.match("(?:http|ftp|https)://", url)
        url = f"https://{url}"
    return urlsplit(url).hostname

df1["Key"] = df1["URL"].apply(to_key)
df2["Key"] = df2["URL"].apply(to_key)

joined = df1.merge(df2, on="Key", suffixes=("_df1", "_df2"))

# and if you want to get rid of the original urls
joined = joined.drop(["URL_df1", "URL_df2"], axis=1)

The output of print(joined) will be:

  Description                Key  Last Update
0   Junk Food  www.mcdonalds.com         2021
1       Cemex   www.cemexusa.com         2020

There may be other special cases that this answer does not handle. Depending on your data, you may also need to handle an omitted www:

urlsplit("https://realpython.com/pandas-merge-join-and-concat").hostname
# realpython.com

urlsplit("https://www.realpython.com").hostname  # also a valid URL
# www.realpython.com
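
If both forms occur in your data, one option is to drop a leading "www." before using the hostname as the key. This is only a sketch: strip_www and to_key_no_www are hypothetical names, and treating www.example.com and example.com as the same site is an assumption you would need to verify for your own data.

```python
from urllib.parse import urlsplit

def strip_www(hostname):
    # Hypothetical helper: drop a single leading "www." if present.
    if hostname and hostname.startswith("www."):
        return hostname[len("www."):]
    return hostname

def to_key_no_www(url):
    # Same idea as to_key, followed by the "www." normalization.
    if "://" not in url:
        url = f"https://{url}"
    return strip_www(urlsplit(url).hostname)

print(to_key_no_www("https://realpython.com/pandas-merge-join-and-concat"))
print(to_key_no_www("https://www.realpython.com"))
# both print: realpython.com
```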

What is the difference between urlparse and urlsplit?

It depends on your use case and on what information you would like to extract. Since you do not need the URL's params, I would suggest using urlsplit.

[urlsplit()] is similar to urlparse(), but does not split the params from the URL. https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlsplit
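
To make the quoted difference concrete, here is a small sketch. The URL, including example.com and the ;type=a parameter, is made up for illustration.

```python
from urllib.parse import urlparse, urlsplit

url = "https://example.com/path;type=a?q=1"

parsed = urlparse(url)  # 6 fields: params is split off from the path
split = urlsplit(url)   # 5 fields: params stay inside the path

print(parsed.path, parsed.params)         # /path type=a
print(split.path)                         # /path;type=a
print(parsed.hostname == split.hostname)  # True
```

For building a merge key the two behave the same, since the hostname does not depend on how the path is split.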

python

pandas

python-re