Get only domain name from urls using urlsplit

Question

I have a dataset containing URLs in different forms (e.g. https://whosebug.com, https://www.whosebug.com, whosebug.com), and I need only the domain name, like whosebug.

I used parse.urlsplit(url) from urllib, but it does not work well in my case.

How can I get only the domain name?

Edit:

My code:

from urllib import parse

def normalization(df):
    # Split every URL in the "httpx" column into its components
    df['after_urlsplit'] = df["httpx"].map(lambda x: parse.urlsplit(x))
    return df

normalization(df_sample)

Output:

            httpx                       after_urlsplit
0   https://whosebug.com/       (https, whosebug.com, /, , )
1   https://www.whosebug.com/   (https, www.whosebug.com, /, , )
2   www.whosebug.com/           (, , www.whosebug.com/, , )
3   whosebug.com/               (, , whosebug.com/, , )

Answer 1

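New answer, which works for both URLs and host names

To handle inputs that have no protocol defined (e.g. example.com), it is better to use a regular expression: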

import re

urls = ['www.whosebug.com',
        'whosebug.com',
        'https://whosebug.com',
        'https://www.whosebug.com/',
        'www.whosebug.com',
        'whosebug.com',
        'https://subdomain.whosebug.com/']

for url in urls:
    # Strip an optional "scheme://" prefix, then take the second-to-last dot-separated label
    host_name = re.search(r"^(?:.*://)?(.*)$", url).group(1).split('.')[-2]
    print(host_name)

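This prints whosebug in all cases.

Old answer, which works for URLs only

You can use the netloc value returned by urlsplit, with a little extra processing to get the part of the domain you want: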

from urllib.parse import urlsplit

m = urlsplit('http://subdomain.example.com/some/extra/things')

print(m.netloc.split('.')[-2])

This prints example.

(However, this fails on URLs such as http://localhost/some/path/to/file.txt: the host name contains no dot, so indexing with [-2] raises an IndexError.)
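The two limitations noted above (inputs without a scheme, and single-label hosts such as localhost) can both be handled while still using urlsplit. The following is only a minimal sketch, not part of the original answer: the helper name domain_label is hypothetical, and it relies on the fact that urlsplit only fills netloc when the input has a scheme or starts with "//".

```python
from urllib.parse import urlsplit


def domain_label(url):
    """Return the second-to-last label of the host, or the host itself if it has no dot.

    Hypothetical helper for illustration only.
    """
    # urlsplit leaves netloc empty for inputs like "www.example.com/",
    # so prepend "//" when no scheme separator is present.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Single-label hosts such as "localhost" are returned unchanged
    # instead of raising an IndexError.
    return labels[-2] if len(labels) >= 2 else host


for url in ["https://www.whosebug.com/",
            "whosebug.com/",
            "http://localhost/some/path/to/file.txt"]:
    print(domain_label(url))
```

Note that, like the snippets above, this takes the second-to-last label blindly, so a host such as example.co.uk would yield co. Handling multi-part suffixes correctly needs a public-suffix list, which this sketch does not attempt.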

Answer 2

The best way to handle this kind of problem is a regex.

Answer 3

You can use a regular expression (regex) for this task.

import re

URL = "https://www.test.com"
# Raw string avoids invalid-escape warnings; group 2 is the host without a leading "www."
result = re.search(r"https?://(www\.)?([\w.]+)", URL)
print(result.group(2))

# output: test.com
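The pattern above requires an http:// or https:// prefix, so for scheme-less inputs like those in the question re.search returns None. As a hedged variation (a sketch, not the answerer's code), the scheme can be made optional. The function name host_without_www and the exact pattern are assumptions chosen for illustration.

```python
import re

# Optional scheme, optional "www.", then capture the host characters.
# The pattern is an illustrative assumption, not taken from the answer above.
HOST_RE = re.compile(r"^(?:https?://)?(?:www\.)?([\w.-]+)")


def host_without_www(url):
    """Return the host part of url without scheme or leading "www.", or None if nothing matches."""
    match = HOST_RE.match(url)
    return match.group(1) if match else None


for url in ["https://www.test.com", "www.test.com", "test.com/path"]:
    print(host_without_www(url))
```

This returns the host including its suffix (test.com), as in the answer above. Splitting the result on "." would still be needed to keep only the first label.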


