RegEx for extracting domains and subdomains

Question

I'm trying to strip a bunch of website URLs down to their domain names, i.e.:

https://www.facebook.org/hello

becomes facebook.org.

I'm using this regex pattern to find them:

(https?:\/\/)?([wW]{3}\.)?([\w]*.\w*)([\/\w]*)

This catches most cases, but occasionally there is a website such as:

http://www.xxxx.wordpress.com/hello

which I would like to strip down to xxxx.wordpress.com.

How can I identify these cases while still identifying all the other normal entries?

Answer 1

Although Robert Harvey suggested a useful approach based on urllib.parse, here is my attempt at a regex:

(?:http[s]?:\/\/)?(?:www\.)?([^/\n\r\s]+\.[^/\n\r\s]+)(?:/)?(\w+)?

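See it on regex101.com.

Explanation -

First, the regex checks whether there is an https:// or http:// prefix. If there is, the prefix is left out of the captured groups and the search continues right after it.

Then the regex checks for www. - it is important to note that this part is kept optional, so if the user enters my website is site.com, site.com will still be matched.

[^/\n\r\s]+\.[^/\n\r\s]+ matches the part of the url you actually need, so it cannot contain slashes, spaces or newlines. Oh, and it has to contain at least one period (.).

Since your question looks like you also want to match the sub-directory, I added (\w+)? at the end.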
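The steps above can be tried out with a short Python sketch. The pattern is copied from this answer; the helper name `extract` and the sample inputs are only illustrative assumptions.

```python
import re

# Pattern copied verbatim from the answer above
URL_RE = re.compile(
    r"(?:http[s]?:\/\/)?(?:www\.)?([^/\n\r\s]+\.[^/\n\r\s]+)(?:/)?(\w+)?"
)

def extract(text):
    """Return (domain, first path segment) for the first URL-like token, or None."""
    m = URL_RE.search(text)
    if m is None:
        return None
    # Group 1 is the domain, group 2 the optional sub-directory
    return m.group(1), m.group(2)

for sample in (
    "https://www.facebook.org/hello",
    "http://www.xxxx.wordpress.com/hello",
    "my website is site.com",
):
    print(sample, "->", extract(sample))
```

Because the scheme and `www.` sit in non-capturing groups, they are consumed by the match but never show up in group 1.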

TL;DR

Group 0 - the whole url

Group 1 - the domain name

Group 2 - the sub-directory
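For comparison, the urllib.parse route mentioned at the top of this answer could be sketched roughly as follows. This is not Robert Harvey's code: the helper name `host_without_www` is made up for illustration, and the sketch assumes that an input without `//` is a bare host (optionally followed by a path) that can safely be prefixed with `//`.

```python
from urllib.parse import urlsplit

def host_without_www(url):
    """Return the host name of url with a leading 'www.' removed."""
    # urlsplit only fills in the network location when the input contains
    # "//", so bare hosts such as "site.com/path" get it prepended (assumption).
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host

for sample in (
    "https://www.facebook.org/hello",
    "http://www.xxxx.wordpress.com/hello",
    "site.com/path",
):
    print(sample, "->", host_without_www(sample))
```

Letting the standard library split the URL avoids having to describe the scheme and path in the pattern at all; only the `www.` prefix is handled by hand.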

Answer 2

Your expression seems to work fine and outputs what you might want. I only added an i flag and slightly modified it to:

(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)

RegEx

If this isn't the expression you wanted, you can modify/change your expressions on regex101.com.

RegEx Circuit

You can also visualize your expressions on jex.im.

Python Code

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)"

test_str = ("https://www.facebook.org/hello\n"
    "http://www.xxxx.wordpress.com/hello\n"
    "http://www.xxxx.yyy.zzz.wordpress.com/hello")

# Raw string so that \3 is a backreference to group 3, not the control character "\x03"
subst = r"\3"

# count=0 replaces every match; change it to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
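It may help to see why substituting group 3 still yields the full host name even though `(\w*.\w*)` only spans two labels at a time. The sketch below assumes nothing beyond the pattern above (the variable names `pieces` and `joined` are illustrative) and lists group 3 for every match in a single URL:

```python
import re

regex = r"(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)"
url = "http://www.xxxx.wordpress.com/hello"

# Collect group 3 from each successive, non-overlapping match
pieces = [m.group(3) for m in re.finditer(regex, url, flags=re.IGNORECASE)]
print(pieces)   # ['xxxx.wordpress', '.com']

# re.sub with a group 3 backreference effectively concatenates these pieces
joined = "".join(pieces)
print(joined)   # xxxx.wordpress.com
```

The pattern matches twice here: once for `http://www.xxxx.wordpress` and once more for `.com/hello`. One consequence is that a dot inside the path, as in `/test.html`, starts yet another match and its group 3 ends up in the result as well.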

JavaScript Demo

const regex = /(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)/gmi;
const str = `https://www.facebook.org/hello
http://www.xxxx.wordpress.com/hello
http://www.xxxx.yyy.zzz.wordpress.com/hello`;
// $3 inserts capture group 3, mirroring the \3 substitution in the Python code
const subst = `$3`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

Answer 3

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

print("-------------")

# First pass: replace each match with its group 3 (drops the scheme and "www.")
regex = r"(https?:\/\/)?([w]{3}\.)?(\w*.\w*)([\/\w]*)"
# Second pass: remove "microsoft.com" and everything after it on the line
regex1 = r"\.?(microsoft.com.*)"
test_str = (
    "https://blog.microsoft.com/test.html\n"
    "https://www.blog.microsoft.com/test/test\n"
    "https://microsoft.com\n"
    "http://www.blog.xyz.abc.microsoft.com/test/test\n"
    "https://www.microsoft.com")

# Raw string so that \3 is a backreference to group 3, not the control character "\x03"
subst = r"\3"
if test_str:
    print(test_str)

print("-----")
# count=0 replaces every match; change it to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)

print("-----")
result = re.sub(regex1, "", result, count=0, flags=re.MULTILINE | re.IGNORECASE)
if result:
    print(result)

print("-----")
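Reading the script above, the second substitution seems intended to leave only the subdomain labels in front of microsoft.com. A more direct sketch of that idea, assuming the parent domain is known in advance, could look like this. The function name `subdomain_labels` and its `parent` parameter are hypothetical and not part of the original answer.

```python
import re

def subdomain_labels(url, parent="microsoft.com"):
    """Return the labels in front of `parent` ('' if none), or None if the
    URL's host is not `parent` or one of its subdomains."""
    pattern = re.compile(
        r"^(?:https?://)?(?:www\.)?"      # optional scheme and www. prefix
        r"(?:(?P<sub>[^/\s]+)\.)?"        # optional subdomain labels
        + re.escape(parent)               # the parent domain, taken literally
        + r"(?:[/:?#]|$)",                # host must end here
        re.IGNORECASE,
    )
    m = pattern.match(url)
    if m is None:
        return None
    return m.group("sub") or ""

for sample in (
    "https://blog.microsoft.com/test.html",
    "http://www.blog.xyz.abc.microsoft.com/test/test",
    "https://www.microsoft.com",
    "https://example.com",
):
    print(sample, "->", repr(subdomain_labels(sample)))
```

Escaping the parent domain with `re.escape` keeps its dots literal, so a host such as `microsoftXcom` would not be accepted by accident.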


