Advanced grouping in domain name regex with Python3

Question

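I have a program written in Python 3 that should parse several domain names every day and extrapolate data from them.
The parsed data should serve as input for a search function and for aggregations (statistics and charts), and should save some time for the analysts who use the program.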

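Just so you know: I really don't have the time to study machine learning (which seems like a pretty good solution here), so I chose to start with regex, which I already use.
I have already searched the regex documentation, both inside and outside Stack Overflow, and worked with the debugger on regex101, but I still haven't found a way to do what I need.
Edit (24/6/2019): I mentioned machine learning because I need a complex parser that automates things as much as possible. This would be useful for making automatic choices such as blacklisting, whitelisting, and so on.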

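The parser should consider a few things:

a maximum of 126 subdomains plus the TLD
each subdomain must not be longer than 64 characters
each subdomain can contain only alphanumeric characters and the - character
each subdomain must not begin or end with the - character
the TLD must not be longer than 64 characters
the TLD must not contain only digits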
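
The basic rules above can be sketched as plain checks. This is only a minimal illustration: the helper names (`is_valid_label`, `is_valid_tld`, `is_valid_hostname`) are made up for the example, lowercase input is assumed, and requiring at least one subdomain in front of the TLD is an assumption as well.

```python
import re

# Hypothetical helpers; the limits mirror the rules listed above.
# re.ASCII keeps \d restricted to 0-9.
LABEL_RE = re.compile(r'(?!-)[a-z\d\-]{1,64}(?<!-)', re.ASCII)

def is_valid_label(label):
    """1 to 64 alphanumeric or '-' characters, no '-' at either end."""
    return LABEL_RE.fullmatch(label) is not None

def is_valid_tld(tld):
    """Same rules as a label, and not made of digits only."""
    return is_valid_label(tld) and not tld.isdigit()

def is_valid_hostname(hostname):
    labels = hostname.split('.')
    # Assumption: at least one subdomain; at most 126 subdomains plus the TLD.
    if not 2 <= len(labels) <= 127:
        return False
    *subdomains, tld = labels
    return all(is_valid_label(s) for s in subdomains) and is_valid_tld(tld)
```

These checks only validate; they do not yet split off the usage type or the optional TLD particle.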

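But I want to go a little deeper:

the first string can (optionally) contain a "usage type" such as cpanel., mail., webdisk., autodiscover. and so on (or maybe simply www.)
the TLD can (optionally) contain a particle like .co, .gov, .edu and so on (.co.uk for example)
the final part of the TLD is not really checked against any list of ccTLDs/gTLDs right now, and I don't think it will be in the future

What I thought would be useful for solving the problem is one regex group for the optional usage type, one for each subdomain, and one for the TLD (the optional particle must be inside the TLD group).
With these rules in mind, I came up with a solution: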

^(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?\.)?([a-z\d][a-z\d\-]{0,62}[a-z\d])?((\.[a-z\d][a-z\d\-]{0,62}[a-z\d]){0,124}?)(?P<TLD>(\.co|\.com|\.edu|\.net|\.org|\.gov)?\.(?!\d+)[a-z\d]{1,64})$

The solution above does not return the expected results.

Here are a couple of examples:

A couple of strings to parse

without.further.ado.lets.travel.the.forest.com  
www.without.further.ado.lets.travel.the.forest.gov.it

The groups I expect to find

Full match: without.further.ado.lets.travel.the.forest.com
group 2: without
group 3: further
group 4: ado
group 5: lets
group 6: travel
group 7: the
group 8: forest
group TLD: .com
Full match: www.without.further.ado.lets.travel.the.forest.gov.it
group USAGE: www.
group 2: without
group 3: further
group 4: ado
group 5: lets
group 6: travel
group 7: the
group 8: forest
group TLD: .gov.it

The groups I actually find

Full match: without.further.ado.lets.travel.the.forest.com
group 2: without
group 3: .further.ado.lets.travel.the.forest
group 4: .forest
group TLD: .com
Full match: www.without.further.ado.lets.travel.the.forest.gov.it
group USAGE: www.
group 2: without
group 3: .further.ado.lets.travel.the.forest
group 4: .forest
group TLD: .gov.it
group 6: .gov

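As you can see from the examples, a couple of particles are found twice, and that is not the behavior I was looking for anyway. Any attempt to edit the formula resulted in unexpected output.
Any ideas on how to get the expected results?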

Answer 1

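I don't know whether it is possible to get the output exactly as you asked. I don't think a single pattern can capture the results into different groups (group2, group3, ...).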
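
The reason is that, in the standard `re` module, a capture group placed inside a quantifier is overwritten on every repetition, so only the last repetition survives the match. A small sketch of that behavior, using a simplified label pattern that is not the one from the question:

```python
import re

host = 'without.further.ado.com'

# The capturing group is repeated, so group(1) keeps only the last label.
m1 = re.fullmatch(r'(?:([a-z\d\-]+)\.)+([a-z]+)', host)
last_only = m1.group(1)       # 'ado'
tld = m1.group(2)             # 'com'

# Capturing the whole repeated run once and splitting it afterwards
# keeps every label.
m2 = re.fullmatch(r'((?:[a-z\d\-]+\.)+)([a-z]+)', host)
all_labels = m2.group(1).rstrip('.').split('.')
# ['without', 'further', 'ado']
```

The number of groups in a pattern is fixed when it is compiled, which is why one group per subdomain is not available for an arbitrary number of subdomains.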

I found a way to get almost the expected result using the regex module.

import regex  # third-party module: pip install regex

match = regex.search(r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk|webhost|webmail[\d]?|wiki|www[\d]?)\.)?(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.){0,124}?(?P<TLD>(?:co|com|edu|net|org|gov)?\.(?!\d+)[a-z\d]{1,64})$', 'www.without.further.ado.lets.travel.the.forest.gov.it')

Output:

match.captures(0)
['www.without.further.ado.lets.travel.the.forest.gov.it']
match.captures(1) or match.captures('USAGE')
['www']
match.captures(2)
['without', 'further', 'ado', 'lets', 'travel', 'the', 'forest']
match.captures(3) or match.captures('TLD')
['gov.it']

Here, to keep the . out of the captured group, I placed it in a non-capturing group, like this:

(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.)
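
To see the effect of the non-capturing wrapper in isolation, here is a small sketch with the standard `re` module (not the `regex` module used above); the sample string is made up for the illustration:

```python
import re

text = 'without.further.ado.'

# The dot sits inside the capturing group, so it is part of every capture.
with_dot = re.findall(r'([a-z\d][a-z\d\-]{0,62}[a-z\d]\.)', text)
# ['without.', 'further.', 'ado.']

# Only the label is captured; the dot is consumed by the (?:...) wrapper.
without_dot = re.findall(r'(?:([a-z\d][a-z\d\-]{0,62}[a-z\d])\.)', text)
# ['without', 'further', 'ado']
```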

Hope this helps.
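
If installing a third-party module is not an option, a similar result can be approximated with the standard `re` module by capturing all subdomains as one group and splitting it afterwards. The following is a sketch under assumptions, not a drop-in replacement: `parse_domain`, `LABEL` and the `BODY` group are hypothetical names, single-character labels are allowed (the patterns above require at least two characters), and the TLD keeps its leading dot as in the question's expected output. The pattern above also seems to match only when one of the listed particles precedes the final label, so this sketch makes the particle truly optional.

```python
import re

# Assumption: same usage types and particles as in the question.
LABEL = r'[a-z\d](?:[a-z\d\-]{0,62}[a-z\d])?'
DOMAIN_RE = re.compile(
    r'^(?:(?P<USAGE>autodiscover|correo|cpanel|ftp|mail|new|server|webdisk'
    r'|webhost|webmail\d?|wiki|www\d?)\.)?'
    rf'(?P<BODY>{LABEL}(?:\.{LABEL}){{0,124}}?)'
    r'(?P<TLD>(?:\.(?:co|com|edu|net|org|gov))?\.(?!\d+$)[a-z\d]{1,64})$',
    re.ASCII,
)

def parse_domain(hostname):
    """Return usage type, list of subdomains and TLD, or None if no match."""
    m = DOMAIN_RE.match(hostname)
    if m is None:
        return None
    return {
        'usage': m.group('USAGE'),
        # One capture for all subdomains, split into labels afterwards.
        'labels': m.group('BODY').split('.'),
        'tld': m.group('TLD'),
    }
```

For example, `parse_domain('without.further.ado.lets.travel.the.forest.com')` gives no usage type, seven labels, and `'.com'` as the TLD. The lookahead `(?!\d+$)` rejects a final label made of digits only, whereas `(?!\d+)` rejects any final label that merely starts with a digit.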

Answer 2

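This is a simple, well-defined task. There is no fuzziness, no complexity, no guessing, just a series of simple tests to work through everything on your checklist. I don't see how "machine learning" would be appropriate or helpful. Even regex is completely unnecessary.

I have not implemented everything you want to verify, but filling in the missing pieces should not be hard.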

import string

double_tld = ['gov', 'edu', 'co', 'add_others_you_need']

# we'll use this instead of regex to check subdomain validity
valid_sd_characters = string.ascii_letters + string.digits + '-'
valid_trans = str.maketrans('', '', valid_sd_characters)

def is_invalid_sd(sd):
    return sd.translate(valid_trans) != ''

def check_hostname(hostname):
    subdomains = hostname.split('.')

    # each subdomain can contain only alphanumeric characters and
    # the - character
    invalid_parts = list(filter(is_invalid_sd, subdomains))
    # TODO react if there are any invalid parts

    # "the TLD can (optionally) contain a particle like
    # .co, .gov, .edu and so on (.co.uk for example)"
    # guard against hostnames with a single part (no subdomains[-2])
    if len(subdomains) >= 2 and subdomains[-2] in double_tld:
        subdomains[-2] += '.' + subdomains[-1]
        subdomains = subdomains[:-1]

    # "a maximum number of 126 subdomains plus the TLD"
    # TODO check list length of subdomains

    # "each subdomain must not begin or end with the - character"
    # "the TLD must not be longer than 64 characters"
    # "the TLD must not contain only digits"
    # TODO write loop, check first and last characters, length, isnumeric

    # TODO return something
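
One possible way to fill in the TODOs, given as a sketch rather than as the implementation intended above. Returning a `(subdomains, tld)` tuple or `None`, lowercasing the input, and the short particle list are assumptions made for the example.

```python
import string

# Assumption: a short list of particles; extend as needed.
DOUBLE_TLD = {'gov', 'edu', 'co'}
VALID_CHARS = set(string.ascii_lowercase + string.digits + '-')

def check_hostname(hostname):
    """Return (subdomains, tld) for a valid hostname, or None otherwise."""
    labels = hostname.lower().split('.')
    # "a maximum number of 126 subdomains plus the TLD"
    if not 2 <= len(labels) <= 127:
        return None
    for label in labels:
        # 1 to 64 characters per part (this also rejects empty parts)
        if not 1 <= len(label) <= 64:
            return None
        # only alphanumeric characters and the - character
        if not set(label) <= VALID_CHARS:
            return None
        # must not begin or end with the - character
        if label[0] == '-' or label[-1] == '-':
            return None
    # "the TLD must not contain only digits"
    if labels[-1].isdigit():
        return None
    # fold an optional particle such as "co" into the TLD ("co.uk")
    if len(labels) > 2 and labels[-2] in DOUBLE_TLD:
        return labels[:-2], labels[-2] + '.' + labels[-1]
    return labels[:-1], labels[-1]
```

With this sketch, `check_hostname('www.forest.gov.it')` returns `(['www', 'forest'], 'gov.it')`; detecting a usage type such as `www` would be one more membership test on the first element.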
