Clean and extract Subdomains & Domains from URLs using Regex Notepad++

Question

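This is a simple text file.

URLs:

May start with https:// or http://
Strip the trailing URL/file path as well
Extract only the domain and/or subdomain

I have Notepad++ and EditPlus.

I'm open to other suggestions.

Examples: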

https://appspace.com

http://appspace.com/

http://ayurfit.ning.com/main/authorization/signIn

http://bangalore.olx.in/login.php

http://birthdayshoes.com/forum/index.php

http://birthdayshoes.com/forum/register/

http://forums.virtualbox.org/ucp.php

Attempts:

/(?!.{253})((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.){1,126}+[A-Za-z]{2,6}/ 
^(?:https?://)?([^/.]+(?=\.)|)(\.?[^/.]+\.[^/]+)/?(.+|)$

https://regex101.com/r/hZ4cL4/4

I tried a lot of the examples found elsewhere, as well as on Regex101.

I also found this little nugget. I will post once I understand how it differs.

Regular Expression - Extract subdomain & domain

Answer 1

You could simply extract everything between the :// and the next /. Additionally, you could use a lookbehind for http(s) and a lookahead for the file path to fine-tune your results.
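
As a rough illustration of the lookbehind/lookahead idea in this answer, here is a minimal Python sketch. The pattern and the names HOST_BETWEEN and hosts_between are assumptions made for this example; the answer itself does not give a concrete regex. Python's re module only accepts fixed-width lookbehinds, so the sketch checks for :// rather than the full http(s)://.

```python
import re

# Assumed pattern (not from the answer): a lookbehind for "://" and a
# lookahead for either the first path slash or the end of the line.
# "https?://" would be variable-width, which Python's re rejects in a
# lookbehind, so only "://" is checked.
HOST_BETWEEN = re.compile(r"(?<=://)[^/\s]+(?=/|$)", re.MULTILINE)


def hosts_between(text):
    """Return each run of non-slash, non-space characters that follows '://'."""
    return HOST_BETWEEN.findall(text)


urls = "\n".join([
    "https://appspace.com",
    "http://appspace.com/",
    "http://ayurfit.ning.com/main/authorization/signIn",
    "http://bangalore.olx.in/login.php",
])

for host in hosts_between(urls):
    print(host)
```

Because the character class excludes only slashes and whitespace, this sketch would also keep a port or user info if a URL contained one.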

Answer 2

For links that start with a protocol, you can use the following regex:

(?<=://)[\w-]+(?:\.[\w-]+)+\b

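See demo

The (?<=://) lookbehind makes sure there is a :// right before the value we want to match, and the whole matched text consists of sequences of 1 or more word characters or hyphens ([\w-]+) separated by periods. The trailing \b ends the match at a word boundary.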
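
To check the pattern outside of Notepad++, a small Python sketch along these lines could be used. The regex is the one from this answer, copied unchanged; the names HOST_AFTER_SCHEME and extract_hosts are hypothetical and exist only for this example.

```python
import re

# The regex from the answer, unchanged. The lookbehind is fixed-width,
# so Python's re module accepts it as written.
HOST_AFTER_SCHEME = re.compile(r"(?<=://)[\w-]+(?:\.[\w-]+)+\b")


def extract_hosts(text):
    """Return every dot-separated host name that directly follows '://'."""
    return HOST_AFTER_SCHEME.findall(text)


sample = """https://appspace.com
http://appspace.com/
http://ayurfit.ning.com/main/authorization/signIn
http://bangalore.olx.in/login.php
http://birthdayshoes.com/forum/index.php
http://birthdayshoes.com/forum/register/
http://forums.virtualbox.org/ucp.php"""

for host in extract_hosts(sample):
    print(host)
```

Since the pattern requires at least one period, a bare host such as localhost would not be matched by this sketch.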

regex

subdomain

url

notepad++