Domain Name Regex Including IDN Characters (C#)

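I want my domain name to contain no more than one consecutive (.), '/', or any other special character. It may, however, contain IDN characters such as Á, ś, etc. I can meet all of these requirements (except IDN) with this regular expression: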

@"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";

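The problem is that this regex also rejects IDN characters. I want a regex that allows IDN characters. I have done a lot of research, but I cannot figure it out.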

Brief

Regular expressions include character classes that let you specify a Unicode general category with \p{}. The MSDN regex documentation contains the following:

\p{ name } Matches any single character in the Unicode general category or named block specified by name.

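Also, as a side note, I noticed that your regex contains an unescaped . (dot). In a regular expression the dot character . has the special meaning of any character (except newline, unless specified otherwise). You may need to change it to \. to make sure it works correctly.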


Code

Editing your existing code to use Unicode character classes instead of plain ASCII letters, you should get the following:

^(?:[\p{L}\p{N}][\p{L}\p{N}-_]*\.)+[\p{L}\p{N}]{2,}$

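Explanation

  • \p{L} is the Unicode character class for any letter in any language/script
  • \p{N} is the Unicode character class for any number in any language/script (based on your sample characters you could probably keep 0-9, but I thought I would show you the general concept and give you some extra information)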
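As a rough illustration of how such a pattern could be used from C#, here is a minimal sketch. The class and method names (UnicodeDomainPattern, LooksLikeDomain) are made up for this example, the label separator is written as \., and the hyphen is moved to the end of the character class so that it is unambiguously a literal. It only checks the shape of the string and does not apply the IDNA rules.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeDomainPattern
{
    // Labels start with a letter or number, may continue with letters,
    // numbers, underscores or hyphens, and are separated by a literal dot.
    private static readonly Regex Pattern = new Regex(
        @"^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$",
        RegexOptions.CultureInvariant);

    // Hypothetical helper: true when the candidate has the expected shape.
    public static bool LooksLikeDomain(string candidate)
    {
        return candidate != null && Pattern.IsMatch(candidate);
    }

    public static void Main()
    {
        string[] samples =
        {
            "example.com",
            "\u00C1\u015Bexample.com", // starts with the letters Á and ś
            "exa..mple.com",           // two consecutive dots
            "exa/mple.com"             // contains a slash
        };

        foreach (string sample in samples)
        {
            Console.WriteLine($"{sample} -> {LooksLikeDomain(sample)}");
        }
    }
}
```

The first two samples should be accepted and the last two rejected, since neither an empty label nor a slash fits the character classes.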

This site gives a quick, general overview of the most commonly used Unicode categories.

  • \p{L} or \p{Letter}: any kind of letter from any language.
    • \p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
    • \p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
    • \p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
    • \p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
    • \p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
    • \p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
  • \p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
    • \p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
    • \p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
    • \p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
  • \p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
    • \p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
    • \p{Zl} or \p{Line_Separator}: line separator character U+2028.
    • \p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
  • \p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
    • \p{Sm} or \p{Math_Symbol}: any mathematical symbol.
    • \p{Sc} or \p{Currency_Symbol}: any currency sign.
    • \p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
    • \p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
  • \p{N} or \p{Number}: any kind of numeric character in any script.
    • \p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
    • \p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
    • \p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
  • \p{P} or \p{Punctuation}: any kind of punctuation character.
    • \p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
    • \p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
    • \p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
    • \p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
    • \p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
    • \p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
    • \p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
  • \p{C} or \p{Other}: invisible control characters and unused code points.
    • \p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
    • \p{Cf} or \p{Format}: invisible formatting indicator.
    • \p{Co} or \p{Private_Use}: any code point reserved for private use.
    • \p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
    • \p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.

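This question cannot be answered with a simple regex that allows various Unicode character classes, because the IDN Character Categorization defines many illegal characters and there are other restrictions as well.

As far as I know, IDN domain names start with xn-- once encoded. This is what enables extended UTF-8 characters in domain names; for example, 大众汽车.cn is a valid domain name (Volkswagen in Chinese). To validate this domain with a regex, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车) pass.
To do that, you need to encode the domain name to ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA), Doc/PDF.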

Libidn comes with a CLI tool called idn that lets you convert a hostname in UTF-8 to ACE encoding. The resulting string can then be used as the ACE-encoded equivalent of the UTF-8 URL.

  $ idn --quiet -a 大众汽车.cn
  xn--3oq18vl8pn36a.cn
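Since the question is about C#, the same conversion can be sketched without an external tool by using System.Globalization.IdnMapping from the .NET base class library. The wrapper names below (AceConversion, ToAce, FromAce) are made up for this example, and the exact IDNA rules applied may depend on the runtime and platform.

```csharp
using System;
using System.Globalization;

public static class AceConversion
{
    // Converts a Unicode host name to its ASCII Compatible Encoding (ACE) form.
    public static string ToAce(string unicodeHost)
    {
        var idn = new IdnMapping();
        return idn.GetAscii(unicodeHost);
    }

    // Converts an ACE host name back to its Unicode form.
    public static string FromAce(string aceHost)
    {
        var idn = new IdnMapping();
        return idn.GetUnicode(aceHost);
    }

    public static void Main()
    {
        string ace = ToAce("大众汽车.cn");
        Console.WriteLine(ace);          // the ACE form, prefixed with xn--
        Console.WriteLine(FromAce(ace)); // round trip back to Unicode
    }
}
```

Once the name is in ACE form it consists of ASCII characters only, so an ASCII-only regex can be applied to it.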

Inspired by paka and timgws, I suggest the following regular expression, which should cover most domains:

^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$

Here are some examples:

#Valid
xn-fsqu00a.xn-0zwm56d
xn-fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
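To try the suggested expression from C#, a small harness like the following could be used. This is only a sketch: the class name AceDomainPattern is made up, the pattern is split across two string literals for readability, and the hyphen in the second character class is moved to the end so that it is clearly a literal. A handful of the samples listed above are used as input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AceDomainPattern
{
    // The suggested expression, split in two parts for readability.
    private static readonly Regex Pattern = new Regex(
        @"^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9_-]{0,61}[a-zA-Z0-9]{0,1}\." +
        @"(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$");

    public static bool IsMatch(string host)
    {
        return host != null && Pattern.IsMatch(host);
    }

    public static void Main()
    {
        string[] samples =
        {
            "google.com.au", "i.oh1.me", "0-0O_.COM", "a.net", // listed as valid
            "-0-0O.COM", "0-0O.-COM", "-a.dot"                 // listed as invalid
        };

        foreach (string sample in samples)
        {
            Console.WriteLine($"{sample} -> {IsMatch(sample)}");
        }
    }
}
```

The negative lookaheads (?!-) are what reject labels that begin with a hyphen, and the {0,61} quantifier keeps the first label within 63 characters.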

Some useful links:

  • Top level domains - Delegated string
  • Internationalized Domain Names (IDN) FAQ
  • Internationalized Domain Names Support page from Oracle's International Language Environment Guide

If you want to use Unicode character classes \p{}, you should use the following, as specified by the IDN FAQ:

[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=V}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]

See also: Perl Unicode properties

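There can be several reasons why you might need to validate a domain name or an internationalized domain name:

  1. To accept only functional domains, that is, ones that resolve when probed with a DNS query
  2. To accept strings that could potentially act as domain names (registered and subsequently resolved, or kept for informational purposes only)

How a domain name should be validated differs greatly depending on the nature of the requirement.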
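For the first kind of requirement, a resolvability check could look roughly like the sketch below. It relies on System.Net.Dns.GetHostAddresses; the class and method names (DnsProbe, Resolves) are made up for this example, and the result naturally depends on the network and resolver available at the time of the call.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class DnsProbe
{
    // Hypothetical helper: true when the name resolves to at least one address.
    public static bool Resolves(string host)
    {
        if (string.IsNullOrWhiteSpace(host))
        {
            return false;
        }

        try
        {
            return Dns.GetHostAddresses(host).Length > 0;
        }
        catch (SocketException)
        {
            // The resolver could not find the name (or could not be reached).
            return false;
        }
        catch (ArgumentException)
        {
            // The name is malformed, for example longer than 255 characters.
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Resolves("example.com"));
    }
}
```

A check like this says nothing about whether a string is a well-formed name that merely happens not to be registered, which is the second kind of requirement.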

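Validating a domain name purely against the technical specification, regardless of whether it resolves in DNS, is a more complex problem than writing a regex with a certain number of Unicode classes.

A number of RFCs (5891, 5892, 5893, 5894 and 5895) together define the structure of valid domain names (IDNs in particular, domain names in general). This involves not only various Unicode character classes but also some context-specific rules that need full-fledged algorithms of their own. Typically, all leading programming languages and frameworks provide a way to validate domain names against the latest IDNA protocol, i.e. IDNA 2008.

C# provides a library class, System.Globalization.IdnMapping, which converts domain names to their equivalent Punycode form. You can use it to check whether a user-submitted domain conforms to the IDNA specification. If it does not, the conversion raises an error/exception, which lets you reject the submission.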
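A minimal sketch of that idea follows. The wrapper names (IdnValidation, TryToAscii) are made up for this example; IdnMapping, GetAscii, UseStd3AsciiRules and AllowUnassigned are the actual .NET members. Which revision of IDNA is enforced may depend on the runtime and the underlying platform, so treat the result as a first filter rather than a full conformance check.

```csharp
using System;
using System.Globalization;

public static class IdnValidation
{
    // Hypothetical helper: tries to convert a domain name to its Punycode
    // (ACE) form and reports failure instead of throwing.
    public static bool TryToAscii(string domain, out string ascii)
    {
        ascii = string.Empty;
        if (string.IsNullOrWhiteSpace(domain))
        {
            return false;
        }

        var mapping = new IdnMapping
        {
            UseStd3AsciiRules = true, // apply the stricter host name rules
            AllowUnassigned = false   // reject unassigned code points
        };

        try
        {
            ascii = mapping.GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            // GetAscii throws ArgumentException for names it cannot convert.
            return false;
        }
    }

    public static void Main()
    {
        string[] samples = { "大众汽车.cn", "example.com", "a..b" };
        foreach (string sample in samples)
        {
            bool ok = TryToAscii(sample, out string ascii);
            Console.WriteLine($"{sample} -> {ok} {ascii}");
        }
    }
}
```

Catching only ArgumentException keeps unrelated failures visible, and returning the converted string lets the caller store or compare the ACE form directly.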

If you are interested in digging deeper into the subject, see the very thorough research documents produced by the Universal Acceptance Steering Group (https://uasg.tech/), titled "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/) as well as "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).

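In addition, if you are interested in the overall process, challenges and issues of implementing an internationalized email solution, you can also read the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), and RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email).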