Domain Name Regex Including IDN Characters c#

Question

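I want my domain name to contain no more than one consecutive (.), '/', or any other special character. It may, however, contain IDN characters such as Á, ś, etc. I can meet all of these requirements (except IDN) with this regular expression: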

@"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";

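The problem is that this regular expression also rejects IDN characters. I want a regular expression that allows IDN characters. I have done a lot of research, but I cannot figure it out.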

Answer 1

Introduction

Regular expressions include a character class, \p{}, that lets you specify a Unicode general category. The MSDN regex documentation contains the following:

\p{ name } Matches any single character in the Unicode general category or named block specified by name.

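Also, as a side note, I noticed that your regular expression contains an unescaped dot. In a regular expression, the dot character . has the special meaning of "any character" (except a newline, unless specified otherwise). You may need to change it to \. to make sure it behaves correctly.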
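The difference between an unescaped and an escaped dot can be seen in a minimal sketch like the one below. The class name DotEscapeDemo and the sample hostnames are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotEscapeDemo
{
    // Unescaped dot: matches any single character except a newline.
    public static bool MatchesUnescaped(string input) =>
        Regex.IsMatch(input, @"^example.com$");

    // Escaped dot: matches only a literal '.'.
    public static bool MatchesEscaped(string input) =>
        Regex.IsMatch(input, @"^example\.com$");

    public static void Main()
    {
        Console.WriteLine(MatchesUnescaped("exampleXcom")); // True
        Console.WriteLine(MatchesEscaped("exampleXcom"));   // False
        Console.WriteLine(MatchesEscaped("example.com"));   // True
    }
}
```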

Code

Editing your existing code to use Unicode character classes instead of plain ASCII letters, you should end up with the following:

^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$

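Explanation

\p{L} is the Unicode character class for any letter in any language/script
\p{N} is the Unicode character class for any number in any language/script (judging by your character samples you could probably keep 0-9, but I thought I would show you the general concept and give you some extra information)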
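As a rough sketch of how this behaves in C#, the snippet below compares the ASCII-only pattern from the question with a Unicode-class pattern in which the label separator is written as \. and the hyphen sits at the end of the class. The class name UnicodeDomainRegexDemo and the sample hostnames are assumptions for illustration, and the check is purely syntactic: it does not implement the IDNA rules. Note that \p{L} does not match combining marks (\p{M}), so input in decomposed form may be rejected.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeDomainRegexDemo
{
    // ASCII-only pattern from the question.
    private static readonly Regex AsciiOnly = new Regex(
        @"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$");

    // Same shape, with Unicode letters (\p{L}) and numbers (\p{N}).
    private static readonly Regex UnicodeAware = new Regex(
        @"^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$");

    public static bool MatchesAsciiOnly(string host) => AsciiOnly.IsMatch(host);

    public static bool MatchesUnicodeAware(string host) => UnicodeAware.IsMatch(host);

    public static void Main()
    {
        // Escapes keep the sample independent of the source file encoding.
        string polish = "\u015Bl\u0105zak.pl";        // "ślązak.pl"
        string accented = "\u00C1nimo.example.com";   // "Ánimo.example.com"

        Console.WriteLine(MatchesAsciiOnly(polish));            // False
        Console.WriteLine(MatchesUnicodeAware(polish));         // True
        Console.WriteLine(MatchesUnicodeAware(accented));       // True
        Console.WriteLine(MatchesUnicodeAware("a..com"));       // False: consecutive dots
        Console.WriteLine(MatchesUnicodeAware("exa/mple.com")); // False: '/' is not allowed
    }
}
```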

This site gives a quick overview of the most commonly used Unicode categories.

\p{L} or \p{Letter}: any kind of letter from any language.

\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.

\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.

\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.

\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).

\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.

\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.

\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).

\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).

\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).

\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.

\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.

\p{Zl} or \p{Line_Separator}: line separator character U+2028.

\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.

\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.

\p{Sm} or \p{Math_Symbol}: any mathematical symbol.

\p{Sc} or \p{Currency_Symbol}: any currency sign.

\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.

\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.

\p{N} or \p{Number}: any kind of numeric character in any script.

\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.

\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.

\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).

\p{P} or \p{Punctuation}: any kind of punctuation character.

\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.

\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.

\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.

\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.

\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.

\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.

\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.

\p{C} or \p{Other}: invisible control characters and unused code points.

\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.

\p{Cf} or \p{Format}: invisible formatting indicator.

\p{Co} or \p{Private_Use}: any code point reserved for private use.

\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.

\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
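To see which category a given character falls into from C#, something like the following sketch may help. It uses CharUnicodeInfo.GetUnicodeCategory and the short category names (\p{Lu}, \p{Nd}, ...). As far as I know, the .NET regex engine accepts only the short names and named blocks, not the long aliases such as \p{Uppercase_Letter} or \p{L&} listed above. The class name UnicodeCategoryDemo is made up for this example.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeCategoryDemo
{
    // Looks up the Unicode general category of a single UTF-16 char.
    public static UnicodeCategory CategoryOf(char c) =>
        CharUnicodeInfo.GetUnicodeCategory(c);

    // Tests a one-character string against a short category name, e.g. "Lu".
    public static bool IsInCategory(char c, string shortName) =>
        Regex.IsMatch(c.ToString(), @"^\p{" + shortName + "}$");

    public static void Main()
    {
        Console.WriteLine(CategoryOf('\u00C1')); // UppercaseLetter (Á)
        Console.WriteLine(CategoryOf('\u015B')); // LowercaseLetter (ś)
        Console.WriteLine(CategoryOf('\u0663')); // DecimalDigitNumber (Arabic-Indic digit three)
        Console.WriteLine(CategoryOf('\u2167')); // LetterNumber (Roman numeral eight)
        Console.WriteLine(CategoryOf('_'));      // ConnectorPunctuation
        Console.WriteLine(CategoryOf('-'));      // DashPunctuation

        Console.WriteLine(IsInCategory('\u00C1', "Lu")); // True
        Console.WriteLine(IsInCategory('\u00C1', "N"));  // False
    }
}
```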

Answer 2

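This question cannot be answered with a simple regular expression that allows various Unicode character classes, because the IDN Character Categorization defines many illegal characters and there are other restrictions as well.

As far as I know, IDN labels in their ASCII-encoded form start with xn--. This is what enables extended UTF-8 characters in domain names; for example, 大众汽车.cn is a valid domain name (Volkswagen in Chinese). To validate this domain name with a regular expression, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车.cn) pass.
To do that, you need to encode the domain name in ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA), Doc/PDF.

Libidn ships with a CLI tool called idn that converts a hostname in UTF-8 to its ACE encoding. The resulting string can then be used as the ACE-encoded equivalent of the UTF-8 URL.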

  $ idn --quiet -a 大众汽车.cn
  xn--3oq18vl8pn36a.cn
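In C# a similar conversion can be sketched with System.Globalization.IdnMapping instead of the idn CLI. This is only an illustration: the class name AceDemo is made up, and the exact IDNA rules IdnMapping applies may depend on the runtime and operating system. For the hostname used above, the result should correspond to what idn printed.

```csharp
using System;
using System.Globalization;

public static class AceDemo
{
    // Unicode hostname -> ASCII Compatible Encoding (punycode labels).
    public static string ToAce(string unicodeHost) =>
        new IdnMapping().GetAscii(unicodeHost);

    // ACE hostname -> Unicode form.
    public static string FromAce(string aceHost) =>
        new IdnMapping().GetUnicode(aceHost);

    public static void Main()
    {
        // "大众汽车.cn", written with escapes to avoid source-encoding issues.
        string host = "\u5927\u4F17\u6C7D\u8F66.cn";

        string ace = ToAce(host);
        Console.WriteLine(ace);                  // ACE form, beginning with "xn--"
        Console.WriteLine(FromAce(ace) == host); // round trip back to the Unicode form
    }
}
```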

Inspired by paka and timgws, I suggest the following regular expression, which should cover most domains:

^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$

Here are some examples:

#Valid
xn-fsqu00a.xn-0zwm56d
xn-fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
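To try the pattern against a few of these samples from C#, a small harness along the following lines could be used. The class name Answer2RegexDemo is an assumption for this sketch; the pattern is copied unchanged from above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Answer2RegexDemo
{
    // Pattern copied verbatim from the answer.
    private static readonly Regex DomainPattern = new Regex(
        @"^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$");

    public static bool IsMatch(string host) => DomainPattern.IsMatch(host);

    public static void Main()
    {
        string[] samples =
        {
            "xn--3oq18vl8pn36a.cn", // ACE form of the IDN example
            "google.com.au",
            "a.net",
            "0-0O_.COM",
            "-0-0O.COM",            // leading hyphen in the first label
            "0-0O.-COM",            // leading hyphen after the dot
            "-a.dot"
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(sample + " -> " + IsMatch(sample));
        }
    }
}
```

Because the pattern only sees ASCII, a Unicode hostname has to be converted to its ACE form before it is tested.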

Some useful links:
* Top level domains - Delegated string
* Internationalized Domain Names (IDN) FAQ
* Internationalized Domain Names Support page from Oracle's International Language Environment Guide

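If you want to use the Unicode character classes \p{}, you should use the following, as specified by the IDN FAQ. Note that this is written in Unicode set notation; properties such as Changes_When_NFKC_Casefolded and HST are not available in the .NET regex engine, so it cannot be pasted into a C# pattern as-is: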

[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=T}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]

See also: Perl Unicode properties

Answer 3

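There can be several reasons why one may need to validate a domain or an internationalized domain name:

To accept only functional domains that resolve when probed with a DNS query
To accept strings that could potentially act as a domain name (registered and subsequently resolved, or used only for informational purposes)

Depending on the nature of the requirement, the way a domain name is validated differs considerably.

Validating a domain name purely against the technical specification, regardless of whether it resolves in DNS, is a more complex problem than writing a regular expression with a certain number of Unicode character classes.

A number of RFCs (5891, 5892, 5893, 5894 and 5895) together define the structure of a valid domain name (IDN specifically, domain in general). This involves not only various Unicode character classes but also some context-specific rules that need a full-fledged algorithm of their own. Typically, all the leading programming languages and frameworks provide a way to validate domain names against the latest IDNA protocol, i.e. IDNA 2008.

C# provides a library, System.Globalization.IdnMapping, that converts a domain name to its equivalent punycode version. You can use this library to check whether a user-submitted domain conforms to the IDNA specification. If it does not, you will hit an error/exception during the conversion, which is how the user's submission gets validated.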
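A minimal sketch of that approach might look like the following. The helper name IdnValidator.TryGetAscii is hypothetical, and the property settings shown (UseStd3AsciiRules, AllowUnassigned) are one reasonable choice rather than a requirement. Which IDNA version IdnMapping follows, and exactly which inputs it rejects, may vary with the .NET runtime and the operating system, so treat the results as an approximation of IDNA conformance.

```csharp
using System;
using System.Globalization;

public static class IdnValidator
{
    // Returns true and the punycode (ACE) form if IdnMapping accepts the name.
    public static bool TryGetAscii(string domain, out string ace)
    {
        ace = string.Empty;
        if (string.IsNullOrEmpty(domain))
        {
            return false;
        }

        var idn = new IdnMapping
        {
            AllowUnassigned = false,   // reject unassigned code points
            UseStd3AsciiRules = true   // enforce stricter ASCII naming rules
        };

        try
        {
            ace = idn.GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            // GetAscii signals an invalid name by throwing ArgumentException.
            return false;
        }
    }

    public static void Main()
    {
        string[] inputs = { "example.com", "\u5927\u4F17\u6C7D\u8F66.cn", "a..com" };
        foreach (string input in inputs)
        {
            bool ok = TryGetAscii(input, out string ace);
            Console.WriteLine(input + " -> " + (ok ? ace : "rejected"));
        }
    }
}
```

This covers only the "could act as a domain name" case from the list above; checking that a domain actually resolves would need a separate DNS query.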

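If you are interested in digging deeper into the topic, see the very thorough study documents produced by the Universal Acceptance Steering Group (https://uasg.tech/), titled "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/) as well as "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).

In addition, if you are interested in the overall process, challenges, and issues involved in implementing an internationalized email solution, you can also read the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), and RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email).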
