Domain Name Regex Including IDN Characters (C#)
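I want my domain names to contain no more than one consecutive dot (.), no '/', and no other special characters. They may, however, contain IDN characters such as Á, ś, etc. I can meet every requirement except the IDN one with this regex: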
@"^(?:[a-zA-Z0-9][a-zA-Z0-9-_]*\.)+[a-zA-Z0-9]{2,}$";
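The problem is that this regex also rejects IDN characters. I want a regex that allows IDN characters. I have done a lot of research but cannot figure it out.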
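Introduction
Regular expressions provide a character class, \p{}, that lets you specify a Unicode general category. The MSDN regex documentation says:
\p{ name }
Matches any single character in the Unicode general category or named block specified by name.
Also, as a side note, I noticed that your regex contains an unescaped dot. In a regex, the dot character . has the special meaning of "any character" (except newline, unless specified otherwise). You probably want to change it to \. so that it only matches a literal dot.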
Code
Editing your existing pattern to use Unicode character classes instead of plain ASCII letters, you should get the following:
^(?:[\p{L}\p{N}][\p{L}\p{N}-_]*\.)+[\p{L}\p{N}]{2,}$
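Explanation
\p{L} represents the Unicode character class for any letter in any language/script.
\p{N} represents the Unicode character class for any number in any language/script (based on your sample characters you could probably keep 0-9, but I thought I would show you the general concept and give you some extra information).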
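To try the Unicode-class approach in C#, here is a minimal sketch. It assumes the label separator is escaped as \. and places the hyphen last in the character class so that it cannot be read as a range. The class name UnicodeDomainRegex and the method IsMatch are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class UnicodeDomainRegex
{
    // Assumed variant of the pattern discussed above: \p{L} = any letter,
    // \p{N} = any number, literal dot escaped, hyphen placed last in the class.
    private static readonly Regex Pattern = new Regex(
        @"^(?:[\p{L}\p{N}][\p{L}\p{N}_-]*\.)+[\p{L}\p{N}]{2,}$",
        RegexOptions.CultureInvariant);

    // Returns true when every label starts with a letter or number and
    // the last label has at least two letters or numbers.
    public static bool IsMatch(string domain)
    {
        return domain != null && Pattern.IsMatch(domain);
    }
}
```

With this sketch, UnicodeDomainRegex.IsMatch("Áś.com") returns true, while "a..com" and "a/b.com" are rejected. Note that it only checks the shape of the string; it does not apply the IDNA rules for which code points are allowed.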
This site gives a quick overview of the most commonly used Unicode categories.
\p{L} or \p{Letter}: any kind of letter from any language.
\p{Ll} or \p{Lowercase_Letter}: a lowercase letter that has an uppercase variant.
\p{Lu} or \p{Uppercase_Letter}: an uppercase letter that has a lowercase variant.
\p{Lt} or \p{Titlecase_Letter}: a letter that appears at the start of a word when only the first letter of the word is capitalized.
\p{L&} or \p{Cased_Letter}: a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt).
\p{Lm} or \p{Modifier_Letter}: a special character that is used like a letter.
\p{Lo} or \p{Other_Letter}: a letter or ideograph that does not have lowercase and uppercase variants.
\p{M} or \p{Mark}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\p{Mn} or \p{Non_Spacing_Mark}: a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.).
\p{Mc} or \p{Spacing_Combining_Mark}: a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages).
\p{Me} or \p{Enclosing_Mark}: a character that encloses the character it is combined with (circle, square, keycap, etc.).
\p{Z} or \p{Separator}: any kind of whitespace or invisible separator.
\p{Zs} or \p{Space_Separator}: a whitespace character that is invisible, but does take up space.
\p{Zl} or \p{Line_Separator}: line separator character U+2028.
\p{Zp} or \p{Paragraph_Separator}: paragraph separator character U+2029.
\p{S} or \p{Symbol}: math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}: any mathematical symbol.
\p{Sc} or \p{Currency_Symbol}: any currency sign.
\p{Sk} or \p{Modifier_Symbol}: a combining character (mark) as a full character on its own.
\p{So} or \p{Other_Symbol}: various symbols that are not math symbols, currency signs, or combining characters.
\p{N} or \p{Number}: any kind of numeric character in any script.
\p{Nd} or \p{Decimal_Digit_Number}: a digit zero through nine in any script except ideographic scripts.
\p{Nl} or \p{Letter_Number}: a number that looks like a letter, such as a Roman numeral.
\p{No} or \p{Other_Number}: a superscript or subscript digit, or a number that is not a digit 0–9 (excluding numbers from ideographic scripts).
\p{P} or \p{Punctuation}: any kind of punctuation character.
\p{Pd} or \p{Dash_Punctuation}: any kind of hyphen or dash.
\p{Ps} or \p{Open_Punctuation}: any kind of opening bracket.
\p{Pe} or \p{Close_Punctuation}: any kind of closing bracket.
\p{Pi} or \p{Initial_Punctuation}: any kind of opening quote.
\p{Pf} or \p{Final_Punctuation}: any kind of closing quote.
\p{Pc} or \p{Connector_Punctuation}: a punctuation character such as an underscore that connects words.
\p{Po} or \p{Other_Punctuation}: any kind of punctuation character that is not a dash, bracket, quote or connector.
\p{C} or \p{Other}: invisible control characters and unused code points.
\p{Cc} or \p{Control}: an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F.
\p{Cf} or \p{Format}: invisible formatting indicator.
\p{Co} or \p{Private_Use}: any code point reserved for private use.
\p{Cs} or \p{Surrogate}: one half of a surrogate pair in UTF-16 encoding.
\p{Cn} or \p{Unassigned}: any code point to which no character has been assigned.
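This question cannot be answered with a simple regex that allows various Unicode character classes, because the IDN Character Categorization defines many illegal characters and there are further restrictions.
As far as I know, IDN domain names in their encoded form start with xn--. This is what enables extended UTF-8 characters in domain names; for example, 大众汽车.cn is a valid domain name (Volkswagen in Chinese). To validate this domain name with a regex, you need to let http://xn--3oq18vl8pn36a.cn/ (the ACE equivalent of 大众汽车.cn) pass.
To do that, you need to encode the domain name into ASCII Compatible Encoding (ACE) using GNU Libidn (or any other library that implements IDNA), Doc/PDF.
Libidn comes with a CLI tool called idn that lets you convert a host name in UTF-8 to ACE. The resulting string can then be used as the ACE-encoded equivalent of the UTF-8 URL.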
$ idn --quiet -a 大众汽车.cn
xn--3oq18vl8pn36a.cn
Inspired by paka and timgws, I suggest the following regex, which should cover most domains:
^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$
Here are some examples:
#Valid
xn-fsqu00a.xn-0zwm56d
xn-fsqu00a.xn--vermgensberatung-pwb
xn--stackoverflow.com
stackoverflow.xn--com
stackoverflow.co.uk
google.com.au
i.oh1.me
wow.british-library.uk
0-0O_.COM
a.net
0-0O.COM
0-OZ.CO.uk
0-TENSION.COM.br
0-WH-AO14-0.COM-com.net
a-1234567890-1234567890-1234567890-1234567890-1234567890-1234-z.eu.us
#Invalid
-0-0O.COM
0-0O.-COM
-a.dot
a-1234567890-1234567890-1234567890-1234567890-1234567890-12345-z.eu.us
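The two steps described above (convert to ACE, then match the ASCII pattern) can be sketched in C#. This is only an illustration under assumptions: it uses System.Globalization.IdnMapping in place of the idn CLI, copies the suggested pattern unchanged, and the names AceDomainCheck and IsMatch are invented here.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class AceDomainCheck
{
    // The ASCII-only pattern suggested above, copied unchanged.
    private static readonly Regex AcePattern = new Regex(
        @"^(?!-)(xn--)?[a-zA-Z0-9][a-zA-Z0-9-_]{0,61}[a-zA-Z0-9]{0,1}\.(?!-)(xn--)?([a-zA-Z0-9\-]{1,50}|[a-zA-Z0-9-]{1,30}\.[a-zA-Z]{2,})$");

    private static readonly IdnMapping Idn = new IdnMapping();

    // Converts a (possibly Unicode) host name to ACE, then applies the pattern.
    public static bool IsMatch(string domain)
    {
        string ace;
        try
        {
            // For 大众汽车.cn the idn tool above printed xn--3oq18vl8pn36a.cn.
            ace = Idn.GetAscii(domain);
        }
        catch (ArgumentException)
        {
            // GetAscii rejects null, empty input and names it cannot encode.
            return false;
        }
        return AcePattern.IsMatch(ace);
    }
}
```

Calling AceDomainCheck.IsMatch("google.com.au") returns true and AceDomainCheck.IsMatch("-a.dot") returns false, in line with the lists above. A Unicode name such as 大众汽车.cn is converted first, so the pattern only ever sees the xn-- form.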
Some useful links
* Top level domains - Delegated string
* Internationalized Domain Names (IDN) FAQ
* Internationalized Domain Names Support page from Oracle's International Language Environment Guide
If you want to use Unicode character classes \p{}, you should use the following, as specified by the IDN FAQ:
[ \P{Changes_When_NFKC_Casefolded}
- \p{c} - \p{z}
- \p{s} - \p{p} - \p{nl} - \p{no} - \p{me}
- \p{HST=L} - \p{HST=V} - \p{HST=T}
- \p{block=Combining_Diacritical_Marks_For_Symbols}
- \p{block=Musical_Symbols}
- \p{block=Ancient_Greek_Musical_Notation}
- [\u0640 \u07FA \u302E \u302F \u3031-\u3035 \u303B]
+ [\u00B7 \u0375 \u05F3 \u05F4 \u30FB]
+ [\u002D \u06FD \u06FE \u0F0B \u3007]
+ [\u00DF \u03C2]
+ \p{JoinControl}]
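There can be several reasons to validate a domain name or an internationalized domain name:
- to accept only functional domains, i.e. those that resolve when probed with a DNS query
- to accept strings that could potentially act as a domain name (registered and subsequently resolved, or just for informational purposes)
Depending on the nature of the requirement, the way a domain name is validated differs a great deal.
Validating a domain name purely from the standpoint of the technical specification, regardless of its resolvability vis-a-vis DNS, is a more complex problem than merely writing a regex with a certain number of Unicode classes.
A number of RFCs (5891, 5892, 5893, 5894 and 5895) together define the structure of a valid domain name (IDN in particular, domains in general). This involves not only various Unicode character classes but also some context-specific rules that need full-fledged algorithms of their own. Typically, all leading programming languages and frameworks provide a way to validate domain names according to the latest IDNA protocol, i.e. IDNA 2008.
C# provides the System.Globalization.IdnMapping class, which converts domain names to their equivalent punycode versions. You can use it to check whether a user-submitted domain conforms to the IDNA specification. If it does not, the conversion fails with an exception, which serves as the validation of the user's submission.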
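A minimal sketch of that exception-based check follows. The wrapper names IdnValidator and TryToAscii are hypothetical, and enabling UseStd3AsciiRules is an assumption about how strict you want to be. Which IDNA revision IdnMapping applies can depend on the runtime and operating system, so treat this as a starting point rather than a complete IDNA 2008 validator.

```csharp
using System;
using System.Globalization;

public static class IdnValidator
{
    // Tries to convert a domain name to its punycode (ACE) form.
    // Returns false when IdnMapping rejects the name.
    public static bool TryToAscii(string domain, out string ascii)
    {
        // UseStd3AsciiRules = true asks for the stricter STD3 ASCII rules,
        // which limit ASCII labels to letters, digits and hyphens.
        var mapping = new IdnMapping { UseStd3AsciiRules = true };
        try
        {
            ascii = mapping.GetAscii(domain);
            return true;
        }
        catch (ArgumentException)
        {
            // Thrown for null or empty input, empty or over-long labels,
            // and characters the mapping does not accept.
            ascii = null;
            return false;
        }
    }
}
```

For example, IdnValidator.TryToAscii("a..com", out var ascii) returns false because of the empty label, while a well-formed name yields its ASCII form in ascii.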
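If you are interested in digging deeper into the topic, see the very thorough research documents produced by the Universal Acceptance Steering Group (https://uasg.tech/): "UASG 018A UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/) and "UASG 037 UA-Readiness of Some Programming Language Libraries and Frameworks EN" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/).
In addition, if you want to understand the overall process, challenges and issues of implementing an internationalized email solution, you can read the following RFCs: RFC 6530 (Overview and Framework for Internationalized Email), RFC 6531 (SMTP Extension for Internationalized Email), RFC 6532 (Internationalized Email Headers), RFC 6533 (Internationalized Delivery Status and Disposition Notifications), RFC 6855 (IMAP Support for UTF-8), RFC 6856 (Post Office Protocol Version 3 (POP3) Support for UTF-8), RFC 6857 (Post-Delivery Message Downgrading for Internationalized Email Messages), and RFC 6858 (Simplified POP and IMAP Downgrading for Internationalized Email).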