Does HTML(5) ignore graphemes?
A grapheme is the smallest "unit" of a writing system. In English we normally just think of the characters A-Z, but other languages have accents. Unicode lets you add combining accents to a character to form a grapheme, and there is a generalized algorithm that breaks a sequence of Unicode code points into logical grapheme clusters (where each cluster of code points represents one grapheme).
For example:
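As a rough illustration of that clustering idea, here is a minimal Python sketch. It is not the full Unicode segmentation algorithm (UAX #29): it assumes that a cluster is simply a base code point plus any following code points with a non-zero canonical combining class. The helper name `rough_clusters` and the sample string are made up for this example.

```python
import unicodedata


def rough_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplification: real grapheme cluster segmentation has more rules
    (Hangul syllables, emoji sequences, CR LF, ...). Only combining marks
    are handled here.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


# e + acute accent, a + diaeresis + dot below, plain b
sample = "e\u0301a\u0308\u0323b"
clusters = rough_clusters(sample)
print(len(sample), "code points ->", len(clusters), "clusters")
```

The point is only that the number of code points and the number of user-perceived characters can differ, which is the distinction the question turns on.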
<̖̈̌̍br>̗̘̈̉̊̋
There are four graphemes in the text above: <̖̈̌̍ , b, r, and >̗̘̈̉̊̋ (note that <̖̈̌̍ and >̗̘̈̉̊̋ are really just < and > with extra accents added). If I put it in an HTML document:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>test</title>
</head>
<body>
<̖̈̌̍br>̗̘̈̉̊̋
</body>
</html>
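It fails to validate with the experimental validators I've found. Those validators appear to parse by code point rather than by grapheme, so they complain about the accent code points that follow the < (which cannot form a valid HTML5 tag).
Given that these validators are experimental, I don't know whether I should fully trust their results.
Does HTML5 ignore graphemes and only care about code points?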
The HTML specification says:
The term Unicode code point means a Unicode scalar value where possible, and an isolated surrogate code point when not. When a conformance requirement is defined in terms of characters or Unicode code points, a pair of code units consisting of a high surrogate followed by a low surrogate must be treated as the single code point represented by the surrogate pair, but isolated surrogates must each be treated as the single code point with the value of the surrogate.
In this specification, the term character, when not qualified as Unicode character, is synonymous with the term Unicode code point.
The term Unicode character is used to mean a Unicode scalar value (i.e. any Unicode code point that is not a surrogate code point).
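Then, later on, in 8.1.2.1 Start tags and 8.1.2.2 End tags, it defines things using the term character (which, from the above, we know is synonymous with Unicode code point).
This means that when it encounters <̖̈̌̍ , it is really just parsing the code point sequence U+003C, U+0316, U+0308, U+030C, and U+030D. It ignores the concept of graphemes.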
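To see that sequence for yourself, the following small Python sketch lists the code points behind the decorated opening bracket, using only the standard `unicodedata` module. The variable and function names are arbitrary choices for this example.

```python
import unicodedata

# The decorated "<" from the question, written with escapes so the
# individual code points are visible in the source.
opening = "<\u0316\u0308\u030C\u030D"


def describe(text):
    """Return (U+XXXX, Unicode name, canonical combining class) per code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]


rows = describe(opening)
for code_point, name, combining_class in rows:
    print(code_point, name, combining_class)
```

Only the first entry has combining class 0, meaning it is a base character. The rest are combining marks, which a code-point-based parser sees as separate characters following the `<`.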
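Interestingly, this means that the closing >̗̘̈̉̊̋ can still act as a "valid" end of an HTML tag. Its code point sequence is U+003E, U+0317, U+0318, U+0308, U+0309, U+030A, and U+030B. The first code point (U+003E) is just >, so it is consumed as the close of the tag (provided a tag is actually open at that point). The code points that follow, which are combining code points, are then just normal "text" as far as the parser is concerned (text that starts with combining marks and no base character, which isn't very valid Unicode). That still leaves the question of what the renderer will do: are the combining code points simply rendered as garbage, or are they combined with the character just before the tag that was closed?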
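One way to poke at this without a browser is Python's standard `html.parser`. Be aware that it is not a conforming HTML5 parser, so treat the sketch below as an approximation rather than as what the specification mandates. It does work on code points, though, and like the HTML5 tokenizer it only opens a start tag when `<` is followed by an ASCII letter. The `EventLog` class and `tokenize` helper are names invented for this example.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Record which start tags and which text the parser reports."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def tokenize(markup):
    log = EventLog()
    log.feed(markup)
    log.close()
    return log.start_tags, "".join(log.text)


MARKS_OPEN = "\u0316\u0308\u030C\u030D"
MARKS_CLOSE = "\u0317\u0318\u0308\u0309\u030A\u030B"

# The string from the question: marks after both "<" and ">".
decorated = tokenize("<" + MARKS_OPEN + "br>" + MARKS_CLOSE)
# A normal "<br>" whose ">" is followed by combining marks.
plain_open = tokenize("<br>" + MARKS_CLOSE)

print(decorated)
print(plain_open)
```

With this parser, the fully decorated string produces no tag at all, because the code point after `<` is a combining mark rather than a letter, so everything comes back as text. When the tag is opened normally, `>` closes it and the trailing combining marks are reported as a text run of their own.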
The conclusion, though, is that graphemes are not used in HTML parsing. Only code points are.