ECMAScript 2017: Parsing from nonterminal StringLiteral to String values
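I am trying to understand how a string literal is converted to its final String value (consisting of code unit values) according to ECMAScript 2017.
Relevant excerpts
5.1.2 The Lexical and RegExp Grammars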
A lexical grammar for ECMAScript is given in clause 11. This grammar
has as its terminal symbols Unicode code points that conform to the
rules for SourceCharacter defined in 10.1. It defines a set of
productions, starting from the goal symbol InputElementDiv,
InputElementTemplateTail, or InputElementRegExp, or
InputElementRegExpOrTemplateTail, that describe how sequences of such
code points are translated into a sequence of input elements.
Input elements other than white space and comments form the terminal
symbols for the syntactic grammar for ECMAScript and are called
ECMAScript tokens. These tokens are the reserved words, identifiers,
literals, and punctuators of the ECMAScript language.
5.1.4 The Syntactic Grammar
When a stream of code points is to be parsed as an ECMAScript Script
or Module, it is first converted to a stream of input elements by
repeated application of the lexical grammar; this stream of input
elements is then parsed by a single application of the syntactic
grammar.
and
11 ECMAScript Language: Lexical Grammar
The source text of an ECMAScript Script or Module is first converted
into a sequence of input elements, which are tokens, line terminators,
comments, or white space. The source text is scanned from left to
right, repeatedly taking the longest possible sequence of code points
as the next input element.
11.8.4 String Literals
StringLiteral ::
" DoubleStringCharacters_opt "
' SingleStringCharacters_opt '
SingleStringCharacters ::
SingleStringCharacter SingleStringCharacters_opt
SingleStringCharacter ::
SourceCharacter but not one of ' or \ or LineTerminator
\ EscapeSequence
LineContinuation
EscapeSequence ::
CharacterEscapeSequence
0 [lookahead ∉ DecimalDigit]
HexEscapeSequence
UnicodeEscapeSequence
CharacterEscapeSequence ::
SingleEscapeCharacter
NonEscapeCharacter
NonEscapeCharacter ::
SourceCharacter but not one of EscapeCharacter or LineTerminator
EscapeCharacter ::
SingleEscapeCharacter
DecimalDigit
x
u
11.8.4.3 Static Semantics: SV
A string literal stands for a value of the String type. The String
value (SV) of the literal is described in terms of code unit values
contributed by the various parts of the string literal.
and
The SV of SingleStringCharacter :: SourceCharacter but not one of ' or
\ or LineTerminator is the UTF16Encoding of the code point value of
SourceCharacter.
The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the
EscapeSequence.
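For reference, the UTF16Encoding operation that the first rule relies on (10.1.1) can be sketched roughly as below. This is only a TypeScript sketch of my reading of that algorithm, not spec text; the name utf16Encoding and the explicit range check are assumptions added for illustration.

```typescript
// Sketch of 10.1.1 UTF16Encoding(cp): returns one or two 16-bit code units.
function utf16Encoding(cp: number): number[] {
  // The spec asserts 0 <= cp <= 0x10FFFF; here that is an explicit check.
  if (cp < 0 || cp > 0x10ffff || Math.floor(cp) !== cp) {
    throw new RangeError("not a Unicode code point: " + cp);
  }
  // Code points up to 0xFFFF are encoded as a single code unit.
  if (cp <= 0xffff) {
    return [cp];
  }
  // Supplementary code points are split into a surrogate pair.
  const offset = cp - 0x10000;
  const lead = Math.floor(offset / 0x400) + 0xd800;
  const trail = (offset % 0x400) + 0xdc00;
  return [lead, trail];
}

const unitsForB = utf16Encoding(0x62); // [0x62]
const unitsForEmoji = utf16Encoding(0x1f600); // [0xd83d, 0xde00]
```

Every character in 'b\ar' is below 0xFFFF, so each one contributes exactly one code unit.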
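Question
Assume we have the string literal 'b\ar'. I now want to follow the lexical grammar and the static semantics above to turn the string literal into a set of code unit values.
- b\ar is recognized as CommonToken
- b\ar is further recognized as StringLiteral
- StringLiteral is translated to SingleStringCharacters
- Each code point in SingleStringCharacters is translated to SingleStringCharacter
- Each SingleStringCharacter that is not preceded by \ is translated to SourceCharacter
- \a is recognized as \ EscapeSequence
- EscapeSequence (a) is translated to NonEscapeCharacter
- NonEscapeCharacter is translated to SourceCharacter
- All SourceCharacters are translated to "any Unicode code point"
- Lastly, the SV rules are applied to get the String value and its code unit values
The problem I have is that the StringLiteral input element is now:
SourceCharacter, \ SourceCharacter, SourceCharacter
There is no SV rule for \ SourceCharacter, only for \ EscapeSequence.
This makes me wonder whether I have the order wrong, or have misunderstood how the lexical and syntactic grammars are applied.
I am also confused about how the SV rules are applied at all, because they are defined on nonterminal symbols rather than terminal symbols (which should be what remains after applying the lexical grammar).
Any help is deeply appreciated.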
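Alright, assume we are working with the single token 'b\ar', which is a StringLiteral token as you said. Applying the algorithms defined in 11.8.4.3 Static Semantics: SV and 10.1.1 Static Semantics: UTF16Encoding(cp), we follow the SV rules:
- The SV of StringLiteral :: ' SingleStringCharacters ' is the SV of SingleStringCharacters.
  - Unwrap the quotes, since we just run SV recursively on the SingleStringCharacters part, i.e. SV(b\ar).
- The SV of SingleStringCharacters :: SingleStringCharacter SingleStringCharacters is a sequence of one or two code units that is the SV of SingleStringCharacter, followed by all the code units in the SV of SingleStringCharacters in order.
  - This says "call SV on every SingleStringCharacter, appending the results".
  - SV(b)
    - The SV of SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
      - Code point "b" is code unit 0x0062, so the result here is a code unit sequence holding the single 16-bit unit 0x0062.
  - SV(\a)
    - The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
      - Essentially SV(EscapeSequence), which here is SV(a) (without the \ prefix).
    - The SV of EscapeSequence :: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
      - Essentially just a pass-through of SV(a).
    - The SV of CharacterEscapeSequence :: NonEscapeCharacter is the SV of the NonEscapeCharacter.
      - More pass-through.
    - The SV of NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
      - Code point "a" is code unit 0x0061, so this results in the single-unit sequence containing just 0x0061.
  - SV(r)
    - Following the same steps as SV(b), this results in the single-unit sequence containing 0x0072.
- Merging the sequences SV(b) + SV(\a) + SV(r) back together, the value of the string is the sequence of UTF-16 code units [0x0062, 0x0061, 0x0072]. That sequence of code units is the string bar.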
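To make the walkthrough concrete, here is a rough TypeScript sketch that applies the same rules to the text between the quotes. It is not how any engine implements SV: the function name svOfSingleStringCharacters is made up, it only covers the two SingleStringCharacter alternatives used above, and it throws on every other escape form. Since TypeScript strings are already UTF-16, charCodeAt stands in for UTF16Encoding here.

```typescript
// Hypothetical helper: SV of the text between the quotes of a single-quoted
// string literal, limited to the productions used in the walkthrough.
function svOfSingleStringCharacters(body: string): number[] {
  // EscapeCharacter :: SingleEscapeCharacter | DecimalDigit | x | u
  const escapeCharacters = "'\"\\bfnrtv0123456789xu";
  const lineTerminators = "\n\r\u2028\u2029";
  const units: number[] = [];
  let i = 0;
  while (i < body.length) {
    const ch = body.charAt(i);
    if (ch === "\\") {
      // SingleStringCharacter :: \ EscapeSequence
      const next = body.charAt(i + 1);
      if (next === "" || escapeCharacters.indexOf(next) !== -1 || lineTerminators.indexOf(next) !== -1) {
        throw new Error("escape form not covered by this sketch: \\" + next);
      }
      // EscapeSequence -> CharacterEscapeSequence -> NonEscapeCharacter:
      // the character after the backslash contributes its own code unit.
      units.push(body.charCodeAt(i + 1));
      i += 2;
    } else {
      // SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator
      if (ch === "'" || lineTerminators.indexOf(ch) !== -1) {
        throw new Error("character not allowed unescaped in a single-quoted literal");
      }
      units.push(body.charCodeAt(i));
      i += 1;
    }
  }
  return units;
}

// "b\\ar" in TypeScript source is the four characters b, \, a, r.
const barUnits = svOfSingleStringCharacters("b\\ar"); // [0x62, 0x61, 0x72]
const decodedText = barUnits.map((u) => String.fromCharCode(u)).join(""); // "bar"
```

Note that the b and r are handled by the plain-character branch, while the a only reaches its code unit after going through the escape branch.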
EDIT:
I thought we should first apply the lexical grammar and end up with tokens, and then subsequently apply the SV rules?
From the standpoint of the lexer, the "token" is the StringLiteral; everything inside it is just information about how to parse it. EscapeSequence is not a type of token.
SV defines how a StringLiteral token is broken down into a sequence of code units.
As stated in 11 ECMAScript Language: Lexical Grammar:
The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.
These "input elements" are the tokens that are used by the parser grammar.
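As a rough illustration of that point, here is a toy scanner in TypeScript. Everything in it is an assumption made for the example: the InputElement shape, the three element types, and the simplification that only spaces, tabs and single-quoted strings are recognized. It is nowhere near the real lexical grammar, but it shows the whole literal coming out as one input element, with the backslash and the character after it staying inside that element.

```typescript
// Hypothetical input element shape; the spec does not prescribe one.
interface InputElement {
  type: "StringLiteral" | "WhiteSpace" | "Other";
  text: string;
}

function scanInputElements(source: string): InputElement[] {
  const elements: InputElement[] = [];
  let i = 0;
  while (i < source.length) {
    const start = i;
    const ch = source.charAt(i);
    if (ch === "'") {
      i += 1; // opening quote
      while (i < source.length && source.charAt(i) !== "'") {
        // A backslash consumes the following character as well,
        // but both stay part of the same StringLiteral element.
        i += source.charAt(i) === "\\" ? 2 : 1;
      }
      if (i >= source.length) {
        throw new Error("unterminated string literal");
      }
      i += 1; // closing quote
      elements.push({ type: "StringLiteral", text: source.slice(start, i) });
    } else if (ch === " " || ch === "\t") {
      while (i < source.length && (source.charAt(i) === " " || source.charAt(i) === "\t")) {
        i += 1;
      }
      elements.push({ type: "WhiteSpace", text: source.slice(start, i) });
    } else {
      // Anything else is lumped together until white space or a quote.
      while (i < source.length && " \t'".indexOf(source.charAt(i)) === -1) {
        i += 1;
      }
      elements.push({ type: "Other", text: source.slice(start, i) });
    }
  }
  return elements;
}

// Source text: x = 'b\ar'
const scanned = scanInputElements("x = 'b\\ar'");
```

Here scanned holds five elements, and the last one is a single StringLiteral whose text is 'b\ar'. SV is then evaluated over the structure inside that one element.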
Assuming the order of events is right, my second question is around SV(\a). The first escape sequence rule is applied and we are left with SV(a), which should follow the same path as SV(b), no?
It is not just the value; there is also the data type. Using Flow/TypeScript-style annotations, you could think of the above steps
- The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
- The SV of EscapeSequence :: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
- The SV of CharacterEscapeSequence :: NonEscapeCharacter is the SV of the NonEscapeCharacter.
- The SV of NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
as if it were an overloaded function, e.g.
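// Illustrative pseudocode only: one SV "overload" per grammar production.
function SV(parts: ["\\", EscapeSequence]) {
  return SV(parts[1]); // drop the backslash, recurse into the escape
}
function SV(parts: [CharacterEscapeSequence]) {
  return SV(parts[0]); // pass-through
}
function SV(parts: [NonEscapeCharacter]) {
  return SV(parts[0]); // pass-through
}
function SV(parts: [SourceCharacter]) {
  return UTF16Encoding(parts[0]); // the code point becomes code units
}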
so SV(a) is kind of like SV("a": [CharacterEscapeSequence]), whereas SV(b) has a different type.
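If you want something that actually type-checks, one way to express the same idea is a tagged union with one node type per production. This is just a sketch under my own assumptions: the node shapes, the kind tags and the names svOfNode and encodeCodePoint are invented for illustration and appear nowhere in the spec.

```typescript
// Hypothetical parse-tree nodes, one per production used for 'b\ar'.
interface SourceCharacter { codePoint: number }
interface NonEscapeCharacter { kind: "NonEscapeCharacter"; char: SourceCharacter }
interface CharacterEscapeSequence { kind: "CharacterEscapeSequence"; inner: NonEscapeCharacter }
interface EscapeSequence { kind: "EscapeSequence"; inner: CharacterEscapeSequence }
// SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator
interface PlainCharacter { kind: "PlainCharacter"; char: SourceCharacter }
// SingleStringCharacter :: \ EscapeSequence
interface EscapedCharacter { kind: "EscapedCharacter"; escape: EscapeSequence }

type SvNode =
  | PlainCharacter
  | EscapedCharacter
  | EscapeSequence
  | CharacterEscapeSequence
  | NonEscapeCharacter;

// Stand-in for UTF16Encoding(cp).
function encodeCodePoint(cp: number): number[] {
  if (cp <= 0xffff) {
    return [cp];
  }
  const offset = cp - 0x10000;
  return [Math.floor(offset / 0x400) + 0xd800, (offset % 0x400) + 0xdc00];
}

// Which rule applies is decided by the node type, not by the character.
function svOfNode(node: SvNode): number[] {
  if (node.kind === "EscapedCharacter") {
    return svOfNode(node.escape);
  }
  if (node.kind === "EscapeSequence" || node.kind === "CharacterEscapeSequence") {
    return svOfNode(node.inner); // pass-through rules
  }
  // PlainCharacter and NonEscapeCharacter both end in UTF16Encoding.
  return encodeCodePoint(node.char.codePoint);
}

const plainB: PlainCharacter = { kind: "PlainCharacter", char: { codePoint: 0x62 } };
const escapedA: EscapedCharacter = {
  kind: "EscapedCharacter",
  escape: {
    kind: "EscapeSequence",
    inner: {
      kind: "CharacterEscapeSequence",
      inner: { kind: "NonEscapeCharacter", char: { codePoint: 0x61 } },
    },
  },
};
const plainR: PlainCharacter = { kind: "PlainCharacter", char: { codePoint: 0x72 } };

const literalUnits = svOfNode(plainB).concat(svOfNode(escapedA), svOfNode(plainR));
```

literalUnits comes out as [0x62, 0x61, 0x72]. The a and the b end up in the same final branch, but they get there as differently typed nodes, which is why separate SV rules exist for them.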