ECMAScript 2017: Parsing from nonterminal StringLiteral to String values

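I am trying to understand ECMAScript 2017 by following the conversion of a string literal into its final String value (made up of code unit values).

Relevant excerpts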

5.1.2 The Lexical and RegExp Grammars

A lexical grammar for ECMAScript is given in clause 11. This grammar has as its terminal symbols Unicode code points that conform to the rules for SourceCharacter defined in 10.1. It defines a set of productions, starting from the goal symbol InputElementDiv, InputElementTemplateTail, or InputElementRegExp, or InputElementRegExpOrTemplateTail, that describe how sequences of such code points are translated into a sequence of input elements.

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language.

5.1.4 The Syntactic Grammar

When a stream of code points is to be parsed as an ECMAScript Script or Module, it is first converted to a stream of input elements by repeated application of the lexical grammar; this stream of input elements is then parsed by a single application of the syntactic grammar.

11 ECMAScript Language: Lexical Grammar

The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.

11.8.4 String Literals

StringLiteral ::
    " DoubleStringCharacters_opt "
    ' SingleStringCharacters_opt '

SingleStringCharacters ::
    SingleStringCharacter SingleStringCharacters_opt

SingleStringCharacter ::
    SourceCharacter but not one of ' or \ or LineTerminator
    \ EscapeSequence
    LineContinuation

EscapeSequence ::
    CharacterEscapeSequence
    0 [lookahead ∉ DecimalDigit]
    HexEscapeSequence
    UnicodeEscapeSequence

CharacterEscapeSequence ::
    SingleEscapeCharacter
    NonEscapeCharacter

NonEscapeCharacter ::
    SourceCharacter but not one of EscapeCharacter or LineTerminator

EscapeCharacter ::
    SingleEscapeCharacter
    DecimalDigit
    x
    u

11.8.4.3 Static Semantics: SV

A string literal stands for a value of the String type. The String value (SV) of the literal is described in terms of code unit values contributed by the various parts of the string literal.

The SV of SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.

The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.


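Question

Suppose we have the string literal 'b\ar'. I now want to follow the lexical grammar and the semantic rules above to turn that string literal into a set of code unit values.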

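  1. b\ar is recognized as a CommonToken
  2. b\ar is further recognized as a StringLiteral
  3. The StringLiteral is translated into SingleStringCharacters
  4. Each code point in SingleStringCharacters is translated into a SingleStringCharacter
  5. Each SingleStringCharacter not preceded by \ is translated into a SourceCharacter
  6. \a is recognized as \EscapeSequence
  7. The EscapeSequence (a) is translated into a NonEscapeCharacter
  8. The NonEscapeCharacter is translated into a SourceCharacter
  9. All SourceCharacters are translated into any Unicode code point
  10. Finally, the SV rules are applied to get the String value and its code unit values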

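The problem I have is that the StringLiteral input element is now:

SourceCharacter, \ SourceCharacter, SourceCharacter

There is no SV rule for \SourceCharacter, only for \EscapeSequence.

This makes me wonder whether I have the order wrong, or have misunderstood how the lexical and syntactic grammars are applied.

I am also confused about how exactly the SV rules get applied, because they are defined on nonterminal symbols, not on terminal symbols (which should be what is left after applying the lexical grammar).

Any help is greatly appreciated.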

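Well, assume we are working with the single token 'b\ar', which, as you said, is a StringLiteral token. Applying the algorithms defined in 11.8.4.3 Static Semantics: SV as well as 10.1.1 Static Semantics: UTF16Encoding(cp), we follow the SV rules:

  1. The SV of StringLiteral :: ' SingleStringCharacters ' is the SV of SingleStringCharacters.
    • Unwrap the quotes, since we only run SV recursively on the SingleStringCharacters part, e.g. SV(b\ar)
  2. The SV of SingleStringCharacters :: SingleStringCharacter SingleStringCharacters is a sequence of one or two code units that is the SV of SingleStringCharacter followed by all the code units in the SV of SingleStringCharacters in order.

    This says "call SV on every SingleStringCharacter, appending the results".

    1. SV(b)
      1. The SV of SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
        • The code point "b" is the code unit \x0062, so the result here is essentially a code unit sequence holding the single 16-bit unit \x0062.
    2. SV(\a)
      1. The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
        • Essentially SV(EscapeSequence), which here is SV(a) (without the \ prefix)
      2. The SV of EscapeSequence :: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
        • Basically just passing SV(a) through
      3. The SV of CharacterEscapeSequence :: NonEscapeCharacter is the SV of the NonEscapeCharacter.
        • More pass-through
      4. The SV of NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.
        • The code point "a" is the code unit \x0061, so this results in a single-unit sequence containing just \x0061.
    3. SV(r)
      • Following the same steps as SV(b), this results in a single-unit sequence containing \x0072.
  3. Joining the sequences SV(b) + SV(\a) + SV(r) back together, the value of the string is the sequence of UTF-16 code units [\x0062, \x0061, \x0072]. That sequence of code units results in bar.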
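
The walk above can be checked with a small TypeScript sketch. It is only an illustration under stated assumptions: `utf16Encoding` follows the description of UTF16Encoding(cp) in 10.1.1, while `svOfSimpleBody` is a hypothetical helper (the spec defines no such function) that handles only plain characters and a backslash followed by a NonEscapeCharacter.

```typescript
// UTF16Encoding(cp): one code unit for code points up to 0xFFFF,
// otherwise a surrogate pair.
function utf16Encoding(cp: number): number[] {
  if (cp <= 0xffff) return [cp];
  const cu1 = Math.floor((cp - 0x10000) / 0x400) + 0xd800;
  const cu2 = ((cp - 0x10000) % 0x400) + 0xdc00;
  return [cu1, cu2];
}

// Hypothetical helper: SV for the text between the quotes of a
// single-quoted literal. Only two cases are handled in this sketch:
//   SingleStringCharacter :: SourceCharacter (plain character)
//   SingleStringCharacter :: \ EscapeSequence, where the EscapeSequence
//   bottoms out in a NonEscapeCharacter (such as the "a" in \a).
function svOfSimpleBody(body: string): number[] {
  const chars = Array.from(body); // split by code point, not by code unit
  const units: number[] = [];
  for (let i = 0; i < chars.length; i++) {
    let ch = chars[i];
    if (ch === "\\") {
      i++;
      if (i >= chars.length) throw new SyntaxError("dangling backslash");
      ch = chars[i];
      // EscapeCharacter (SingleEscapeCharacter, DecimalDigit, x, u) and
      // LineTerminator need other SV rules, which are left out here.
      if (/^['"\\bfnrtv0-9xu\n\r\u2028\u2029]$/.test(ch)) {
        throw new Error("escape not covered by this sketch: \\" + ch);
      }
    }
    units.push(...utf16Encoding(ch.codePointAt(0)!));
  }
  return units;
}

const units = svOfSimpleBody("b\\ar"); // the literal body b\ar
const value = String.fromCharCode(...units);
console.log(units.map((u) => u.toString(16)), value);
```

Both the escaped `\a` and the unescaped `b` and `r` end in the same UTF16Encoding call; they only reach it through different productions.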

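Edit:

I thought we should first apply the lexical grammar and end up with tokens, and then subsequently apply the SV rules?

From the lexer's point of view, the "token" is the StringLiteral, and everything inside it is just information about how to parse it. EscapeSequence is not a token type.

SV defines how to break a StringLiteral token down into a sequence of code units.

As stated in 11 ECMAScript Language: Lexical Grammar: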

The source text of an ECMAScript Script or Module is first converted into a sequence of input elements, which are tokens, line terminators, comments, or white space. The source text is scanned from left to right, repeatedly taking the longest possible sequence of code points as the next input element.

These "input elements" are the tokens consumed by the parser grammar.
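
To make the token boundary concrete, here is a heavily simplified scanner sketch in TypeScript. Everything in it is an assumption made for illustration: the `InputElement` shape, the `scan` function, and the four patterns cover only a sliver of the real lexical grammar (for example, LineContinuation is not handled). It does follow the "longest possible sequence" rule by trying every pattern at the current position and keeping the longest match.

```typescript
type ElementType = "WhiteSpace" | "IdentifierName" | "StringLiteral" | "Punctuator";
type InputElement = { type: ElementType; raw: string };

// Hypothetical, simplified patterns; each is anchored at the start of
// the remaining source text.
const patterns: Array<[ElementType, RegExp]> = [
  ["WhiteSpace", /^[ \t]+/],
  ["IdentifierName", /^[A-Za-z_$][A-Za-z0-9_$]*/],
  // A quote, then plain characters or backslash pairs, then a quote.
  ["StringLiteral", /^'(?:[^'\\\n\r\u2028\u2029]|\\[^\n\r\u2028\u2029])*'/],
  ["Punctuator", /^[=;]/],
];

function scan(source: string): InputElement[] {
  const elements: InputElement[] = [];
  let pos = 0;
  while (pos < source.length) {
    const rest = source.slice(pos);
    let best: InputElement | null = null;
    for (const [type, re] of patterns) {
      const m = re.exec(rest);
      // Keep the longest match among all candidate input elements.
      if (m && (best === null || m[0].length > best.raw.length)) {
        best = { type, raw: m[0] };
      }
    }
    if (best === null) throw new SyntaxError("unexpected input at index " + pos);
    elements.push(best);
    pos += best.raw.length;
  }
  return elements;
}

const elements = scan("x = 'b\\ar';");
console.log(elements.map((e) => e.type + " " + JSON.stringify(e.raw)));
```

The scanner emits the whole of `'b\ar'` as one StringLiteral element with its raw text intact. No EscapeSequence element appears in the output; the escape only matters later, when SV is computed over the inner structure of that one token.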

Assuming the order of events is right, my second question is around SV(\a). The first escape sequence rule is applied and we are left with SV(a), which should follow the same path as SV(b) no?

It is not just the value that matters, but also the data type. Using Flow/TypeScript-style annotations, you can think of the steps above

  1. The SV of SingleStringCharacter :: \ EscapeSequence is the SV of the EscapeSequence.
  2. The SV of EscapeSequence :: CharacterEscapeSequence is the SV of the CharacterEscapeSequence.
  3. The SV of CharacterEscapeSequence :: NonEscapeCharacter is the SV of the NonEscapeCharacter.
  4. The SV of NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator is the UTF16Encoding of the code point value of SourceCharacter.

as if SV were an overloaded function, e.g.

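// SingleStringCharacter :: \ EscapeSequence
function SV(parts: ["\\", EscapeSequence]) {
    return SV(parts[1]);
}
// EscapeSequence :: CharacterEscapeSequence
function SV(parts: [CharacterEscapeSequence]) {
    return SV(parts[0]);
}
// CharacterEscapeSequence :: NonEscapeCharacter
function SV(parts: [NonEscapeCharacter]) {
    return SV(parts[0]);
}
// NonEscapeCharacter :: SourceCharacter
function SV(parts: [SourceCharacter]) {
    return UTF16Encoding(parts[0]);
}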

So SV(a) is kind of like SV("a": [CharacterEscapeSequence]), whereas SV(b) has a different type.
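
The overloads above are illustrative rather than compilable as written. One way to express the same idea in TypeScript that does compile is a discriminated union over parse nodes, sketched below. The node shapes, the `kind` tags, and the `ParseNode` name are assumptions made for this example; the spec describes productions, not these types.

```typescript
// Hypothetical parse-node types for the productions used by 'b\ar'.
type NonEscapeCharacter = { kind: "NonEscapeCharacter"; codePoint: number };
type CharacterEscapeSequence = { kind: "CharacterEscapeSequence"; child: NonEscapeCharacter };
type EscapeSequence = { kind: "EscapeSequence"; child: CharacterEscapeSequence };
// SingleStringCharacter :: SourceCharacter but not one of ' or \ or LineTerminator
type PlainChar = { kind: "PlainSingleStringCharacter"; codePoint: number };
// SingleStringCharacter :: \ EscapeSequence
type EscapedChar = { kind: "EscapedSingleStringCharacter"; child: EscapeSequence };

type ParseNode =
  | PlainChar
  | EscapedChar
  | EscapeSequence
  | CharacterEscapeSequence
  | NonEscapeCharacter;

// UTF16Encoding(cp) as described in 10.1.1.
function utf16Encoding(cp: number): number[] {
  if (cp <= 0xffff) return [cp];
  const offset = cp - 0x10000;
  return [Math.floor(offset / 0x400) + 0xd800, (offset % 0x400) + 0xdc00];
}

// One SV function that dispatches on the node's production.
function SV(node: ParseNode): number[] {
  switch (node.kind) {
    case "PlainSingleStringCharacter":
    case "NonEscapeCharacter":
      // Leaf rules: UTF16Encoding of the code point value.
      return utf16Encoding(node.codePoint);
    case "EscapedSingleStringCharacter":
    case "EscapeSequence":
    case "CharacterEscapeSequence":
      // Pass-through rules: the SV of the child production.
      return SV(node.child);
  }
}

const plainB: ParseNode = { kind: "PlainSingleStringCharacter", codePoint: 0x62 };
const escapedA: ParseNode = {
  kind: "EscapedSingleStringCharacter",
  child: {
    kind: "EscapeSequence",
    child: {
      kind: "CharacterEscapeSequence",
      child: { kind: "NonEscapeCharacter", codePoint: 0x61 },
    },
  },
};

console.log(SV(plainB), SV(escapedA));
```

Here `plainB` and `escapedA` are differently shaped nodes, so they take different branches of `SV`, even though each ends up contributing a single code unit.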