Custom location tracking in jison-gho

I need to parse a "token-level" language, i.e. the input is already tokenized, with semicolons used as delimiters. Sample input: A;B;A;D0;ASSIGN;X;. My grammar file is shown below.

I want to track the column position of each token. For the sample above, this is how I'd like the columns to be defined:

Input: A;B;A;D0;ASSIGN;X;\n
Col:   1122334445555555666

So basically I want to increment the column each time a semicolon is encountered in the input. I created a function that increments the column count whenever a semicolon is hit, and in every action I simply set the column in yylloc to my custom column count. However, with this approach I have to copy-paste that function call into every single action. Do you know of any cleaner way to do this? Also, since the input is auto-generated, no lexical errors will ever occur in it.

EDIT: Never mind, my solution doesn't actually work, so I'd be glad to hear any suggestions :)

%lex

%{

var delimit = (terminal) => { this.begin('delimit'); return terminal }

var columnInc = () => {
  if (yy.lastLine === undefined) yy.lastLine = -1
  if (yylloc.first_line !== yy.lastLine) {
    yy.lastLine = yylloc.first_line
    yy.columnCount = 0
  }
  yy.columnCount++
}

var setColumn = () => {
  yylloc.first_column = yylloc.last_column = yy.columnCount
}

%}

%x delimit

%%

"ASSIGN"                   { return delimit('ASSIGN'); setColumn() }
"A"                        { return delimit('A'); setColumn() }
<delimit>{DELIMITER}       { columnInc(); this.popState(); setColumn() }
\n                         { setColumn() }
...

There are several ways to do this in jison-gho. When you want a token counter tracked alongside the parse, it always comes down to finding a way to 'hook' into the code path along which the lexer passes its tokens to the parser.

Before we look at some implementations, a few thoughts that may help others who face similar, but slightly different, problems:

  1. A completely custom lexer for your prepared token stream: since your input already is a set of tokens, you could consider using a custom lexer, which would take the input stream as-is and do as little as possible while passing the tokens on to the parser (a rough sketch of this idea follows after this list). This is doable in jison-gho, and a fairly minimal example is demonstrated here:

    https://github.com/GerHobbelt/jison/blob/0.6.1-215/examples/documentation--custom-lexer-ULcase.jison

    Another way to integrate the same custom lexer is demonstrated here:

    https://github.com/GerHobbelt/jison/blob/0.6.1-215/examples/documentation--custom-lexer-ULcase-alt.jison

    Or you may want to include the code from an external file via an %include "documentation--custom-lexer-ULcase.js" statement. Anyway, I digress.

    Given your problem, a lot depends on where that token stream comes from: who turns it into text, and is that outside your control? There is a huge overhead cost in generating and then parsing a very long stream of text, while a custom lexer plus some direct binary communication might reduce network or other costs there.

    The bottom line is: if the token generator and everything up to this parse point is under your control, I personally would favor a custom lexer and no text conversion whatsoever for the intermediary channel. But in the end, that depends largely on your system requirements, context, etc. and is way outside the realm of this SO coding question.

  2. Augmenting the jison lexer: another approach could, of course, be to augment the action code of all (or a limited set of) lexer rules, where you modify yytext to pass this data to the parser. This is the classic approach from the days of yacc/bison. Indeed, yytext does not have to be a string; it can be anything, e.g.

[a-z]   %{
    yytext = new DataInstance(
        yytext, // the token string
        yylloc, // the token location info
        ...     // whatever you want/need...
    );
    return 'ID';  // the lexer token ID for this token
%}

For this problem, however, that would mean a lot of code duplication and hence a maintenance horror.

  3. Hooking into the flow between lexer and parser: this one is new and is facilitated by the jison-gho tool via the pre_lex and post_lex callbacks. (The same mechanism is available around the parse() call, so you can initialize and postprocess a parser run any way you want: pre_parse and post_parse exist for that.)

    Here, since we want to count tokens, the simplest approach is using the post_lex hook, which is invoked only when the lexer has completely parsed yet another token and passes it to the parser. In other words: post_lex is executed at the very end of the lex() call in the parser.

    The documentation for these is included at the top of every generated parser/lexer JS source file, but then, of course, you need to know about that little nugget! ;-)

    Here it is:

    parser.lexer.options:

    • pre_lex: function()

      optional: is invoked before the lexer is invoked to produce another token.

      this refers to the Lexer object.

    • post_lex: function(token) { return token; }

      optional: is invoked when the lexer has produced a token token; this function can override the returned token value by returning another. When it does not return any (truthy) value, the lexer will return the original token.

      this refers to the Lexer object.
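
Coming back to option 1 for a moment: as promised above, here is a rough, hypothetical sketch of what such a completely custom lexer could look like for the semicolon-delimited token stream. It is modeled on the linked ULcase example, so treat the exact interface shown here (setInput/lex plus the yytext/yylloc members) as an assumption to verify against that example rather than as gospel:

// Hypothetical custom lexer sketch for the pre-tokenized input;
// method names and members mirror the linked ULcase example --
// check that example for the complete interface the parser expects.
var lexer = {
    setInput: function (input, yy) {
        // the input is already a token stream: just split on the delimiter
        this.tokens = input.trim().split(';').filter((s) => s !== '');
        this.pos = 0;
        this.yy = yy;
        this.yytext = '';
        this.yylloc = {
            first_line: 1, first_column: 0,
            last_line: 1, last_column: 0
        };
        return this;
    },

    lex: function () {
        if (this.pos >= this.tokens.length) {
            // end-of-input; verify against the linked example
            // how EOF must be signalled to the parser
            return 'EOF';
        }
        this.yytext = this.tokens[this.pos++];
        // the token index doubles as the 'column' the question asks for:
        this.yylloc.first_column = this.yylloc.last_column = this.pos;
        // assumption: every chunk is itself a valid token name in the grammar
        return this.yytext;
    }
};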

Note that options 1 and 3 are not available in vanilla jison. One note regarding option 1: jison does not accept a custom lexer as part of the jison parser/lexer spec file, as demonstrated in the example links above. Of course, you can always work around that by wrapping the generated parser, thus injecting a custom lexer and doing other things there.

Implementing a token counter using post_lex

Now, what does this look like in practice?

Solution 1: let's be nice

We will 'abuse'/use (depends on your POV regarding the use of undocumented features) the yylloc info object and augment it with a counter member. We choose to do it this way so we never risk interfering with (or being disturbed by) the default text/line-oriented yylloc location tracking system in the lexer and parser.

The undocumented bit here is the knowledge that all data members of any yylloc instance will be propagated by the default jison-gho location tracking & merging logic in the parser. Hence, when you tweak a yylloc instance in lexer or parser action code, and that instance is propagated to the output via merge or copy as the parser walks up the grammar tree, your tweaks will be visible in the output.

Hooking into the lexer token output means we have to augment the lexer first, which we can easily do in the %% section before the /lex end-of-lexer-spec marker:

// lexer extra code

var token_counter = 0;

lexer.post_lex = function (token) {
    // count the token and expose the count via its yylloc info
    ++token_counter;
    this.yylloc.counter = token_counter;
    return token;
};

// extra helper so multiple parse() calls will restart counting tokens:
lexer.reset_token_counter = function () {
    token_counter = 0;
};

where the magic bit is the statement this.yylloc.counter = token_counter.

We hook it into the flow by injecting the post_lex callback directly into the lexer definition via lexer.post_lex = function (){...}.

We could also have done this via the lexer options, lexer.options.post_lex = function ..., or via the parser-global yy instance, parser.yy.post_lex = function ..., though those approaches would have meant doing this in the parser definition's code chunk or from the runtime which invokes the parser. These two slightly different approaches will not be demonstrated here.

Now all we have to do is complement this with a little chunk of pre_parse code, to ensure multiple parser.parse(input) invocations will each restart with the token counter reset to zero:

// extra helper: reset the token counter at the start of every parse() call:
parser.pre_parse = function (yy) {
    yy.lexer.reset_token_counter();
};

Of course, that bit has to be added to the final code chunk of the parser, after the %% in the grammar-spec part of the jison file.

The full jison source file is available as a gist here.

How to compile and test:

# compile
jison --main so-q-58891186-2.jison
# run test code in main()
node so-q-58891186-2.js

Notes: I have 'faked' the AST construction code in your original source file so that one can easily diff the initial file with the one provided here. All that hack-it-to-make-it-work stuff is at the bottom part of the file.
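
If you want to poke at the generated parser directly instead of via the generated main(), a minimal driver sketch along these lines should do; the file name comes from the gist, and the .parser export is the usual shape of a jison-generated CommonJS module, so do verify both against your generated file:

// quick test driver sketch; assumes the gist's file name and the
// usual `parser` export of a jison-generated module:
var parser = require('./so-q-58891186-2.js').parser;

// thanks to the pre_parse hook above, every parse() call restarts
// the token counter at 1:
var ast1 = parser.parse('A;B;A;D0;ASSIGN;X;\n');
var ast2 = parser.parse('A;B;A;D0;ASSIGN;X;\n');

console.log(JSON.stringify(ast1, null, 2));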

Solution 2: being a bit nasty and re-using the yylloc.column location info and tracking

Instead of using the line info part of yylloc, I opted to use the column part, as I feel it is at roughly the same granularity level as a token sequence index. It does not matter which one you use, line or column, as long as you follow the same pattern.

When we do this right, we get jison-gho's location tracking features added in for free, i.e.: the column and line ranges of grammar rules are automatically computed from the individual tokens' yylloc info, so the first/last members of a rule's yylloc will show the first and last column, sorry, token index of the token sequence matching that grammar rule. This is the classic,merge jison-gho behaviour mentioned for the --default-action CLI option:

--default-action

Specify the kind of default action that jison should include for every parser rule.

You can specify a mode for value handling ($$) and one for location tracking (@$), separated by a comma, e.g.:

       --default-action=ast,none

Supported value modes:

  • classic : generate a parser which includes the default

                   $$ = $1;
    

    action for every rule.

  • ast : generate a parser which produces a simple AST-like tree-of-arrays structure: every rule produces an array of its production terms' values. Otherwise it is identical to classic mode.
  • none : JISON will produce a slightly faster parser but then you are solely responsible for propagating rule action $$ results.

    The default rule value is still deterministic though as it is set to undefined: $$ = undefined;

  • skip : same as none mode, except JISON does NOT INJECT a default value action ANYWHERE, hence rule results are not deterministic when you do not properly manage the $$ value yourself!

Supported location modes:

  • merge : generate a parser which includes the default @$ = merged(@1..@n); location tracking action for every rule, i.e. the rule's production 'location' is the range spanning its terms.
  • classic : same as merge mode.
  • ast : ditto.
  • none : JISON will produce a slightly faster parser but then you are solely responsible for propagating rule action @$ location results. The default rule location is still deterministic though, as it is set to undefined: @$ = undefined;
  • skip : same as "none" mode, except JISON does NOT INJECT a default location action ANYWHERE, hence rule location results are not deterministic when you do not properly manage the @$ value yourself!

Notes:

  • when you do specify a value default mode, but DO NOT specify a location value mode, the latter is assumed to be the same as the former.

    Hence:

         --default-action=ast
    

    equals:

         --default-action=ast,ast
    
  • when you do not specify an explicit default mode or only a "true"/"1" value, the default is assumed: classic,merge.

  • when you specify "false"/"0" as an explicit default mode, none,none is assumed. This produces the fastest deterministic parser.

Default setting: [classic,merge]
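
In other words, the behaviour relied upon here is what you get by default; spelled out explicitly on the command line, the compile step for Solution 2 should be equivalent to:

# explicit equivalent of the default classic,merge setting:
jison --main --default-action=classic,merge so-q-58891186-3.jison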

Now, rather than adding a new counter member, we are going to 're-use' the first_column and last_column members of yylloc; the magic bits doing the work are nearly the same as in Solution 1:

Augmenting the lexer in its %% section:

// lexer extra code

var token_counter = 0;

lexer.post_lex = function (token) {
    ++token_counter;

    this.yylloc.first_column = token_counter;
    this.yylloc.last_column = token_counter;

    return token;
};

// extra helper so multiple parse() calls will restart counting tokens:
lexer.reset_token_counter = function () {
    token_counter = 0;
};

Side Note: we 'abuse' the column part for tracking the token number; meanwhile the range member will still be usable to debug the raw text input as that one will track the positions within the raw input string.

Make sure to tweak both first_column and last_column so that the default location tracking 'merging' code in the generated parser can still do its job: that way we'll get to see which range of tokens constitute a particular grammar rule/element, just like it were text columns.

Could've done the same with first_line/last_line, but I felt it more suitable to use the column part for this as it's at the same very low granularity level as 'token index'...
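
To see what the default merging buys us here, consider a hypothetical rule sketch; the nonterminal and token names below are made up for illustration, not taken from the actual grammar in the gist:

assignment
    : VariableRef ASSIGN VariableRef
        {
            /*
             * With the default `classic,merge` location mode, @$ has already
             * been set to merged(@1..@3) before this action runs, so after
             * the Solution 2 tweak @$.first_column..@$.last_column is the
             * token index range covered by this rule: no per-rule code is
             * needed to propagate the counts.
             */
            $$ = { type: 'AssignmentValueExpr', target: $1, source: $3, loc: @$ };
        }
    ;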

Again, we hook it into the flow by injecting the post_lex callback directly into the lexer definition via lexer.post_lex = function (){...}.

Same as in Solution 1, all we now have to do is complement this with a little chunk of pre_parse code, to ensure multiple parser.parse(input) invocations will each restart with the token counter reset to zero:

// extra helper: reset the token counter at the start of every parse() call:
parser.pre_parse = function (yy) {
    yy.lexer.reset_token_counter();
};

Of course, that bit has to be added to the final code chunk of the parser, after the %% in the grammar-spec part of the jison file.

The full jison source file is available as a gist here.

How to compile and test:

# compile
jison --main so-q-58891186-3.jison
# run test code in main()
node so-q-58891186-3.js

Aftermath / observations about the provided solutions

Observe the test verification data at the end of both provided jison files to see how the token index turns up in the parser output:

Solution 1 (stripped, partial) output:

  "type": "ProgramStmt",
  "a1": [
    {
      "type": "ExprStmt",
      "a1": {
        "type": "AssignmentValueExpr",
        "target": {
          "type": "VariableRefExpr",
          "a1": "ABA0",
          "loc": {
            "range": [
              0,
              8
            ],
            "counter": 1
          }
        },
        "source": {
          "type": "VariableRefExpr",
          "a1": "X",
          "loc": {
            "counter": 6
          }
        },
        "loc": {
          "counter": 1
        }
      },
      "loc": {
        "counter": 1
      }
    }
  ],
  "loc": {
    "counter": 1
  }

Note that the counter index is not accurate for compound elements, i.e. elements constructed from multiple tokens matching one or more grammar rules: only the first token's index is retained, since counter is a single value rather than a first/last pair, so the default location-merging logic has nothing to combine it with.

Solution 2 fares much better in that regard:

Solution 2 (stripped, partial) output:

      "type": "ExprStmt",
      "a1": {
        "type": "AssignmentValueExpr",
        "target": {
          "type": "VariableRefExpr",
          "a1": "ABA0",
          "loc": {
            "first_column": 1,
            "last_column": 4,
          }
        },
        "source": {
          "type": "VariableRefExpr",
          "a1": "X",
          "loc": {
            "first_column": 6,
            "last_column": 6,
          }
        },
        "loc": {
          "first_column": 1,
          "last_column": 6,
        }
      },
      "loc": {
        "first_column": 1,
        "last_column": 7,
      }
    }

As you can see, the first_column plus last_column members nicely track the set of tokens which constitute each part. (Do note that the counter increment code implies we start counting at one (1), not zero (0)!)

Parting thoughts

Given the input A;B;A;D0;ASSIGN;X;SEMICOLON;, the current grammar parses this as the equivalent of ABA0 = X;, and I wonder whether that is what you really want to get: constructing the identifier ABA0 that way seems a bit odd to me.

Alas, that is not relevant to your question. It is just that I bumped into something rather unusual here, that's all. Never mind.

Cheers, and I hope this long read is helpful to more of us. :-)

Source files: