Location of the error token always starts at 0

Question

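I am writing a parser with error handling. I want to show the user the exact location of the part of the input that could not be parsed.

However, the location of the error token always starts at 0, even when there were successfully parsed parts before it.

Here is a heavily simplified example of what I did. (The problematic part is probably in parser.yy.)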

Location.hh:

#pragma once
#include <string>

// The full version tracks position in bytes, line number and offset in the current line.
// Here however, I've shortened it to line number only.
struct Location
{
    int beginning, ending;
    operator std::string() const { return std::to_string(beginning) + '-' + std::to_string(ending); }
};

LexerClass.hh:

#pragma once
#include <istream>
#include <string>
#if ! defined(yyFlexLexerOnce)
    #include <FlexLexer.h>
#endif
#include "Location.hh"

class LexerClass : public yyFlexLexer
{
    int currentPosition = 0;
protected:
    std::string *yylval = nullptr;
    Location *yylloc = nullptr;
public:
    LexerClass(std::istream &in) : yyFlexLexer(&in) {}
    [[nodiscard]] int yylex(std::string *const lval, Location *const lloc);
    void onNewLine() { yylloc->beginning = yylloc->ending = ++currentPosition; }
};

lexer.ll:

%{
    #include "./parser.hh"
    #include "./LexerClass.hh"
    
    #undef  YY_DECL
    #define YY_DECL int LexerClass::yylex(std::string *const lval, Location *const lloc)
%}

%option c++ noyywrap
%option yyclass="LexerClass"

%%

%{
    yylval = lval;
    yylloc = lloc;
%}

[[:blank:]] ;
\n          { onNewLine(); }
[0-9]       { return yy::Parser::token::DIGIT; }
.           { return yytext[0]; }

parser.yy:

%language "c++"

%code requires {
    #include "LexerClass.hh"
    #include "Location.hh"
}

%define api.parser.class {Parser}
%define api.value.type {std::string}
%define api.location.type {Location}
%parse-param {LexerClass &lexer}
%defines

%code {
    template<typename RHS>
    void calcLocation(Location &current, const RHS &rhs, const int n);
    #define YYLLOC_DEFAULT(Cur, Rhs, N) calcLocation(Cur, Rhs, N)
    
    #define yylex lexer.yylex
}

%token DIGIT

%%

numbers:
      %empty
    | numbers number ';' { std::cout << std::string(@number) << "\tnumber" << std::endl; }
    | error ';' { yyerrok; std::cerr << std::string(@error) << "\terror context" << std::endl; }
    ;

number:
      DIGIT {}
    | number DIGIT {}
    ;

%%

#include <iostream>

template<typename RHS>
inline void calcLocation(Location &current, const RHS &rhs, const int n)
{
    current = (n <= 1)
        ? YYRHSLOC(rhs, n)
        : Location{YYRHSLOC(rhs, 1).beginning, YYRHSLOC(rhs, n).ending};
}

void yy::Parser::error(const Location &location, const std::string &message)
{
    std::cout << std::string(location) << "\terror: " << message << std::endl;
}

int main()
{
    LexerClass lexer(std::cin);
    yy::Parser parser(lexer);
    return parser();
}

For the input:

Expected output:

0-2 number
3-3 number
5-5 error: syntax error
4-6 error context
7-8 number

Actual output:

0-2 number
3-3 number
5-5 error: syntax error
0-6 error context
7-8 number

Answer 1

Here is your numbers rule, for reference (without the actions, since they are not really relevant):

numbers:
      %empty
    | numbers number ';'
    | error ';'

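numbers is also your start symbol. It should be fairly clear that there is nothing before a numbers non-terminal in any derivation. There is a top-level numbers non-terminal, which encompasses the entire input; it starts with a numbers non-terminal that contains everything except the last number ';', and so on. All of these numbers start at the beginning.

Similarly, the error pseudo-token is at the start of some numbers derivation. So it too must start at the beginning of the input.

In other words, your statement that "the location of the error token always starts at 0, even when there were successfully parsed parts before it" is untestable. The location of the error token always starts at 0 because nothing can come before it, and the output you are getting is "expected". Or at least predictable; I understand that you did not expect it, and it is easy to fall into this confusion. I did not really see it myself until I ran the parser with tracing enabled, which is highly recommended; note that to do so it helps to add an overload of operator<<(std::ostream&, Location const&).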
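The stream overload suggested in this answer could look roughly like the sketch below. It reuses the simplified Location struct from the question; formatted is a hypothetical helper added here only to show the overload in use. The reason it helps is that the trace code Bison generates prints symbol locations with operator<<, so the location type needs such an overload once tracing is compiled in.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Same shape as the simplified Location from the question.
struct Location
{
    int beginning, ending;
    operator std::string() const { return std::to_string(beginning) + '-' + std::to_string(ending); }
};

// Stream a Location as "beginning-ending", matching the string conversion above.
inline std::ostream &operator<<(std::ostream &os, const Location &loc)
{
    return os << loc.beginning << '-' << loc.ending;
}

// Hypothetical helper, not part of the question: demonstrates the overload.
inline std::string formatted(const Location &loc)
{
    std::ostringstream out;
    out << loc;
    return out.str();
}
```

With the overload in Location.hh, tracing is typically switched on by adding %define parse.trace to parser.yy and calling parser.set_debug_level(1) before parser(); check the manual for your Bison version, since the exact directives have changed over time.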

Answer 2

I am building on the other answer, so please read that one first.

Let us consider the rule:

numbers:
      %empty
    | numbers number ';'
    | error ';' { yyerrok; }
    ;

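This means that the non-terminal numbers can be one of three things:

It can be empty.
It can be a number preceded by any valid numbers.
It can be an error.

Do you see the problem? The whole numbers has to be an error, right from the beginning; there is no rule saying that anything else is allowed before it. Bison, of course, dutifully obeys and makes error start at the beginning of the non-terminal numbers. It can do that because error is a jack of all trades and there are no rules about what it may contain. To satisfy your rule, Bison has to extend error over all of the preceding numbers.

Once you understand the problem, fixing it is easy. You just need to tell Bison that numbers is allowed before error: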

numbers:
      %empty
    | numbers number ';'
    | numbers error ';' { yyerrok; }
    ;
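To make the difference between the two grammars concrete, here is a toy model in C++ of the "pop" phase of error recovery. It is not Bison's actual implementation. It only assumes the documented behaviour that the parser discards stack entries until it reaches a state where error is acceptable, and that the error token then begins where the deepest discarded symbol began. Entry, sampleStack and errorBeginning are hypothetical names, and the locations are illustrative values loosely modelled on the question's output.

```cpp
#include <string>
#include <vector>

struct Location { int beginning, ending; };

// One parser stack entry: a grammar symbol and the location it covers.
struct Entry { std::string symbol; Location location; };

// Illustrative stack at the moment of the syntax error:
// a completed `numbers` covering 0-3 and a partial `number` at 4.
inline std::vector<Entry> sampleStack()
{
    return {{"numbers", {0, 3}}, {"number", {4, 4}}};
}

// Pop entries until the symbol on top is one after which `error` may be
// shifted ("" stands for the initial state, i.e. an empty stack).
// Returns where the error token would begin in this toy model.
inline int errorBeginning(std::vector<Entry> stack,
                          const std::string &errorAllowedAfter,
                          int lookaheadBeginning)
{
    int beginning = lookaheadBeginning; // nothing popped: error starts at the lookahead
    while (!stack.empty() && stack.back().symbol != errorAllowedAfter)
    {
        beginning = stack.back().location.beginning;
        stack.pop_back();
    }
    return beginning;
}
```

Under these assumptions, errorBeginning(sampleStack(), "", 5) models the original rule error ';' and pops everything, so the error begins at 0, while errorBeginning(sampleStack(), "numbers", 5) models numbers error ';' and stops at the completed numbers, so the error begins at 4.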

This is, IMO, the best solution. There is another approach, though.

You can move the error token into number:

numbers:
      %empty
    | numbers number ';' { yyerrok; }
    ;

number:
      DIGIT
    | number DIGIT
    | error
    ;

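Note that yyerrok needs to stay in numbers, because the parser would enter an infinite loop if you placed it next to a rule that ends with the error token.

A drawback of this approach is that if you place an action next to this error, it will be triggered multiple times (more or less once per illegal terminal). In some situations that may be preferable, but in general I suggest the first approach.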
