How to distinguish tokens that have similar patterns in the lexer but occur in different contexts in the parser

Question

I have two very similar patterns in Lexer.x: the first one is for numbers, the second one is for bytes. Here they are:

$digit=0-9
$byte=[a-f0-9]


    $digit+                       { \s -> TNum  (readRational s) }
    $digit+.$digit+               { \s -> TNum  (readRational s) }
    $digit+.$digit+e$digit+       { \s -> TNum  (readRational s) }
    $digit+e$digit+               { \s -> TNum  (readRational s) }
    $byte$byte                        { \s -> TByte (encodeUtf8(pack s))     }

And in Parser.y I have:

%token

        cnst                            { TNum  $$}
        byte                            { TByte  $$}
        '['                            { TOSB     }    
        ']'                            { TCSB     }

%%

Expr: 
 '[' byte ']' {}
| cnst {}

With this rule order, I get:

[ 11 ] parse error
11 ok

But when I put the byte pattern before the number patterns in the lexer:

$digit=0-9
$byte=[a-f0-9]

    $byte$byte                        { \s -> TByte (encodeUtf8(pack s))     }
    $digit+                       { \s -> TNum  (readRational s) }
    $digit+.$digit+               { \s -> TNum  (readRational s) }
    $digit+.$digit+e$digit+       { \s -> TNum  (readRational s) }
    $digit+e$digit+               { \s -> TNum  (readRational s) }

I get:

[ 11 ] ok
11 parse error

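I think this happens because the lexer turns the input string into tokens first and only then hands them to the parser. When the parser expects a byte token but receives a number token, it has no chance to produce a different token from that value. What should I do in this situation?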

Answer 1

In that case, you should postpone the interpretation. For example, you can add a TNumByte data constructor that stores the value as a String:

data Token
    = TByte ByteString
    | TNum Rational
    | TNumByte String
    -- …

For a sequence of two $digit characters it is not yet clear whether it should be interpreted as a byte or as a number, so the lexer emits a TNumByte for it:

$digit=0-9
$byte=[a-f0-9]

$digit$digit                  { TNumByte }
$byte$byte                    { \s -> TByte (encodeUtf8(pack s)) }
$digit+                       { \s -> TNum  (readRational s) }
-- \. matches a literal dot; a bare . would match any character
$digit+\.$digit+              { \s -> TNum  (readRational s) }
$digit+\.$digit+e$digit+      { \s -> TNum  (readRational s) }
$digit+e$digit+               { \s -> TNum  (readRational s) }
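The position of the $digit$digit rule at the top of the list matters. Alex picks the rule that matches the longest prefix of the input, and when several rules match the same number of characters, the rule that appears earliest in the file wins. That is also why swapping the rules in the question flipped which input parsed. The following is only a small simulation of that selection policy in plain Haskell, not Alex itself; `Rule`, `pick`, `digitsRule` and `byteRule` are hypothetical names introduced for illustration.

```haskell
import Data.Char (isDigit)

-- Hypothetical stand-in for a lexer rule: a label plus a function that
-- returns the length of the prefix it matches, if it matches at all.
type Rule = (String, String -> Maybe Int)

isByteChar :: Char -> Bool
isByteChar c = isDigit c || c `elem` "abcdef"

-- Mirrors $digit+ : one or more decimal digits.
digitsRule :: Rule
digitsRule = ("$digit+", \s ->
  let n = length (takeWhile isDigit s)
  in if n > 0 then Just n else Nothing)

-- Mirrors $byte$byte : exactly two characters from [a-f0-9].
byteRule :: Rule
byteRule = ("$byte$byte", \s -> case s of
  (a:b:_) | isByteChar a && isByteChar b -> Just 2
  _                                      -> Nothing)

-- Longest match wins; on a tie the rule listed first wins.
pick :: [Rule] -> String -> Maybe String
pick rules s =
  fst <$> foldl better Nothing
            [ (name, n) | (name, match) <- rules, Just n <- [match s] ]
  where
    better Nothing c = Just c
    better (Just best) c
      | snd c > snd best = Just c      -- only a strictly longer match replaces
      | otherwise        = Just best   -- ties keep the earlier rule

main :: IO ()
main = do
  print (pick [digitsRule, byteRule] "11")   -- tie on length 2
  print (pick [byteRule, digitsRule] "11")   -- same tie, other order
  print (pick [byteRule, digitsRule] "123")  -- $digit+ matches more characters
```

For the input 11 both rules match two characters, so whichever is listed first is chosen. No ordering of these two rules alone can serve both contexts, which is why the ambiguous case gets its own token.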

Then in the parser we can decide based on the context:

%token

  cnst                           { TNum $$ }
  byte                           { TByte $$ }
  numbyte                        { TNumByte $$ }  -- 🖘 can be int or byte
  '['                            { TOSB }
  ']'                            { TCSB }

%%

Expr
  : '[' byte ']' { $2 }
  | '[' numbyte ']' { encodeUtf8(pack $2) }  -- 🖘 interpret as byte
  | cnst { $1 }
  | numbyte { readRational $1 }  -- 🖘 interpret as number
  ;
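To see the idea end to end without running Alex or Happy, here is a minimal hand-written sketch of the same approach in plain Haskell. It is an illustration under simplifying assumptions, not the generated lexer or parser: `tokenize`, `parseExpr` and the `Expr` type are hypothetical, bytes are kept as their raw two-character text instead of a ByteString, and only integer numbers are handled.

```haskell
import Control.Monad ((>=>))
import Data.Char (isDigit, isSpace)

-- Simplified version of the token type from the answer (payloads are assumptions).
data Token
  = TByte String     -- two characters from [a-f0-9], at least one a letter
  | TNum Rational    -- unambiguous number
  | TNumByte String  -- two decimal digits: number or byte, decided later
  | TOSB
  | TCSB
  deriving (Eq, Show)

-- Hypothetical result type for the sketch.
data Expr = EByte String | ENum Rational
  deriving (Eq, Show)

isByteChar :: Char -> Bool
isByteChar c = isDigit c || c `elem` "abcdef"

-- The lexer never guesses: an ambiguous two-digit word becomes TNumByte.
tokenize :: String -> Maybe [Token]
tokenize [] = Just []
tokenize (c:cs)
  | isSpace c = tokenize cs
  | c == '['  = (TOSB :) <$> tokenize cs
  | c == ']'  = (TCSB :) <$> tokenize cs
tokenize s =
  case span isByteChar s of
    ("", _) -> Nothing
    (w, rest)
      | length w == 2 && all isDigit w -> (TNumByte w :) <$> tokenize rest
      | length w == 2                  -> (TByte w :) <$> tokenize rest
      | all isDigit w                  -> (TNum (fromInteger (read w)) :) <$> tokenize rest
      | otherwise                      -> Nothing

-- The parser resolves TNumByte from the surrounding tokens.
parseExpr :: [Token] -> Maybe Expr
parseExpr [TOSB, TByte s, TCSB]    = Just (EByte s)
parseExpr [TOSB, TNumByte s, TCSB] = Just (EByte s)                     -- in brackets: byte
parseExpr [TNum n]                 = Just (ENum n)
parseExpr [TNumByte s]             = Just (ENum (fromInteger (read s))) -- bare: number
parseExpr _                        = Nothing

main :: IO ()
main = mapM_ (print . (tokenize >=> parseExpr)) ["[ 11 ]", "11", "[ a1 ]", "123"]
```

With this arrangement both [ 11 ] and 11 are accepted, because the decision between byte and number is made where the context is known.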
