Developing a parser with pegen: no output

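I want to write a parser for a pre-existing data storage file type. A formal grammar exists, and I was able to follow pegen's grammar guide to create a grammar file and compile it into a parser.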

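My problem is that the parser produces no output because (at least I think this is the problem) I don't know how to set the correct return types in the grammar file. The examples in the data folder on GitHub weren't much help.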

How do I create the correct return types?

My grammar file:

# Basic CIF structure
start: Comments? WhiteSpace? ( DataBlock ( WhiteSpace DataBlock )* ( WhiteSpace )? )?
DataBlock: DataBlockHeading ( WhiteSpace ( DataItems | SaveFrame ) )*
DataBlockHeading: DATA_ ( NonBlankChar )+
SaveFrame: SaveFrameHeading ( WhiteSpace DataItems )+ WhiteSpace SAVE_
SaveFrameHeading: SAVE_ ( NonBlankChar )+
DataItems: Tag WhiteSpace Value | LoopHeader LoopBody
LoopHeader: LOOP_ ( WhiteSpace Tag )+
LoopBody: Value ( WhiteSpace Value )*

# Reserved words
DATA_: ('D' | 'd') ('A' | 'a') ('T' | 't') ('A' | 'a') '_'
LOOP_: ('L' | 'l') ('O' | 'o') ('O' | 'o') ('P' | 'p') '_'
GLOBAL_: ('G' | 'g') ('L' | 'l') ('O' | 'o') ('B' | 'b') ('A' | 'a') ('L' | 'l') '_'
SAVE_: ('S' | 's') ('A' | 'a') ('V' | 'v') ('E' | 'e') '_'
STOP_:  ('S' | 's') ('T' | 't') ('O' | 'o') ('P' | 'p')'_'

# Tags and values
Tag: '_' ( NonBlankChar)+
Value: ( '.' | '?' | Numeric | CharString | TextField )

# Numeric values
Numeric: ( Number | Number '(' UnsignedInteger ')' )
Number: Integer | Float
Integer: ( '+' | '-' )? UnsignedInteger
Float: ( Integer Exponent | ( ( '+' | '-' )? ( ( Digit )* '.' UnsignedInteger ) | ( ( Digit )+ '.' ) ) ( Exponent )? )
Exponent: ( ('e' | 'E' ) | ( 'e' | 'E' ) ( '+' | '-' ) ) UnsignedInteger
UnsignedInteger: ( Digit )+
Digit: ( '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' )

# Strings and text fields
CharString: UnquotedString | SingleQuotedString | DoubleQuotedString
UnquotedString: EOL_UnquotedString | NOTEOL_UnquotedString
EOL_UnquotedString: EOL OrdinaryChar ( NonBlankChar )*
NOTEOL_UnquotedString: NOTEOL ( OrdinaryChar | ';' ) ( NonBlankChar )*
SingleQuotedString: single_quote ( AnyPrintChar )* single_quote WhiteSpace
DoubleQuotedString: double_quote ( AnyPrintChar )* double_quote WhiteSpace
TextField: ( SemiColonTextField )
SemiColonTextField: EOL ';' ( ( AnyPrintChar )* EOL ( ( TextLeadChar ( AnyPrintChar )* )? EOL )* ) ';'

# Whitespace and comments
WhiteSpace: ( SP | HT | EOL | TokenizedComments )+
Comments: ( '#' ( AnyPrintChar )* EOL )+
TokenizedComments: ( SP | HT | EOL )+ Comments

# Character sets
OrdinaryChar: ( '!' | '%' | '&' | '(' | ')' | '*' | '+' | ',' | '-' | '.' | '/' | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | ':' | '<' | '=' | '>' | '?' | '@' | 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I' | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R' | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z' | '\' | '^' | '`' | 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z' | '{' | '|' | '}' | '~' )
NonBlankChar: ( OrdinaryChar | double_quote | '#' | '$' | single_quote | '_' | ';' | '[' | ']' )
TextLeadChar: ( OrdinaryChar | double_quote | '#' | '$' | single_quote | '_' | SP | HT | '[' | ']' )
AnyPrintChar: ( OrdinaryChar | double_quote | '#' | '$' | single_quote | '_' | SP | HT | ';' | '[' | ']' )


# Special things
EOL: NEWLINE #( '\n' | '\n\r' )
NOTEOL: !EOL
SP: ' '
HT: '\t'
double_quote: '"'
single_quote: '\''

The test file I want to parse (header_only.cif):

data_header

How I generated the parser:

python -m pegen cif.gram -o parser.py

How I use the parser:

python parser.py -vv header_only.cif

My output:

start() ... (looking at 1.0: NAME:'data_header')
  Comments() ... (looking at 1.0: NAME:'data_header')
    _loop1_42() ... (looking at 1.0: NAME:'data_header')
      _tmp_58() ... (looking at 1.0: NAME:'data_header')
        expect('#') ... (looking at 1.0: NAME:'data_header')
        ... expect('#') -> None
      ... _tmp_58() -> None
    ... _loop1_42() -> []
  ... Comments() -> None
  WhiteSpace() ... (looking at 1.0: NAME:'data_header')
    _loop1_41() ... (looking at 1.0: NAME:'data_header')
      _tmp_57() ... (looking at 1.0: NAME:'data_header')
        SP() ... (looking at 1.0: NAME:'data_header')
          expect(' ') ... (looking at 1.0: NAME:'data_header')
          ... expect(' ') -> None
        ... SP() -> None
        HT() ... (looking at 1.0: NAME:'data_header')
          expect('\t') ... (looking at 1.0: NAME:'data_header')
          ... expect('\t') -> None
        ... HT() -> None
        EOL() ... (looking at 1.0: NAME:'data_header')
          expect('NEWLINE') ... (looking at 1.0: NAME:'data_header')
          ... expect('NEWLINE') -> None
        ... EOL() -> None
        TokenizedComments() ... (looking at 1.0: NAME:'data_header')
          _loop1_43() ... (looking at 1.0: NAME:'data_header')
            _tmp_59() ... (looking at 1.0: NAME:'data_header')
              SP() -> None
              HT() -> None
              EOL() -> None
            ... _tmp_59() -> None
          ... _loop1_43() -> []
        ... TokenizedComments() -> None
      ... _tmp_57() -> None
    ... _loop1_41() -> []
  ... WhiteSpace() -> None
  _tmp_1() ... (looking at 1.0: NAME:'data_header')
    DataBlock() ... (looking at 1.0: NAME:'data_header')
      DataBlockHeading() ... (looking at 1.0: NAME:'data_header')
        DATA_() ... (looking at 1.0: NAME:'data_header')
          _tmp_8() ... (looking at 1.0: NAME:'data_header')
            expect('D') ... (looking at 1.0: NAME:'data_header')
            ... expect('D') -> None
            expect('d') ... (looking at 1.0: NAME:'data_header')
            ... expect('d') -> None
          ... _tmp_8() -> None
        ... DATA_() -> None
      ... DataBlockHeading() -> None
    ... DataBlock() -> None
  ... _tmp_1() -> None
... start() -> [None, None, None]
[None, None, None]
Total time: 0.031 sec; 1 lines (13 bytes); 32 lines/sec
Caches sizes:
  token array :          1
        cache :         24

Pegen generates parsers for "Python-like" languages. As far as I know, it is not intended to be a general-purpose parser generator.

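In particular, it assumes that the lexical structure of the language being parsed is very similar to Python's, so that the same tokenizer can be used. That does not appear to be the case for the language you want to parse. Your language has no equivalent of the NAME token, which the Python tokenizer automatically produces when it sees the input data_header, and that is why the parse fails. The trace shows this: every expect(...) call is looking at the single token NAME:'data_header', so single-character literals such as 'D' or 'd' can never match it.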
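To see what the parser is actually given, you can run the same input through the standard-library `tokenize` module. This is only an illustrative sketch; the helper name `python_tokens` is made up for this example and is not part of pegen.

```python
import io
import tokenize


def python_tokens(source):
    """Return (token type name, token text) pairs from Python's tokenizer."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


# The contents of header_only.cif from the question.
tokens = python_tokens("data_header\n")
for name, text in tokens:
    print(f"{name:10} {text!r}")
```

The first token printed is a single NAME covering the whole of data_header. The grammar never sees individual characters, so rules written character by character have nothing to match against.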

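Pegen does let you define keywords, which are specific instances of NAME, but as far as I know it has no way to specify case-independent keywords. It also has no mechanism for recognizing a class of names that start with a prefix such as "data_". Both tasks are easy to do with regular expressions.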
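As a rough illustration of how little is needed with regular expressions, the sketch below recognizes a case-insensitive data_ heading and extracts the block name. The pattern approximates the grammar's NonBlankChar with `\S`, which is an assumption made for brevity, and `block_name` is a hypothetical helper.

```python
import re

# "data_" in any letter case, followed by one or more non-blank characters.
# \S is a simplification of the grammar's NonBlankChar character set.
DATA_HEADING = re.compile(r"data_(\S+)", re.IGNORECASE)


def block_name(token):
    """Return the name after the data_ prefix, or None if it is not a heading."""
    match = DATA_HEADING.fullmatch(token)
    return match.group(1) if match else None


print(block_name("data_header"))     # header
print(block_name("DATA_Header"))     # Header
print(block_name("_cell_length_a"))  # None
```

The same approach covers loop_, save_, and the other reserved words, replacing rules such as ('D' | 'd') ('A' | 'a') ... with one flag.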

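Python has a large number of parser generators, and the vast majority allow custom tokenizers based on regular expressions, which is much more convenient than spelling out long lists of individual characters. You may find one of them better suited to your purpose. As far as I can tell, your language can be parsed with a simple top-down predictive parser (LL(1) or "recursive descent"), so any general-purpose parser generator should work, even a PEG generator.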
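For a sense of scale, here is a hand-written sketch of a regex tokenizer plus a recursive-descent parser for a small subset of the format. It is not a complete CIF parser: save frames and semicolon text fields are left out, quoted strings are simplified, and every name in it (`TOKEN_SPEC`, `tokenize_cif`, `parse_cif`) is invented for this example.

```python
import re

# Token patterns, tried in order. Simplified relative to the full grammar.
TOKEN_SPEC = [
    ("COMMENT", r"#[^\n]*"),
    ("DATA", r"data_\S+"),
    ("LOOP", r"loop_(?!\S)"),
    ("TAG", r"_\S+"),
    ("SQUOTED", r"'[^'\n]*'"),
    ("DQUOTED", r'"[^"\n]*"'),
    ("BARE", r"[^\s'\"#_]\S*"),
    ("WS", r"\s+"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.IGNORECASE,
)
VALUE_KINDS = ("SQUOTED", "DQUOTED", "BARE")


def tokenize_cif(text):
    """Split text into (kind, text) pairs, dropping whitespace and comments."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup not in ("WS", "COMMENT"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_cif(text):
    """Return {block name: {tag: value, or list of values for loop columns}}."""
    tokens = tokenize_cif(text)
    pos = 0

    def peek():
        # One token of lookahead is enough to pick every alternative.
        return tokens[pos][0] if pos < len(tokens) else "EOF"

    def take(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        pos += 1
        return tokens[pos - 1][1]

    def value():
        if peek() in ("SQUOTED", "DQUOTED"):
            return take(peek())[1:-1]  # strip the surrounding quotes
        return take("BARE")

    def loop(items):
        take("LOOP")
        tags = [take("TAG")]
        while peek() == "TAG":
            tags.append(take("TAG"))
        values = [value()]
        while peek() in VALUE_KINDS:
            values.append(value())
        if len(values) % len(tags):
            raise SyntaxError("loop values do not fill whole rows")
        for index, tag in enumerate(tags):
            items[tag] = values[index::len(tags)]

    blocks = {}
    while peek() != "EOF":
        name = take("DATA")[len("data_"):]
        items = blocks.setdefault(name, {})
        while peek() in ("TAG", "LOOP"):
            if peek() == "LOOP":
                loop(items)
            else:
                tag = take("TAG")
                items[tag] = value()
    return blocks


print(parse_cif("data_header\n"))  # {'header': {}}
```

Each parsing function decides what to do by looking only at the kind of the next token, which is what makes this a predictive (LL(1)) parser. A parser generator with a regex-based lexer would let you express the same structure declaratively.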