How to edit parse trees before the abstract syntax tree?

Question

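I would like to understand how to use the stdlib parser module effectively, since ast.parse sometimes loses too much information (it eats whitespace, comments, extra parentheses, etc.; details that matter to a source code formatter, for example).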

>>> parser.expr('(*x,)').tolist()
[258,
 [332,
  [306,
   [310,
    [311,
     [312,
      [313,
       [316,
        [317,
         [318,
          [319,
           [320,
            [321,
             [322,
              [323,
               [324,
                [325,
                 [7, '('],
                 [326,
                  [315,
                   [16, '*'],
                   [316,
                    [317,
                     [318,
                      [319,
                       [320, [321, [322, [323, [324, [325, [1, 'x']]]]]]]]]]]],
                  [12, ',']],
                 [8, ')']]]]]]]]]]]]]]]]],
 [4, ''],
 [0, '']]

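What are all those numbers about, and how do they relate to the grammar? How should you interpret the structure and nesting of this parse tree? Is there a way to pretty-print it with indentation and symbol names instead of codes?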

Answer 1

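You ask one question in the title and somewhat different questions in the body. This answer mainly addresses the questions in the body, since I'm not sure which kind of pre-AST transformation you are looking for.

I suspect that the answer to your title question is a lot more complicated. For example, the parser module doesn't preserve formatting or comments either (although it does identify the line/column numbers of tokens, from which you could derive the horizontal formatting of non-comment lines if you wanted to). If you want to write a colourising viewer, you'll want to use the tokenize module. That module doesn't parse, so if you also need a parse tree you'll have to use either parser or ast and then correlate the tokens with the token stream returned by tokenize.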
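As a rough illustration of what tokenize keeps that a parse tree drops, here is a minimal sketch. The sample source line and the variable names are made up for the example; only `tokenize.generate_tokens`, `tokenize.COMMENT` and `token.tok_name` come from the standard library.

```python
import io
import token
import tokenize

# Made-up sample line with irregular spacing and a trailing comment.
source = "x = (1 +  2)  # add\n"

# generate_tokens() takes a readline callable that returns str lines.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Each token carries its text plus (row, column) start and end positions,
# which is enough to reconstruct the horizontal spacing between tokens.
for tok in tokens:
    print(token.tok_name[tok.type], repr(tok.string), tok.start, tok.end)

# Comments survive as COMMENT tokens, unlike in ast or parser output.
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]
numbers = [(tok.string, tok.start) for tok in tokens if tok.type == tokenize.NUMBER]
```

Matching these positions against the positions recorded in a parse tree is one possible way to do the correlation described above; the answer does not prescribe a specific method.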

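What are all those numbers about and how do they relate to the grammar?

They identify either non-terminals (grammar productions) or terminals (tokens). The numeric ranges of the two categories don't overlap, so there's no confusion.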
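A small sketch of that split, using some of the codes from the tree in the question. The helper name `describe` is hypothetical, and the threshold of 256 is an assumption based on the value of `token.NT_OFFSET` in CPython; it is written out here as a plain constant.

```python
import token

# Assumption: terminal (token) codes are below 256 and non-terminal
# (grammar symbol) codes start at 256, matching token.NT_OFFSET.
NT_OFFSET = 256

def describe(code):
    """Label a node code as a terminal (with its token name) or a non-terminal."""
    if code < NT_OFFSET:
        return f"{code}: terminal {token.tok_name[code]}"
    return f"{code}: non-terminal"

# A few codes taken from the parse tree of '(*x,)' in the question.
for code in [258, 332, 7, 16, 1, 12, 8, 4, 0]:
    print(describe(code))
```

The names of the non-terminals come from the symbol module, which is not used here so that the sketch does not depend on it.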

How should you interpret the structure and nesting of this parse tree?

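It represents the complete, unedited (as far as I know) syntax tree, starting from the root production (either eval_input or file_input, depending on whether you called parser.expr or parser.suite). Terminal nodes start with the terminal index number, followed by the textual representation of the token, followed by location information if requested. Non-terminal nodes start with the non-terminal index number, followed by the child nodes. (Apparently there is always at least one child node; the Python grammar has no nullable non-terminals.)

Note that most Python syntax trees are very deeply nested, because the grammar has a lot of unit productions. Part of what the ast module does is to condense chains of unit productions.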
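To make that node layout concrete, here is a minimal sketch that walks such a nested list and collects the terminals in order. The tree literal is a hand-abbreviated version of the '(*x,)' tree from the question with most of the single-child chains removed, and the function name `leaves` is hypothetical.

```python
# Assumption: codes below 256 are terminals, as in CPython's token module.
NT_OFFSET = 256

# Hand-abbreviated tree for '(*x,)'; the real output nests far deeper.
tree = [258,
        [325,
         [7, '('],
         [326,
          [315, [16, '*'], [325, [1, 'x']]],
          [12, ',']],
         [8, ')']],
        [4, ''],
        [0, '']]

def leaves(node):
    """Yield (code, text) for every terminal node, left to right."""
    code, *rest = node
    if code < NT_OFFSET:
        # Terminal: rest[0] is the token text; position info may follow.
        yield code, rest[0]
    else:
        # Non-terminal: everything after the code is a child node.
        for child in rest:
            yield from leaves(child)

text = "".join(tok_text for _, tok_text in leaves(tree))
print(text)
```

Joining the terminal texts gives back the token sequence of the expression, without any of the original whitespace.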
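The idea of condensing unit-production chains can be sketched on a symbolic tree as follows. This only illustrates the concept and is not how the ast module is implemented; the helper names and the abbreviated sample tree are made up for the example.

```python
def is_leaf(node):
    """A terminal node has a string (the token text) right after its head."""
    return isinstance(node[1], str)

def collapse(node):
    """Drop every non-terminal whose only child is another non-terminal."""
    if is_leaf(node):
        return node
    children = [collapse(child) for child in node[1:]]
    if len(children) == 1 and not is_leaf(children[0]):
        return children[0]  # unit production: keep only the child
    return [node[0], *children]

# Abbreviated symbolic tree for "a if True else b".
tree = ['test',
        ['or_test', ['and_test', ['atom', ['NAME', 'a']]]],
        ['NAME', 'if'],
        ['or_test', ['atom', ['NAME', 'True']]],
        ['NAME', 'else'],
        ['test', ['or_test', ['atom', ['NAME', 'b']]]]]

collapsed = collapse(tree)
print(collapsed)
```

Only the nodes with more than one child, plus the innermost non-terminal above each token, survive, which makes the shape of the conditional expression much easier to see.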

Is there a way to pretty-print it with indentation and symbol names instead of codes?

Sure:

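import parser  # deprecated in Python 3.9, removed in 3.10
from pprint import pprint
import symbol  # deprecated in Python 3.9, removed in 3.10
import token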

def symbolic(root):
  if root[0] in token.tok_name:
    return [token.tok_name[root[0]], *root[1:]]
  elif root[0] in symbol.sym_name:
    return [symbol.sym_name[root[0]], *map(symbolic, root[1:])]
  else:
    # Not optimal since it doesn't symbolise children, if any.
    # But it should never happen, anyway.
    return root

>>> pprint(symbolic(parser.expr("a if True else b").tolist()))
['eval_input',
 ['testlist',
  ['test',
   ['or_test',
    ['and_test',
     ['not_test',
      ['comparison',
       ['expr',
        ['xor_expr',
         ['and_expr',
          ['shift_expr',
           ['arith_expr',
            ['term',
             ['factor',
              ['power', ['atom_expr', ['atom', ['NAME', 'a']]]]]]]]]]]]]]],
   ['NAME', 'if'],
   ['or_test',
    ['and_test',
     ['not_test',
      ['comparison',
       ['expr',
        ['xor_expr',
         ['and_expr',
          ['shift_expr',
           ['arith_expr',
            ['term',
             ['factor',
              ['power', ['atom_expr', ['atom', ['NAME', 'True']]]]]]]]]]]]]]],
   ['NAME', 'else'],
   ['test',
    ['or_test',
     ['and_test',
      ['not_test',
       ['comparison',
        ['expr',
         ['xor_expr',
          ['and_expr',
           ['shift_expr',
            ['arith_expr',
             ['term',
              ['factor',
               ['power', ['atom_expr', ['atom', ['NAME', 'b']]]]]]]]]]]]]]]]]],
 ['NEWLINE', ''],
 ['ENDMARKER', '']]
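pprint shows the nesting with brackets. If you prefer plain indentation, a small recursive printer over the symbolic list is enough. This is a minimal sketch; the function name `indented` is hypothetical, and the sample node is the innermost part of the output above.

```python
def indented(node, depth=0):
    """Yield one line per node, indented two spaces per nesting level."""
    head, *rest = node
    pad = "  " * depth
    if rest and isinstance(rest[0], str):
        # Terminal node: show the token name and its text.
        yield f"{pad}{head} {rest[0]!r}"
    else:
        # Non-terminal node: show the symbol name, then its children.
        yield f"{pad}{head}"
        for child in rest:
            yield from indented(child, depth + 1)

sample = ['power', ['atom_expr', ['atom', ['NAME', 'a']]]]
lines = list(indented(sample))
print("\n".join(lines))
```

It works on the result of symbolic() as well as on the raw numeric lists, since it only looks at whether the second element of a node is a string.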

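As an example of how this tree lines up with the grammar, the subtree headed by test on the third line of the output above corresponds to the grammar production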

test: or_test ['if' or_test 'else' test]
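One way to read that production against a symbolic node is to compare the heads of its children with the two shapes the production allows. The sketch below is hypothetical, uses an abbreviated tree, and only covers the alternative quoted above.

```python
def is_conditional(node):
    """True if a symbolic 'test' node uses the optional 'if ... else ...' part."""
    if node[0] != 'test':
        return False
    children = node[1:]
    heads = [child[0] for child in children]
    if heads != ['or_test', 'NAME', 'or_test', 'NAME', 'test']:
        return False
    # The two NAME tokens must be the keywords from the production.
    return children[1][1] == 'if' and children[3][1] == 'else'

# Abbreviated subtree for "a if True else b".
outer = ['test',
         ['or_test', ['NAME', 'a']],
         ['NAME', 'if'],
         ['or_test', ['NAME', 'True']],
         ['NAME', 'else'],
         ['test', ['or_test', ['NAME', 'b']]]]
inner = outer[5]

print(is_conditional(outer), is_conditional(inner))
```

The outer node has all five children of the bracketed form, while the nested test after 'else' has only the mandatory or_test child.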

Here is an example of a file_input parse:

>>> pprint(symbolic(parser.suite("import sys").tolist()))
['file_input',
 ['stmt',
  ['simple_stmt',
   ['small_stmt',
    ['import_stmt',
     ['import_name',
      ['NAME', 'import'],
      ['dotted_as_names',
       ['dotted_as_name', ['dotted_name', ['NAME', 'sys']]]]]]],
   ['NEWLINE', '']]],
 ['NEWLINE', ''],
 ['ENDMARKER', '']]

python

grammar

parsing

tokenize
