How to get a syntax tree with comments?

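I'm trying to create a documentation generator for several languages. For this, I need an AST, so that I can tell, for instance, that this comment belongs to a class and that one belongs to a method of this class.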

I started by writing this simple Python code, which displays the tree by walking it recursively:

import sys
import antlr4
from ECMAScriptLexer import ECMAScriptLexer
from ECMAScriptParser import ECMAScriptParser

def handleTree(tree, lvl=0):
    # Print every terminal node, indented by its depth in the parse tree.
    for child in tree.getChildren():
        if isinstance(child, antlr4.tree.Tree.TerminalNode):
            print(lvl*'│ ' + '└─', child)
        else:
            handleTree(child, lvl+1)

# Lex and parse the file given on the command line.
input_stream = antlr4.FileStream(sys.argv[1])
lexer = ECMAScriptLexer(input_stream)
stream = antlr4.CommonTokenStream(lexer)
parser = ECMAScriptParser(stream)
tree = parser.program()
handleTree(tree)

I then tried to parse this JavaScript code with the antlr ECMAScript grammar:

var i = 52; // inline comment

function foo() {
  /** The foo documentation */
  console.log('hey');
}

This outputs:

│ │ │ │ └─ var
│ │ │ │ │ │ └─ i
│ │ │ │ │ │ │ └─ =
│ │ │ │ │ │ │ │ │ │ └─ 52
│ │ │ │ │ └─ ;
│ │ │ └─ function
│ │ │ └─ foo
│ │ │ └─ (
│ │ │ └─ )
│ │ │ └─ {
│ │ │ │ │ │ │ │ │ │ │ │ └─ console
│ │ │ │ │ │ │ │ │ │ │ └─ .
│ │ │ │ │ │ │ │ │ │ │ │ └─ log
│ │ │ │ │ │ │ │ │ │ │ └─ (
│ │ │ │ │ │ │ │ │ │ │ │ │ │ └─ 'hey'
│ │ │ │ │ │ │ │ │ │ │ └─ )
│ │ │ │ │ │ │ │ │ └─ ;
│ │ │ └─ }
└─ <EOF>

All the comments are ignored, probably because of the channel(HIDDEN) in the grammar.

After some googling, I found this answer:

Unless you have a very compelling reason to put the comment inside the parser (which I'd like to hear), you should put it in the lexer.

So, why shouldn't comments be included in the parser, and how can I get a tree that includes comments?

If you remove -> channel(HIDDEN) from the MultiLineComment rule:

MultiLineComment
 : '/*' .*? '*/' -> channel(HIDDEN)
 ;

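then MultiLineComment tokens will end up in the parser. However, every one of your parser rules would then need to include these tokens wherever they are allowed.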

For example, take the arrayLiteral parser rule:

/// ArrayLiteral :
///     [ Elision? ]
///     [ ElementList ]
///     [ ElementList , Elision? ]
arrayLiteral
 : '[' elementList? ','? elision? ']'
 ;

Since this is a valid array literal in JavaScript:

[/* ... */ 1, 2 /* ... */ , 3 /* ... */ /* ... */]

it means you would need to litter all your parser rules with MultiLineComment tokens, like this:

/// ArrayLiteral :
///     [ Elision? ]
///     [ ElementList ]
///     [ ElementList , Elision? ]
arrayLiteral
 : '[' MultiLineComment* elementList? MultiLineComment* ','? MultiLineComment* elision? MultiLineComment* ']'
 ;

That would quickly become one big mess.

EDIT

From the comments:

So it's not possible to generate a tree including comments with antlr? Is there some hacks or other libraries to do this?

GRosenberg's answer:

Antlr provides a convenience method for this task: BufferedTokenStream#getHiddenTokensToLeft. In walking the parse tree, access the stream to obtain the node associated comment, if any. Use BufferedTokenStream#getHiddenTokensToRight to get any trailing comment.
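
Applied to the tree walker from the question, that suggestion could look roughly like the sketch below. It assumes the Python runtime's CommonTokenStream exposes getHiddenTokensToLeft the same way the Java BufferedTokenStream does, and that it returns None when nothing is hidden. The names comment_texts and tree_lines and the "[comment]" label are made up for illustration. Terminal nodes are detected by duck typing (anything with getSymbol), and blank hidden tokens are dropped as a heuristic, since whitespace and line terminators sit on the hidden channel too.

```python
def comment_texts(hidden_tokens):
    # Assumption: the stream returns None rather than an empty list when there
    # are no hidden tokens. Whitespace and line terminators are hidden as well,
    # so tokens whose text is blank are skipped (a heuristic, not a type check).
    return [t.text for t in (hidden_tokens or []) if t.text.strip()]


def tree_lines(tree, stream, lvl=0):
    """Yield the question's tree output, with each comment before its token."""
    for child in tree.getChildren():
        if hasattr(child, "getSymbol"):  # duck-typed check for a terminal node
            index = child.getSymbol().tokenIndex
            # Tokens conjured up during error recovery have index -1; skip the
            # lookup for those, since they have no position in the stream.
            if index >= 0:
                for text in comment_texts(stream.getHiddenTokensToLeft(index)):
                    yield lvl * '│ ' + '└─ [comment] ' + text
            yield lvl * '│ ' + '└─ ' + str(child)
        else:
            yield from tree_lines(child, stream, lvl + 1)
```

With the script from the question, this would replace the handleTree(tree) call with: for line in tree_lines(tree, stream): print(line). A trailing comment such as the // inline comment shows up as the leading comment of the next token (function); calling getHiddenTokensToRight on the ; token instead would attach it to the statement it follows.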