How to convert a regular grammar with recursion and alternation into a regular expression

Question

A grammar is regular if it is either right-linear or left-linear. This tutorial claims that, because of this, a regular grammar has a special property:

A regular grammar has a special property: by substituting every nonterminal (except the root one) with its righthand side, you can reduce it down to a single production for the root, with only terminals and operators on the right-hand side... The reduced expression of terminals and operators can be written in an even more compact form, called a regular expression

So I decided to test this idea and convert the regular ECMAScript grammar for IdentifierName into a regular expression:

IdentifierName ::
    IdentifierStart
    IdentifierName  IdentifierPart

Assume that IdentifierStart and IdentifierPart are limited to the following:

IdentifierStart ::       IdentifierPart ::
    A                        A                 
    B                        C
    C                        &
    $                    
    _

But I'm not sure how to proceed, because the grammar for IdentifierName has both recursion and alternation. Any help?

I'm more interested in the method than in the resulting regular expression, which, as @Bergi showed, is [ABC$_][AC&]*.

Answer 1

The tutorial uses some non-standard (and, surprisingly, implicit) definitions.

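First, they use repetition operators in their grammars, as they might appear in a regular expression or in EBNF. They then implicitly define a regular grammar as one that uses only those repetition operators and no recursion. Given that definition, a "regular grammar" can be converted to a regular expression simply by inlining all non-terminals. By that definition, however, the identifier grammar from the JS specification is not regular, because it contains recursion. So before you can inline everything, you first need to replace the recursion with a repetition operator.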
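As a minimal sketch of that step, assuming the restricted symbol sets from the question: a rule of the shape N :: base | N step generates one base followed by zero or more steps, so the recursion can be rewritten as base step*. The helper names `char_class` and `eliminate_left_recursion` are made up for this illustration, and Python is used only as a convenient notation.

```python
import re

# Restricted alternatives taken from the question.
identifier_start = ["A", "B", "C", "$", "_"]
identifier_part = ["A", "C", "&"]

def char_class(symbols):
    """Inline a non-terminal whose alternatives are single characters
    as a regex character class (hypothetical helper)."""
    return "[" + "".join(re.escape(s) for s in symbols) + "]"

def eliminate_left_recursion(base, step):
    """Rewrite  N :: base | N step  as  base step*.
    Assumes base and step are atomic regex fragments (e.g. character classes)."""
    return base + step + "*"

# IdentifierName :: IdentifierStart | IdentifierName IdentifierPart
identifier_name = eliminate_left_recursion(
    char_class(identifier_start),
    char_class(identifier_part),
)

print(identifier_name)
print(bool(re.fullmatch(identifier_name, "B&A")))  # starts with a start symbol
print(bool(re.fullmatch(identifier_name, "&A")))   # '&' is not a start symbol
```

Apart from the escaping that `re.escape` adds, the result has the same shape as the [ABC$_][AC&]* expression mentioned in the question.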

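However, that is not the standard definition of a regular grammar. The standard definition is the one you stated: a grammar is regular if it is left-linear or right-linear, that is, if a non-terminal may appear only as the leftmost item of a production, or only as the rightmost item. Repetition operators do not exist in the usual definition of a formal grammar.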
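To make the standard definition concrete, here is a small sketch of a linearity check. It assumes a grammar is encoded as a dict from non-terminal to a list of right-hand sides (each a list of symbols), and that IdentifierStart and IdentifierPart have already been expanded into their terminal alternatives. The encoding and the function names are assumptions for this example only.

```python
def is_left_linear(grammar):
    """True if a non-terminal can only appear as the leftmost symbol."""
    nonterminals = set(grammar)
    return all(
        sym not in nonterminals
        for alternatives in grammar.values()
        for rhs in alternatives
        for sym in rhs[1:]          # everything after the first symbol
    )

def is_right_linear(grammar):
    """True if a non-terminal can only appear as the rightmost symbol."""
    nonterminals = set(grammar)
    return all(
        sym not in nonterminals
        for alternatives in grammar.values()
        for rhs in alternatives
        for sym in rhs[:-1]         # everything before the last symbol
    )

def is_regular(grammar):
    return is_left_linear(grammar) or is_right_linear(grammar)

# IdentifierName with the restricted start/part symbols expanded inline.
identifier_grammar = {
    "IdentifierName":
        [[s] for s in "ABC$_"]
        + [["IdentifierName", p] for p in "AC&"],
}

# A classic non-regular shape for contrast: S -> a S b | (empty)
nested_grammar = {"S": [["a", "S", "b"], []]}

print(is_left_linear(identifier_grammar))   # True
print(is_right_linear(identifier_grammar))  # False
print(is_regular(nested_grammar))           # False
```

Note that this is only a syntactic check on the productions. A grammar that fails it may still describe a regular language.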

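Now, these regular grammars can also be converted to regular expressions, but not by merely applying the method described in the tutorial. One way is to convert the grammar to a finite automaton and then apply the algorithm described in this answer.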
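A rough sketch of that route, using state elimination, one well-known automaton-to-regex technique (it is not necessarily the algorithm from the linked answer). The left-linear grammar becomes an automaton with a fresh initial state q0 and one state per non-terminal: N :: a gives q0 --a--> N, and N :: M a gives M --a--> N. The state names q0, N, S, F and the helper functions are assumptions made for this example, and edge labels are assumed to be atomic regex fragments such as character classes.

```python
import re

def union(a, b):
    """Regex for 'a or b'; None stands for 'no edge'."""
    if a is None:
        return b
    if b is None:
        return a
    return f"(?:{a}|{b})"

def star(a):
    """Regex for zero or more repetitions; a missing or empty loop contributes nothing."""
    return "" if a is None or a == "" else f"(?:{a})*"

def eliminate_states(edges, states, start, accept):
    """edges maps (src, dst) to a regex label ('' is an epsilon edge).
    Removes each state in `states`, rerouting paths through it."""
    edges = dict(edges)
    for q in states:
        loop = star(edges.get((q, q)))
        preds = [(p, r) for (p, d), r in edges.items() if d == q and p != q]
        succs = [(s, r) for (src, s), r in edges.items() if src == q and s != q]
        for p, r_in in preds:
            for s, r_out in succs:
                via = r_in + loop + r_out
                edges[(p, s)] = union(edges.get((p, s)), via)
        # Drop every edge that touches the eliminated state.
        edges = {(a, b): r for (a, b), r in edges.items() if q not in (a, b)}
    return edges.get((start, accept))

# Automaton for IdentifierName :: IdentifierStart | IdentifierName IdentifierPart
edges = {
    ("S", "q0"): "",          # new start state, epsilon into q0
    ("q0", "N"): "[ABC$_]",   # IdentifierName :: IdentifierStart
    ("N", "N"): "[AC&]",      # IdentifierName :: IdentifierName IdentifierPart
    ("N", "F"): "",           # N is accepting: epsilon into new final state
}

regex = eliminate_states(edges, ["q0", "N"], "S", "F")
print(regex)  # → [ABC$_](?:[AC&])*
print(bool(re.fullmatch(regex, "_A&C")))
```

For this grammar the result is the expected expression with one redundant non-capturing group. Larger grammars usually produce much more verbose expressions that need simplifying afterwards.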

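In practice, however, when doing the conversion by hand (rather than writing a program to do it), the easiest and most common approach is to think about which language the grammar describes (in this case "the language of all words that start with an IdentifierStart symbol and then contain 0 or more IdentifierPart symbols") and then come up with a regular expression for that language (a.k.a. the "look really hard at the problem until you see the solution" algorithm).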
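If you go the "look really hard" route, it can be reassuring to cross-check the guess against the grammar. The sketch below, assuming the restricted symbol sets from the question, derives every word up to a small length directly from the two productions and compares that set with what the candidate expression matches. The names `derive` and `matched` and the length bound of 4 are arbitrary choices for this example. Agreement on short words is evidence, not a proof.

```python
import re
from itertools import product

START = "ABC$_"   # IdentifierStart alternatives from the question
PART = "AC&"      # IdentifierPart alternatives from the question
ALPHABET = sorted(set(START + PART))
CANDIDATE = re.compile(r"[ABC$_][AC&]*")

def derive(max_len):
    """All words of length 1..max_len derivable from IdentifierName."""
    words = set(START)        # IdentifierName :: IdentifierStart
    frontier = set(START)
    for _ in range(max_len - 1):
        # IdentifierName :: IdentifierName IdentifierPart
        frontier = {w + p for w in frontier for p in PART}
        words |= frontier
    return words

def matched(max_len):
    """All words of length 0..max_len over the alphabet that the candidate matches."""
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(ALPHABET, repeat=n)
        if CANDIDATE.fullmatch("".join(chars))
    }

agree = derive(4) == matched(4)
print(agree)
```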

javascript

compiler-construction

ecmascript-next