How can I ignore comments in a string based on compiler design?

I want to ignore all comments, such as { comments } and // comments. I have a pointer called peek that goes through my string character by character. I know how to skip newlines, tabs, and spaces, but I don't know how to skip comments.

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''

for i, peek in enumerate(string.lower()):
    if peek == ' ' or peek == '\n':
        tokens.append(tmp)
        # ignoring whitespace and comments
        if len(tmp) > 0:
            print(tmp)

        tmp = ''
    
    else:
        tmp += peek

This is my result:

begin
west
west
north//
comment1
north
north
west
east
east
south
{
comment2
}
end

As you can see, the whitespace is ignored, but the comments are not.

How can I get a result like the following?

begin
west
west
north
north
north
west
east
east
south
end

Just use a global variable skip = False, set it to True when you get { and back to False when you get }, and run the rest of your if/else only when not skip:

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    comment\n
}\n end
"""

tokens = []
tmp = ''
skip = False

for i, peek in enumerate(string.lower()):

    if peek == '{':
        skip = True
    elif peek == '}':
        skip = False
    elif not skip:

        if peek == ' ' or peek == '\n':
            tokens.append(tmp)
            # ignoring whitespace and comments
            if len(tmp) > 0:
                print(tmp)
            tmp = ''
        else:
            tmp += peek

Because you may have nested { { } } comments, like

{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n

it is better to use skip to count the { }:

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n end
"""

tokens = []
tmp = ''
skip = 0

for i, peek in enumerate(string.lower()):

    if peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif not skip:  # elif skip == 0:

        if peek == ' ' or peek == '\n':
            tokens.append(tmp)
            # ignoring whitespace and comments
            if len(tmp) > 0:
                print(tmp)
            tmp = ''
        else:
            tmp += peek
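
Note that both versions above still let the // comments through (the question's north//comment1 ends up inside a token). Here is a minimal sketch of how the same counting loop could also skip // comments to the end of the line, using an extra in_line_comment flag. This is my own extension, not part of the code above, and it assumes a // comment always runs to the next newline:

string =  """  beGIn west   WEST north//comment1 \n
north       north west East east south\n
// comment west\n
{\n
    { comment1 }\n
    comment2\n
    { comment3 }\n
}\n end
"""

tokens = []
tmp = ''
skip = 0                 # nesting depth of { } comments
in_line_comment = False  # inside a // comment until the next newline

text = string.lower()
for i, peek in enumerate(text):
    if in_line_comment:
        if peek == '\n':          # a // comment ends at the end of the line
            in_line_comment = False
    elif peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif skip == 0:
        if peek == '/' and text[i:i + 2] == '//':
            in_line_comment = True
            if len(tmp) > 0:      # flush the token collected before '//'
                tokens.append(tmp)
                print(tmp)
            tmp = ''
        elif peek == ' ' or peek == '\n':
            if len(tmp) > 0:
                tokens.append(tmp)
                print(tmp)
            tmp = ''
        else:
            tmp += peek

if len(tmp) > 0:                  # flush a trailing token, if any
    tokens.append(tmp)
    print(tmp)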

But maybe it would be better to turn everything into tokens first and then filter the tokens afterwards. I'll skip that idea here, though.
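
Still, for what it's worth, here is a rough sketch of that tokenize-then-filter idea with the re module: one pass produces every token, comments included, and a second step throws the comment tokens away. The group names COMMENT and NAME are just for illustration, and the non-greedy { ... } pattern does not handle nested braces:

import re

# One pass that produces *all* tokens, comments included ...
TOKEN_RE = re.compile(
    r'(?P<COMMENT>//[^\n]*|\{(?:.|\n)*?\})'   # // to end of line, or { ... } (non-greedy, no nesting)
    r'|(?P<NAME>[a-zA-Z_][a-zA-Z0-9_]*)'
)

text = """  beGIn west   WEST north//comment1
north       north west East east south
// comment west
{
    comment
}
 end
"""

all_tokens = [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

# ... and then the filtering step: keep only the NAME tokens
names = [value for kind, value in all_tokens if kind == 'NAME']
print(names)
# ['beGIn', 'west', 'WEST', 'north', 'north', 'north', 'west', 'East', 'east', 'south', 'end']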


EDIT:

A version using the Python module sly, which is similar to the C/C++ tools lex/yacc.

I found the regex for MULTI_LINE_COMMENT in another tool for building parsers, lark:

syntax for multiline comments

from sly import Lexer, Parser

class MyLexer(Lexer):
    # Declare the token set before defining the regex for each token
    tokens = { NAME, ONE_LINE_COMMENT, MULTI_LINE_COMMENT }

    ignore = ' \t'

    # Tokens
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ONE_LINE_COMMENT = r'//.*'
    MULTI_LINE_COMMENT = r'{(.|\n)*}'

    # Ignored pattern
    ignore_newline = r'\n+'

    # Extra action for newlines
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    # Handle illegal characters
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

if __name__ == '__main__':
    
    text =  """  beGIn west   WEST north//comment1 
north       north west East east south
// comment west
{
    { comment1 }
    comment2
    { comment3 }
}
 end
"""
    
    lexer = MyLexer()
    tokens = lexer.tokenize(text)
    for item in tokens:
        print(item.type, ':', item.value)

Result:

NAME : beGIn
NAME : west
NAME : WEST
NAME : north
ONE_LINE_COMMENT : //comment1 
NAME : north
NAME : north
NAME : west
NAME : East
NAME : east
NAME : south
ONE_LINE_COMMENT : // comment west
MULTI_LINE_COMMENT : {
    { comment1 }
    comment2
    { comment3 }
}
NAME : end
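
For comparison, roughly the same lexer written for lark, where both comment terminals can simply be dropped with %ignore instead of being emitted. This is only a sketch based on lark's %ignore directive; the terminal names and regexes mirror the sly version above, not the linked answer:

from lark import Lark

# MULTI_LINE_COMMENT is greedy (first { to last }), like the sly pattern above
grammar = r"""
    start: NAME*

    NAME: /[a-zA-Z_][a-zA-Z0-9_]*/
    ONE_LINE_COMMENT: /\/\/[^\n]*/
    MULTI_LINE_COMMENT: /\{(.|\n)*\}/
    WS: /[ \t\n]+/

    %ignore ONE_LINE_COMMENT
    %ignore MULTI_LINE_COMMENT
    %ignore WS
"""

text = """  beGIn west   WEST north//comment1 
north       north west East east south
// comment west
{
    { comment1 }
    comment2
    { comment3 }
}
 end
"""

parser = Lark(grammar, parser='lalr')
tree = parser.parse(text)

# the tree's children are the NAME tokens; the comments never show up
print([token.value for token in tree.children])
# ['beGIn', 'west', 'WEST', 'north', 'north', 'north', 'west', 'East', 'east', 'south', 'end']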