How can I ignore comments in a string based on compiler design?
I want to ignore all comments, such as { comments } and // comments.
I have a pointer called peek that examines my string character by character. I know how to skip newlines, tabs, and spaces, but I don't know how to skip comments.
string = """ beGIn west WEST north//comment1 \n
north north west East east south\n
// comment west\n
{\n
comment\n
}\n end
"""
tokens = []
tmp = ''
for i, peek in enumerate(string.lower()):
if peek == ' ' or peek == '\n':
tokens.append(tmp)
# ignoing WS's and comments
if(len(tmp)>0):
print(tmp)
tmp = ''
else:
tmp += peek
This is my result:
begin
west
west
north//
comment1
north
north
west
east
east
south
{
comment2
}
end
As you can see, the whitespace is ignored but the comments are not. How can I get a result like the following?
begin
west
west
north
north
north
west
east
east
south
end
Just use a global variable skip = False, set it to True when you get { and back to False when you get }, and run the rest of your if/else only under if not skip:
string = """ beGIn west WEST north//comment1 \n
north north west East east south\n
// comment west\n
{\n
comment\n
}\n end
"""
tokens = []
tmp = ''
skip = False
for i, peek in enumerate(string.lower()):
if peek == '{':
skip = True
elif peek == '}':
skip = False
elif not skip:
if peek == ' ' or peek == '\n':
tokens.append(tmp)
# ignoing WS's and comments
if(len(tmp)>0):
print(tmp)
tmp = ''
else:
tmp += peek
Because you may have nested { { } }, like

{\n
{ comment1 }\n
comment2\n
{ comment3 }\n
}\n

it is better to use skip to count { and }:
string = """ beGIn west WEST north//comment1 \n
north north west East east south\n
// comment west\n
{\n
{ comment1 }\n
comment2\n
{ comment3 }\n
}\n end
"""
tokens = []
tmp = ''
skip = 0
for i, peek in enumerate(string.lower()):
if peek == '{':
skip += 1
elif peek == '}':
skip -= 1
elif not skip: # elif skip == 0:
if peek == ' ' or peek == '\n':
tokens.append(tmp)
# ignoing WS's and comments
if(len(tmp)>0):
print(tmp)
tmp = ''
else:
tmp += peek
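Note that both loops above only strip the { } comments; the // comments from the question still slip through into the tokens. One possible way to extend the same character-by-character loop is a two-character lookahead for // that skips to the end of the line. This part is my own sketch, not from the original answer:

string = """ beGIn west WEST north//comment1 \n
north north west East east south\n
// comment west\n
{\n
comment\n
}\n end
"""
tokens = []
tmp = ''
skip = 0              # nesting depth of { } comments
line_comment = False  # currently inside a // comment?
src = string.lower()
for i, peek in enumerate(src):
    if line_comment:
        if peek == '\n':
            line_comment = False  # a // comment ends at the end of the line
        continue
    if peek == '{':
        skip += 1
    elif peek == '}':
        skip -= 1
    elif skip:
        pass  # inside a { } comment: drop every character
    elif peek == '/' and src[i:i + 2] == '//':
        # entering a // comment; flush the token collected so far
        if tmp:
            tokens.append(tmp)
            print(tmp)
            tmp = ''
        line_comment = True
    elif peek == ' ' or peek == '\n':
        # unlike the original, only non-empty tokens are kept
        if tmp:
            tokens.append(tmp)
            print(tmp)
        tmp = ''
    else:
        tmp += peek

This prints exactly the token list asked for in the question: begin west west north north north west east east south end.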
But maybe it would be better to turn everything into tokens first and filter the tokens afterwards. I skipped that idea here, though a rough sketch of it is included below.
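For completeness, here is a rough sketch of that tokenize-then-filter idea using the re module. It is my own illustration, not part of the original answer, and the non-greedy {.*?} pattern does not handle nested braces (that still needs the counting approach above):

import re

string = """ beGIn west WEST north//comment1 \n
north north west East east south\n
// comment west\n
{\n
comment\n
}\n end
"""

# Tokenize everything first: { } comments, // comments, and words.
pattern = re.compile(r'{.*?}|//[^\n]*|\w+', re.DOTALL)
all_tokens = pattern.findall(string.lower())

# Then filter the comment tokens out.
tokens = [t for t in all_tokens if not t.startswith(('//', '{'))]
print(tokens)
# ['begin', 'west', 'west', 'north', 'north', 'north', 'west',
#  'east', 'east', 'south', 'end']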
EDIT:

A version using the Python module sly, which works like the C/C++ tools lex/yacc. The regex for MULTI_LINE_COMMENT I found in another tool for building parsers, lark:
from sly import Lexer

class MyLexer(Lexer):
    # Create it before defining regexes for the tokens
    tokens = { NAME, ONE_LINE_COMMENT, MULTI_LINE_COMMENT }
    ignore = ' \t'

    # Tokens
    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
    ONE_LINE_COMMENT = r'//.*'
    MULTI_LINE_COMMENT = r'{(.|\n)*}'

    # Ignored pattern
    ignore_newline = r'\n+'

    # Extra action for newlines
    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    # Work with errors
    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1
if __name__ == '__main__':
    text = """ beGIn west WEST north//comment1
north north west East east south
// comment west
{
{ comment1 }
comment2
{ comment3 }
}
end
"""
    lexer = MyLexer()
    tokens = lexer.tokenize(text)
    for item in tokens:
        print(item.type, ':', item.value)
Result:
NAME : beGIn
NAME : west
NAME : WEST
NAME : north
ONE_LINE_COMMENT : //comment1
NAME : north
NAME : north
NAME : west
NAME : East
NAME : east
NAME : south
ONE_LINE_COMMENT : // comment west
MULTI_LINE_COMMENT : {
{ comment1 }
comment2
{ comment3 }
}
NAME : end
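If you do not want the comment tokens in the output at all, sly can also throw them away: patterns whose names start with ignore_ are matched and discarded, like the ignore_newline rule above. A minimal sketch of that variant, my own adaptation of the lexer above:

from sly import Lexer

class MyLexer(Lexer):
    tokens = { NAME }
    ignore = ' \t'

    # Declared as ignored patterns, so comments never reach the token stream
    ignore_one_line_comment = r'//.*'
    ignore_multi_line_comment = r'{(.|\n)*}'
    ignore_newline = r'\n+'

    NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

    def ignore_newline(self, t):
        self.lineno += t.value.count('\n')

    def error(self, t):
        print("Illegal character '%s'" % t.value[0])
        self.index += 1

if __name__ == '__main__':
    text = """ beGIn west WEST north//comment1
north north west East east south
// comment west
{ comment }
end
"""
    for tok in MyLexer().tokenize(text):
        print(tok.type, ':', tok.value)  # only NAME tokens remain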