Python3 - Creating a scanner for a compiler and getting errors upon testing
I'm trying to create a scanner for a compiler that reads a simple language. I've created a test file called program which contains:
z := 2;
if z < 3 then
z := 1
end
To run the program I use the terminal, running the command line:
python3 scanner.py program tokens
I want the output to go into the text file tokens, but when I do this nothing shows up. At run time the program runs but doesn't do anything. I tried putting <> around program, but then I got a ValueError: need more than 1 value to unpack.
My code is below:
import re
import sys
class Scanner:
'''The interface comprises the methods lookahead and consume.
Other methods should not be called from outside of this class.'''
def __init__(self, input_file):
'''Reads the whole input_file to input_string, which remains constant.
current_char_index counts how many characters of input_string have
been consumed.
current_token holds the most recently found token and the
corresponding part of input_string.'''
# source code of the program to be compiled
self.input_string = input_file.read()
# index where the unprocessed part of input_string starts
self.current_char_index = 0
# a pair (most recently read token, matched substring of input_string)
self.current_token = self.get_token()
def skip_white_space(self):
'''Consumes all characters in input_string up to the next
non-white-space character.'''
if (self.current_char_index >= len(self.input_string) - 1):
return
while self.input_string[self.current_char_index].isspace():
self.current_char_index += 1
def get_token(self):
'''Returns the next token and the part of input_string it matched.
The returned token is None if there is no next token.
The characters up to the end of the token are consumed.'''
self.skip_white_space()
# find the longest prefix of input_string that matches a token
token, longest = None, ''
for (t, r) in Token.token_regexp:
match = re.match(r, self.input_string[self.current_char_index:])
if match and match.end() > len(longest):
token, longest = t, match.group()
# consume the token by moving the index to the end of the matched part
self.current_char_index += len(longest)
return (token, longest)
def lookahead(self):
'''Returns the next token without consuming it.
Returns None if there is no next token.'''
return self.current_token[0]
def consume(self, *tokens):
'''Returns the next token and consumes it, if it is in tokens.
Raises an exception otherwise.
If the token is a number or an identifier, its value is returned
instead of the token.'''
current = self.current_token
if (len(self.input_string[self.current_char_index:]) == 0):
self.current_token = (None, '') # catches the end-of-file errors so lookahead returns none.
else:
self.current_token = self.get_token() # otherwise we consume the token
if current[0] in tokens: # tokens could be a single token, or it could be group of tokens.
if current[0] is Token.ID or current[0] is Token.NUM: # if token is ID or NUM
return current[1] # return the value of the ID or NUM
else: # otherwise
return current[0] # return the token
else: # if current_token is not in tokens
raise Exception('non-token detected') # raise non-token error
class Token:
# The following enumerates all tokens.
DO = 'DO'
ELSE = 'ELSE'
READ = 'READ'
WRITE = 'WRITE'
END = 'END'
IF = 'IF'
THEN = 'THEN'
WHILE = 'WHILE'
SEM = 'SEM'
BEC = 'BEC'
LESS = 'LESS'
EQ = 'EQ'
GRTR = 'GRTR'
LEQ = 'LEQ'
NEQ = 'NEQ'
GEQ = 'GEQ'
ADD = 'ADD'
SUB = 'SUB'
MUL = 'MUL'
DIV = 'DIV'
LPAR = 'LPAR'
RPAR = 'RPAR'
NUM = 'NUM'
ID = 'ID'
# The following list gives the regular expression to match a token.
# The order in the list matters for mimicking Flex behaviour.
# Longer matches are preferred over shorter ones.
# For same-length matches, the first in the list is preferred.
token_regexp = [
(DO, 'do'),
(ELSE, 'else'),
(READ, 'read'),
(WRITE, 'write'),
(END, 'end'),
(IF, 'if'),
(THEN, 'then'),
(WHILE, 'while'),
(SEM, ';'),
(BEC, ':='),
(LESS, '<'),
(EQ, '='),
(NEQ, '!='),
(GRTR, '>'),
(LEQ, '<='),
(GEQ, '>='),
(ADD, '[+]'), # + is special in regular expressions
(SUB, '-'),
(MUL, '[*]'),
(DIV, '[/]'),
(LPAR, '[(]'), # ( is special in regular expressions
(RPAR, '[)]'), # ) is special in regular expressions
(ID, '[a-z]+'),
(NUM, '[0-9]+'),
]
def indent(s, level):
return ' '*level + s + '\n'
# Initialise scanner.
scanner = Scanner(sys.stdin)
# Show all tokens in the input.
token = scanner.lookahead()
test = ''
while token != None:
if token in [Token.NUM, Token.ID]:
token, value = scanner.consume(token)
print(token, value)
else:
print(scanner.consume(token))
token = scanner.lookahead()
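(Looking back at this, I think the ValueError comes from the unpacking in that loop: for an ID or NUM, consume returns just the matched string, so the assignment tries to split a single value into two names. A rough illustration of what I mean; the name matched below is just for illustration, not part of my script:

matched = scanner.consume(Token.ID)   # for an ID this returns only the matched text, e.g. 'z'
token, value = matched                # unpacking the one-character string 'z' into two names raises the ValueError

This is only my guess at the cause.)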
Sorry if the explanation is poor. Any help with what's going wrong would be great. Thanks.
Solution 1a
I figured out why it wasn't printing to the tokens file. I needed to change my test code to this:
while token != None:
print(scanner.consume(token))
token = scanner.lookahead()
The only problem now is that when it's an ID or NUM I can't tell which it is; it just prints out the identifier or the number without saying what it is. Right now it prints out this:
z
BEC
2
SEM
IF
z
LESS
3
THEN
z
BEC
1
END
I need it to print this:
ID z
BEC
NUM 2
SEM
IF
ID z
LESS
NUM 3
THEN
ID z
BEC
NUM 1
END
I'm thinking of adding an if statement saying that if it's a NUM, print NUM followed by the token, and likewise if it's an ID.
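Roughly the sketch I have in mind, assuming lookahead and consume keep behaving as in the code above:

token = scanner.lookahead()
while token != None:
    if token in (Token.NUM, Token.ID):
        # consume returns only the matched text for NUM and ID,
        # so print the token name in front of it here
        print(token, scanner.consume(token))
    else:
        print(scanner.consume(token))
    token = scanner.lookahead()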
Solution 1b
I just added an if and an elif statement in consume to print NUM and ID. For example, if current[0] is Token.ID then return "ID " + current[1].
I haven't changed anything apart from skip_white_space and consume, and I'm struggling to get it to run...
def skip_white_space(self):
'''Consumes all characters in input_string up to the next
non-white-space character.'''
while self.input_string[self.current_char_index] == '\s':
self.current_char_index += 1
def consume(self, *tokens):
'''Returns the next token and consumes it, if it is in tokens.
Raises an exception otherwise.
If the token is a number or an identifier, not just the token
but a pair of the token and its value is returned.'''
current = self.current_token
if current[0] in tokens:
if current[0] in Token.ID:
return 'ID' + current[1]
elif current[0] in Token.NUM:
return 'NUM' + current[1]
else:
return current[0]
else:
raise Exception('Error in compiling non-token(not apart of token list)')
...I'm having particular trouble trying to get python3 scanner.py <program> tokens to work; any guidance would be a big help, thanks.
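(One thing I notice comparing this with my original code: '\s' is a regex character class, not an actual character, so the == test in the new skip_white_space never matches, and the index can also run past the end of the string on trailing whitespace. Going back to the isspace() version from the original code, a sketch of what I think skip_white_space needs to look like:

def skip_white_space(self):
    '''Consumes all characters in input_string up to the next
    non-white-space character.'''
    # bounds check first so trailing whitespace at the end of the file
    # doesn't push current_char_index past the end of input_string
    while (self.current_char_index < len(self.input_string)
           and self.input_string[self.current_char_index].isspace()):
        self.current_char_index += 1

Then, since the scanner reads sys.stdin, the redirected form python3 scanner.py < program > tokens should send program in on standard input and put the printed tokens into the tokens file.)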