Python3 - Creating a scanner for a compiler and getting errors upon testing

I am trying to create a scanner for a compiler that reads a simple language. I created a test file called program, which contains:

z := 2;
if z < 3 then
   z := 1
end

To run the program I use the terminal, with the command line:

python3 scanner.py program tokens

I expect the output to be placed in the text file tokens, but nothing appears when I do this. At run time the program runs but does nothing. I tried putting < and > around program, but then I got a ValueError: need more than 1 value to unpack.
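For what it's worth, the script below constructs `Scanner(sys.stdin)`, so the bare command line never opens either file; the scanner just waits on standard input. Either run it with shell redirection, `python3 scanner.py < program > tokens`, or open the named file explicitly. A minimal sketch of the explicit alternative (the helper name and argument handling are assumptions, not part of the original script):

```python
import sys

def open_source(argv):
    # Hypothetical helper: open the file named on the command line if one
    # is given, otherwise fall back to reading standard input.
    if len(argv) > 1:
        return open(argv[1])
    return sys.stdin
```

With redirection instead, the shell feeds program to stdin and captures everything the script prints into tokens.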

My code is below:

import re
import sys

class Scanner:
    '''The interface comprises the methods lookahead and consume.
       Other methods should not be called from outside of this class.'''

    def __init__(self, input_file):
        '''Reads the whole input_file to input_string, which remains constant.
           current_char_index counts how many characters of input_string have
           been consumed.
           current_token holds the most recently found token and the
           corresponding part of input_string.'''

        # source code of the program to be compiled
        self.input_string = input_file.read()

        # index where the unprocessed part of input_string starts
        self.current_char_index = 0

        # a pair (most recently read token, matched substring of input_string)
        self.current_token = self.get_token()

    def skip_white_space(self):
        '''Consumes all characters in input_string up to the next
           non-white-space character.'''
        if self.current_char_index >= len(self.input_string) - 1:
            return

        while self.input_string[self.current_char_index].isspace():
            self.current_char_index += 1

    def get_token(self):
        '''Returns the next token and the part of input_string it matched.
           The returned token is None if there is no next token.
           The characters up to the end of the token are consumed.'''
        self.skip_white_space()
        # find the longest prefix of input_string that matches a token
        token, longest = None, ''
        for (t, r) in Token.token_regexp:
            match = re.match(r, self.input_string[self.current_char_index:])
            if match and match.end() > len(longest):
                token, longest = t, match.group()
        # consume the token by moving the index to the end of the matched part
        self.current_char_index += len(longest)
        return (token, longest)

    def lookahead(self):
        '''Returns the next token without consuming it.
           Returns None if there is no next token.'''
        return self.current_token[0]

    def consume(self, *tokens):
        '''Returns the next token and consumes it, if it is in tokens.
           Raises an exception otherwise.
           If the token is a number or an identifier, its value is returned
           instead of the token.'''
        current = self.current_token

        if len(self.input_string[self.current_char_index:]) == 0:
            self.current_token = (None, '')       # catches the end-of-file errors so lookahead returns None
        else:
            self.current_token = self.get_token() # otherwise we consume the token

        if current[0] in tokens:                  # tokens could be a single token, or a group of tokens
            if current[0] is Token.ID or current[0] is Token.NUM:
                return current[1]                 # return the value of the ID or NUM
            else:
                return current[0]                 # otherwise return the token
        else:                                     # if current_token is not in tokens
            raise Exception('non-token detected') # raise non-token error

class Token:
    # The following enumerates all tokens.
    DO    = 'DO'
    ELSE  = 'ELSE'
    READ  = 'READ'
    WRITE = 'WRITE'
    END   = 'END'
    IF    = 'IF'
    THEN  = 'THEN'
    WHILE = 'WHILE'
    SEM   = 'SEM'
    BEC   = 'BEC'
    LESS  = 'LESS'
    EQ    = 'EQ'
    GRTR  = 'GRTR'
    LEQ   = 'LEQ'
    NEQ   = 'NEQ'
    GEQ   = 'GEQ'
    ADD   = 'ADD'
    SUB   = 'SUB'
    MUL   = 'MUL'
    DIV   = 'DIV'
    LPAR  = 'LPAR'
    RPAR  = 'RPAR'
    NUM   = 'NUM'
    ID    = 'ID'

    # The following list gives the regular expression to match a token.
    # The order in the list matters for mimicking Flex behaviour.
    # Longer matches are preferred over shorter ones.
    # For same-length matches, the first in the list is preferred.
    token_regexp = [
        (DO,    'do'),
        (ELSE,  'else'),
        (READ,  'read'),
        (WRITE, 'write'),
        (END,   'end'),
        (IF,    'if'),
        (THEN,  'then'),
        (WHILE, 'while'),
        (SEM,   ';'),
        (BEC,   ':='),
        (LESS,  '<'),
        (EQ,    '='),
        (NEQ,   '!='),
        (GRTR,  '>'),
        (LEQ,   '<='),
        (GEQ,   '>='),
        (ADD,   '[+]'), # + is special in regular expressions
        (SUB,   '-'),
        (MUL,   '[*]'),
        (DIV,   '[/]'),
        (LPAR,  '[(]'), # ( is special in regular expressions
        (RPAR,  '[)]'), # ) is special in regular expressions
        (ID,    '[a-z]+'),
        (NUM,   '[0-9]+'),
    ]

    def indent(s, level):
        return '    '*level + s + '\n'

# Initialise scanner.

scanner = Scanner(sys.stdin)

# Show all tokens in the input.

token = scanner.lookahead()
test = ''

while token != None:
    if token in [Token.NUM, Token.ID]:
        token, value = scanner.consume(token)
        print(token, value)
    else:
        print(scanner.consume(token))
    token = scanner.lookahead()
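As a side note on the ValueError reported above: consume returns a single string (the matched text for an ID or NUM, otherwise the token constant), so `token, value = scanner.consume(token)` ends up unpacking a plain string. A one-character string raises the error; a two-character string unpacks silently into its characters, which is arguably worse. A standalone check (the helper is made up for illustration):

```python
def unpack_pair(s):
    # Unpacking a string iterates it character by character, exactly as
    # the loop above does with the string returned by consume().
    token, value = s
    return token, value

try:
    unpack_pair('z')          # one character: nothing left for `value`
    raised = False
except ValueError:            # "not enough values to unpack"
    raised = True
# unpack_pair('23') does not raise: it yields ('2', '3'), silently wrong.
```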

Sorry if the explanation is poor. Any help with what is going wrong would be great. Thanks.
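For reference, the longest-match rule in get_token can be exercised on its own: for input starting with '<=', the LEQ pattern wins over LESS because its match is longer, regardless of where it sits in the list. A self-contained sketch with a trimmed token table (the function name is made up for illustration):

```python
import re

# Trimmed-down copy of the token table and the longest-prefix scan
# from get_token, covering only the comparison operators.
token_regexp = [('LESS', '<'), ('EQ', '='), ('LEQ', '<='), ('GEQ', '>=')]

def longest_match(text):
    token, longest = None, ''
    for t, r in token_regexp:
        m = re.match(r, text)
        if m and m.end() > len(longest):
            token, longest = t, m.group()
    return token, longest
```

Here longest_match('<= 3') returns ('LEQ', '<='), while longest_match('< 3') returns ('LESS', '<').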

Solution 1a

I figured out why it was not printing to the file tokens. I needed to change my test code to this:

while token != None:
    print(scanner.consume(token))
    token = scanner.lookahead()

The only problem now is that when the token is an ID or a NUM I cannot tell which it is; it just prints the identifier or the number without saying what it is. At the moment it prints this:

z
BEC
2
SEM
IF
z
LESS
3
THEN
z
BEC
1
END

I need it to print this:

ID z
BEC
NUM 2
SEM
IF
ID z
LESS
NUM 3
THEN
ID z
BEC
NUM 1
END

I am thinking of adding an if statement saying that if it is a NUM, print NUM followed by the token, and likewise if it is an ID.
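That if statement can be tried in isolation before wiring it into consume; a minimal sketch (the helper name is hypothetical):

```python
def format_token(token, value):
    # Hypothetical helper: identifiers and numbers print as the token
    # name followed by the matched text; everything else prints as
    # just the token name.
    if token in ('ID', 'NUM'):
        return token + ' ' + value
    return token
```

Here format_token('ID', 'z') gives 'ID z' and format_token('BEC', ':=') gives just 'BEC', matching the desired output above.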

Solution 1b

I just added an if and an elif statement inside consume to print NUM and ID. For example, if current[0] is Token.ID then return "ID " + current[1].

Apart from skip_white_space and consume I have not changed anything, and I am having a hard time getting it to run...

def skip_white_space(self):
    '''Consumes all characters in input_string up to the next
       non-white-space character.'''

    while self.input_string[self.current_char_index] == '\s':
        self.current_char_index += 1

def consume(self, *tokens):
    '''Returns the next token and consumes it, if it is in tokens.
       Raises an exception otherwise.
       If the token is a number or an identifier, not just the token
       but a pair of the token and its value is returned.'''
    current = self.current_token

    if current[0] in tokens:
        if current[0] in Token.ID:
            return 'ID' + current[1]
        elif current[0] in Token.NUM:
            return 'NUM' + current[1]
        else:
            return current[0]
    else:
        raise Exception('Error in compiling non-token(not apart of token list)')

...I am particularly struggling with getting python3 scanner.py < program > tokens to work; any guidance would help me a great deal, thanks.
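Two details in that skip_white_space are worth double-checking: '\s' is only a character class inside a regular expression, so comparing a single character to the two-character string '\s' never succeeds (str.isspace() is the usual test), and the loop can index one past the end of the input once trailing whitespace is reached. A corrected sketch with a bounds check (written as a standalone function for illustration, not as the class method):

```python
def skip_white_space(input_string, index):
    # Advance index past whitespace using str.isspace(), stopping at the
    # end of the string instead of indexing one past it.
    while index < len(input_string) and input_string[index].isspace():
        index += 1
    return index
```

For example, skip_white_space('  if z', 0) returns 2, the index of the first non-space character; on all-whitespace input it returns len(input_string) instead of raising IndexError.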