Error with new line on Lex - Python

Question

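I have a problem with the lexer I built with PLY.

When I pass the code of a for loop to my program, it does not recognize the newline between { and }. It reports an error instead, even though a t_newline(t) function is defined.

The input to the program is: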

for(int i = 0 ; i < 5 ; i++){
}

And the output of the program is:

 1 . analizadorLexico.py
 2 . analizadorSintactico.py
 3 . parser.out
 4 . parsetab.py
 5 . prueba1.txt
 6 . cpp.py
 7 . ctokens.py
 8 . lex.py
 9 . yacc.py
 10 . ygen.py
 11 . __init__.py
 12 . lex.cpython-36.pyc
 13 . yacc.cpython-36.pyc
 14 . __init__.cpython-36.pyc
 15 . analizadorLexico.cpython-36.pyc
 16 . parsetab.cpython-36.pyc

File number: 5
5
Escogido el archivoprueba1.txt
LexToken(FOR,'FOR',1,0)
LexToken(PA,'(',1,3)
LexToken(INT,'INT',1,4)
LexToken(ID,'i',1,8)
LexToken(ASSIGN,'=',1,10)
LexToken(NUMBER,0,1,12)
LexToken(END,';',1,14)
LexToken(ID,'i',1,16)
LexToken(LT,'<',1,18)
LexToken(NUMBER,5,1,20)
LexToken(END,';',1,22)
LexToken(ID,'i',1,24)
LexToken(PLUS,'+',1,25)
LexToken(PLUS,'+',1,26)
LexToken(PC,')',1,27)
LexToken(CA,'{',1,28)
Error in '
'
LexToken(CC,'}',2,31)

The code is:

import codecs
import os

import ply.lex as lex

reservados = ['FOR', 'AND', 'OR', 'NOT', 'XOR', 'INT', 'FLOAT', 'DOUBLE',
              'SHORT', 'LONG', 'BOOL']
tokens = reservados + [
        'ID',
        'NUMBER',
        'PLUS',
        'MINUS',
        'TIMES',
        'DIVIDE',
        'DIVE',
        'ASSIGN',
        'LT',
        'MA',
        'LTE',
        'MAE',
        'DIF',
        'PA',
        'PC',
        'ANDC',
        #'ORC',
        'NOTC',
        'MOD',
        'CMP',
        'END',
        'COMMA',
        'CA',
        'CC',
        #'ES'

]
t_ignore = ' \t'
t_ignore_WHITESPACES = r'[ \t]+'
t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_ASSIGN = r'='
t_LT = r'<'
t_MA = r'>'
t_LTE = r'<='
t_MAE = r'>='
t_DIF = r'\!='
t_PA = r'\('
t_PC = r'\)'
t_ANDC = r'\&&'
#t_ORC = r'\||'
t_NOTC = r'\!'
t_DIVE = r'\\'
t_MOD = r'\%'
t_CMP = r'=='
t_END = r'\;'
t_COMMA = r'\,'
t_CA = r'{'
t_CC = r'}'
#t_ES = r'\ '

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_ID(t):
    r'[a-zA-Z_][a-zA-Z0-9_]*'
    # Upper-case the identifier and retag it when it matches a reserved word,
    # regardless of the case it was written in.
    if t.value.upper() in reservados:
        t.value = t.value.upper()
        t.type = t.value

    return t

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)    
    return t

def t_error(t):
    print ("Error de sintaxis '%s'" % t.value[0])
    t.lexer.skip(1)


def buscarFicheros(directorio):
    ficheros = []

    numArchivo = ''
    respuesta = False
    cont = 1

    for dirName, subdirList, fileList in os.walk(directorio):
        #print('Directorio encontrado: %s' % dirName)
        for fname in fileList:
            ficheros.append(fname)

    for file in ficheros:
        print ("",cont,".",file)
        cont = cont + 1

    while respuesta == False:
        numArchivo = input('\nNumero del archivo: ')
        print (numArchivo)
        for file in ficheros:
            if file == ficheros[int(numArchivo) - 1]:
                respuesta = True
                break

    print ("Escogido el archivo" + ficheros[int(numArchivo) - 1])
    return ficheros[int(numArchivo) - 1]

directorio = r'C:/Users/Carlos/Desktop/for c++/'
archivo = buscarFicheros(directorio)
test = directorio + archivo

fp = codecs.open(test, "r", "utf-8")
cadena = fp.read()
fp.close()

analizador = lex.lex()
analizador.input(cadena)

while True:
    tok = analizador.token()
    if not tok : break
    print (tok)

Thanks for your help.

Answer 1

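I think the most likely explanation is that the error is caused by Windows line endings (\r\n). \r is not among the characters you ignore and no rule handles it, so it falls through to the error handler.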
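One way to check this hypothesis is to look at what actually reaches the lexer. The sketch below is not taken from the question: it writes a hypothetical stand-in for prueba1.txt with Windows line endings to a temporary file and reads it back twice. codecs.open opens the underlying file in binary mode and does no newline conversion, while the built-in open() in text mode translates \r\n to \n by default.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for prueba1.txt, saved with Windows line endings.
sample = b"for(int i = 0 ; i < 5 ; i++){\r\n}"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # codecs.open reads the bytes untranslated, so "\r\n" survives decoding.
    with codecs.open(path, "r", "utf-8") as fp:
        via_codecs = fp.read()

    # Built-in open() in text mode applies universal newlines: "\r\n" becomes "\n".
    with open(path, "r", encoding="utf-8") as fp:
        via_open = fp.read()
finally:
    os.remove(path)

print(repr(via_codecs[-4:]))  # tail of the text as the question's code would read it
print(repr(via_open[-3:]))    # tail of the text after newline translation
```

If printing repr(cadena) in your program shows a \r before the \n, the input file has Windows line endings and the \r is what the lexer is complaining about.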

If that is the problem, the simplest solution is to add \r to t_ignore. (I don't see any point in having both t_ignore and t_ignore_WHITESPACES, so I suggest removing one of them.)
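A minimal sketch of that change, assuming PLY is importable as ply.lex. It is a cut-down lexer with only the two brace tokens from the question, not your full token set, and the errors list is a hypothetical helper added here so the result is easy to inspect.

```python
import ply.lex as lex

# Only the brace tokens from the question; everything else is left out.
tokens = ['CA', 'CC']

# Suggested change: "\r" joins space and tab in the ignored characters.
t_ignore = ' \t\r'

t_CA = r'\{'
t_CC = r'\}'

def t_newline(t):
    r'\n+'
    # Count lines, emit no token.
    t.lexer.lineno += len(t.value)

errors = []  # hypothetical helper: collects characters that hit t_error

def t_error(t):
    errors.append(t.value[0])
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('{\r\n}')  # same shape as the failing input, with a Windows line ending

result = [(tok.type, tok.lineno) for tok in lexer]
print(result)
print(errors)
```

With \r in t_ignore the carriage return should be skipped silently, t_newline should still count the line, and } should be reported on line 2 with nothing collected in errors.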

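However, I can't reproduce the error output you show. Nothing in the posted code could print the string Error in '..., so the output was probably pasted from a different version of the code.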
