Using ANTLR to parse a Python setup file
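I have a Java program that has to parse a Python setup.py file to extract information from it. I sort of have it working, but I have hit a wall. I am starting with a simple raw file; once I get that running, I will worry about stripping out the noise I don't want so that it reflects an actual file.
Here is my grammar: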
grammar SetupPy ;

file_input: (NEWLINE | setupDeclaration)* EOF;
setupDeclaration : 'setup' '(' method ')';
method : setupRequires testRequires;
setupRequires : 'setup_requires' '=' '[' LISTVAL* ']' COMMA;
testRequires : 'tests_require' '=' '[' LISTVAL* ']' COMMA;

WS: [ \t\n\r]+ -> skip ;
COMMA : ',' -> skip ;
LISTVAL : SHORT_STRING ;
UNKNOWN_CHAR
    : .
    ;

fragment SHORT_STRING
    : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
    | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
    ;

/// stringescapeseq ::= "\" <any source character>
fragment STRING_ESCAPE_SEQ
    : '\\' .
    | '\\' NEWLINE
    ;

fragment SPACES
    : [ \t]+
    ;

NEWLINE
    : ( {atStartOfInput()}? SPACES
      | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
      )
      {
        String newLine = getText().replaceAll("[^\r\n\f]+", "");
        String spaces = getText().replaceAll("[\r\n\f]+", "");
        int next = _input.LA(1);
        if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
          // If we're inside a list or on a blank line, ignore all indents,
          // dedents and line breaks.
          skip();
        }
        else {
          emit(commonToken(NEWLINE, newLine));
          int indent = getIndentationCount(spaces);
          int previous = indents.isEmpty() ? 0 : indents.peek();
          if (indent == previous) {
            // skip indents of the same size as the present indent-size
            skip();
          }
          else if (indent > previous) {
            indents.push(indent);
            emit(commonToken(Python3Parser.INDENT, spaces));
          }
          else {
            // Possibly emit more than 1 DEDENT token.
            while(!indents.isEmpty() && indents.peek() > indent) {
              this.emit(createDedent());
              indents.pop();
            }
          }
        }
      }
    ;
And my current test file (as I said, stripping the noise out of a normal file is the next step):
setup(
    setup_requires=['pytest-runner'],
    tests_require=['pytest', 'unittest2'],
)
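Where I am stuck is how to tell ANTLR that setup_requires and tests_require contain arrays. I want the values of those arrays regardless of whether someone uses single quotes, double quotes, puts each value on its own line, or any combination of the above. I have no idea how to pull that off. Can I get some help? Maybe an example or two?
Things to note:
- No, I can't use Jython and just run the file.
- Regex is not an option, because developer styles for this file vary hugely.
Of course, after this problem I still have to figure out how to strip the noise out of a normal file. I tried using the Python3 grammar for that, but I am new to ANTLR and it overwhelmed me. I couldn't figure out how to write rules to extract the values, so I decided to try a much simpler grammar. And quickly hit another wall.
Edit
Here is an actual setup.py file that it will ultimately have to parse. Keep in mind that setup_requires and tests_require may or may not be present, and may or may not appear in that order.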
# -*- coding: utf-8 -*-
from __future__ import with_statement
from setuptools import setup


def get_version(fname='mccabe.py'):
    with open(fname) as f:
        for line in f:
            if line.startswith('__version__'):
                return eval(line.split('=')[-1])


def get_long_description():
    descr = []
    for fname in ('README.rst',):
        with open(fname) as f:
            descr.append(f.read())
    return '\n\n'.join(descr)


setup(
    name='mccabe',
    version=get_version(),
    description="McCabe checker, plugin for flake8",
    long_description=get_long_description(),
    keywords='flake8 mccabe',
    author='Tarek Ziade',
    author_email='tarek@ziade.org',
    maintainer='Ian Cordasco',
    maintainer_email='graffatcolmingov@gmail.com',
    url='https://github.com/pycqa/mccabe',
    license='Expat license',
    py_modules=['mccabe'],
    zip_safe=False,
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
    entry_points={
        'flake8.extension': [
            'C90 = mccabe:McCabeChecker',
        ],
    },
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Topic :: Software Development :: Quality Assurance',
    ],
)
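While trying to debug and simplify, I realized that I don't need to find the method, only the values. So I am playing with this grammar: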
grammar SetupPy ;

file_input: (ignore setupRequires ignore | ignore testRequires ignore )* EOF;
setupRequires : 'setup_requires' '=' '[' dependencyValue* (',' dependencyValue)* ']';
testRequires : 'tests_require' '=' '[' dependencyValue* (',' dependencyValue)* ']';
dependencyValue: LISTVAL;
ignore : UNKNOWN_CHAR? ;

LISTVAL: SHORT_STRING;
UNKNOWN_CHAR: . -> channel(HIDDEN);

fragment SHORT_STRING
    : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
    | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
    ;

fragment STRING_ESCAPE_SEQ
    : '\\' .
    | '\\'
    ;
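It works great for the simple file and even handles the out-of-order problem. But it does not work on the full file; it gets stuck on the equals sign in this line:
def get_version(fname='mccabe.py'):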
I've examined your grammar and simplified it quite a bit. I took out all the Python-esque whitespace handling and just treated whitespace as whitespace. This grammar parses the input below, which covers the cases you listed in the question: one item per line, single quotes, and double quotes.
setup(
setup_requires=['pytest-runner'],
tests_require=['pytest',
'unittest2',
"test_3" ],
)
And here's the much-simplified grammar:
grammar SetupPy ;

setupDeclaration : 'setup' '(' method ')' EOF;
method : setupRequires testRequires ;
setupRequires : 'setup_requires' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ;
testRequires : 'tests_require' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ;

WS: [ \t\n\r]+ -> skip ;
LISTVAL : SHORT_STRING ;

fragment SHORT_STRING
    : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
    | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
    ;

fragment STRING_ESCAPE_SEQ
    : '\\' .
    | '\\'
    ;
And here's the lexer token output, showing that each token is assigned the correct type:
[@0,0:4='setup',<'setup'>,1:0]
[@1,5:5='(',<'('>,1:5]
[@2,12:25='setup_requires',<'setup_requires'>,2:4]
[@3,26:26='=',<'='>,2:18]
[@4,27:27='[',<'['>,2:19]
[@5,28:42=''pytest-runner'',<LISTVAL>,2:20]
[@6,43:43=']',<']'>,2:35]
[@7,44:44=',',<','>,2:36]
[@8,51:63='tests_require',<'tests_require'>,3:4]
[@9,64:64='=',<'='>,3:17]
[@10,65:65='[',<'['>,3:18]
[@11,66:73=''pytest'',<LISTVAL>,3:19]
[@12,74:74=',',<','>,3:27]
[@13,79:89=''unittest2'',<LISTVAL>,4:1]
[@14,90:90=',',<','>,4:12]
[@15,95:102='"test_3"',<LISTVAL>,5:1]
[@16,104:104=']',<']'>,5:10]
[@17,105:105=',',<','>,5:11]
[@18,108:108=')',<')'>,6:0]
[@19,109:108='<EOF>',<EOF>,6:1]
Now you should be able to follow a simple ANTLR Visitor or Listener pattern to grab your LISTVAL tokens and do your thing with them. I hope this meets your needs. It certainly parses your test input well.
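One detail worth noting from the token dump: the text of each LISTVAL token still carries its surrounding quotes (for example 'pytest' and "test_3"), so whatever your listener or visitor collects will probably need unquoting before use. Here is a minimal sketch of that step in plain Java. The class ListValText and its methods are hypothetical names of my own, not part of ANTLR, and the sketch assumes you only care about escaped quotes and escaped backslashes; any other escape sequence is left untouched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of ANTLR) for cleaning up LISTVAL token text. */
final class ListValText {
    private ListValText() {}

    /**
     * Removes the matching single or double quotes around a LISTVAL token's
     * text and resolves \' \" and \\ inside it. Other backslash sequences
     * are copied through unchanged (an assumption of this sketch).
     */
    static String unquote(String tokenText) {
        int end = tokenText.length() - 1;
        if (end < 1) {
            throw new IllegalArgumentException("not a quoted string: " + tokenText);
        }
        char quote = tokenText.charAt(0);
        if ((quote != '\'' && quote != '"') || tokenText.charAt(end) != quote) {
            throw new IllegalArgumentException("not a quoted string: " + tokenText);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < end; i++) {
            char c = tokenText.charAt(i);
            if (c == '\\' && i + 1 < end) {
                char next = tokenText.charAt(i + 1);
                if (next == '\\' || next == '\'' || next == '"') {
                    out.append(next); // drop the backslash, keep the escaped character
                    i++;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Applies unquote to every token text, preserving the order of the list. */
    static List<String> unquoteAll(List<String> tokenTexts) {
        List<String> values = new ArrayList<>();
        for (String text : tokenTexts) {
            values.add(unquote(text));
        }
        return values;
    }
}
```

In a listener you would pass in the text of each LISTVAL terminal found under setupRequires or testRequires, and keep the resulting plain strings as your dependency names.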