Parsing a string containing code into a list / tree in Python
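As the title says, I am trying to parse a piece of code into a tree or a list.
First of all, thank you for any contribution and time spent on this.
So far my code does what I expect, but I am not sure it is the best or most generic way to do it.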
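Problem
1. I want a more generic solution, because in the future I will need to analyse this syntax further.
2. Right now I cannot split off operators such as "=" or ">=", as the output I share below shows.
In the future I will probably change the contents of the list/tree from strings to tuples, so that I can identify the kind of each item (parameter, comparison operator such as = or >=, and so on). That is not a real need right now, though.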
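The tuple idea mentioned above could look roughly like the sketch below. It is only an illustration: the function names `classify` and `tag_tree`, the kind labels, and the number pattern are assumptions, and the operator set is copied from `COMPARATOR_SIGN` in the code further down.

```python
import re

# Operators copied from COMPARATOR_SIGN in the question's code.
OPERATORS = {'@=', '@<>', '<>', '>=', '<=', '=', '>', '<'}


def classify(token):
    """Return a (kind, value) tuple for one leaf string (hypothetical helper)."""
    if token in OPERATORS:
        return ('comparison', token)
    if len(token) >= 2 and token.startswith("'") and token.endswith("'"):
        return ('string', token)
    if token.startswith('!'):
        return ('variable', token)
    # Assumed number format: optional sign, digits, optional decimal part.
    if re.fullmatch(r'[+-]?\d+(\.\d+)?', token):
        return ('number', token)
    return ('unknown', token)


def tag_tree(nodes):
    """Walk the list/dict tree and replace every leaf string with a tuple."""
    tagged = []
    for node in nodes:
        if isinstance(node, dict):
            # A dict maps a function name to its list of arguments.
            tagged.append({name: tag_tree(args) for name, args in node.items()})
        else:
            tagged.append(classify(node))
    return tagged


tree = [{'if': [{'ATTRS': ["'ID'", '!Element Structure']}, '=', "'3'", '6']}]
print(tag_tree(tree))
```

Tagging as a separate pass keeps the parser itself unchanged, so the tuples can be added later without touching the splitting logic.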
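Research
My first attempt was to parse the text character by character, but the code became so messy that it was almost unreadable, so I assumed I was doing something wrong (I no longer have that code to share here).
So I started looking at how other people do it, and found some approaches that did not necessarily meet my requirements of simplicity and generality.
I would share links to those sites, but I did not keep track of them.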
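Syntax of the code
The syntax is quite simple; I am not interested in types or any further detail, only in the functions and their parameters.
Strings are written as 'my string', variables as !variable, and numbers as in any other language.
Here is a sample of the code:
<pre><code>db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)</code></pre>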
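The token rules described above can be written down as a single regular expression. The sketch below is one possible reading of them, inferred from this one sample only. It assumes that strings contain no escaped quotes and that a variable runs until the next comma, parenthesis, quote or operator character; `TOKEN_RE` and `tokenize` are illustrative names.

```python
import re

# Assumed token grammar, inferred from the single sample above.
TOKEN_RE = re.compile(
    r"""
    (?P<string>'[^']*')                  # quoted string, no escaped quotes assumed
  | (?P<variable>![^(),=<>@']+)          # variable starting with !, may contain spaces
  | (?P<operator>@<>|@=|<>|>=|<=|=|>|<)  # multi-character operators are tried first
  | (?P<punct>[(),])                     # structural characters
  | (?P<word>[^\s(),=<>@'!]+)            # function names and numbers
    """,
    re.VERBOSE,
)


def tokenize(code):
    """Split the code into a flat list of token strings."""
    tokens = []
    pos = 0
    while pos < len(code):
        if code[pos].isspace():
            pos += 1
            continue
        match = TOKEN_RE.match(code, pos)
        if match is None:
            raise ValueError(f"unexpected character {code[pos]!r} at position {pos}")
        # strip() removes the trailing space a variable may pick up before an operator.
        tokens.append(match.group().strip())
        pos = match.end()
    return tokens


sample = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
print(tokenize(sample))
```

Because the operators are tokens of their own here, "=" and "'3'" come out as two separate items instead of the single "= '3'" string.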
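My output
My output here is only partially right, because I still cannot separate the "= '3'" part (which of course has to be separated, since in this case = is a comparison operator and not part of a string).
<pre><code>[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]</code></pre>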
Desired output
<pre><code>[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]</code></pre>
My code so far
The parseRecursive method is the entry point.
<pre><code>import re


class FileParser:
    # Order is important to avoid bad splits, so this is a tuple rather than
    # a set (a set has no guaranteed iteration order).
    COMPARATOR_SIGN = (
        '@=',
        '@<>',
        '<>',
        '>=',
        '<=',
        '=',
        '>',
        '<',
    )

    def __init__(self):
        pass

    def __charExistsInOccurences(self, current_needle, needles, text):
        """
        check if other needles are present in text
        current_needle : string -> the current needle being evaluated
        needles : list -> list of needles
        text : string/list<string> -> a string or a list of strings to evaluate
        """
        # if text is a string convert it to a list of strings
        text = text if isinstance(text, list) else [text]
        exists = False
        for t in text:
            # check if a needle is inside the text value
            for needle in needles:
                # don't check the same key
                if needle != current_needle:
                    regex_search_needle = r'\s*' + r'\s*'.join(needle) + r'\s*'
                    # list of 1's and 0's: 1 if another needle is found.
                    # Note that x iterates over the single characters of t.
                    found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
                    if sum(found) > 0:
                        exists = True
                        break
        return exists

    def findOperator(self, needles, haystack):
        """
        split parameters from operators
        needles : list -> list of operators
        haystack : string
        """
        string_open = haystack.find("'")
        # if no string has been found set the index to 0
        if string_open < 0:
            string_open = 0
        occurences = []
        string_closure = haystack.rfind("'")
        operator = ''
        for needle in needles:
            # regex to ignore possible spaces between the characters of the needle
            split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
            # parse parameters before and after the string
            before_string = re.split(split_regex, haystack[0:string_open])
            after_string = re.split(split_regex, haystack[string_closure+1:])
            # check if any other needle exists in the results found
            before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
            after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)
            # if the operator has been found, merge the results into occurences
            # and assign the operator
            if not before_string_exists and not after_string_exists:
                occurences.extend(before_string)
                occurences.extend([haystack[string_open:string_closure+1]])
                occurences.extend(after_string)
                operator = needle
        # filter out the blank strings generated by the splits
        occurences = list(filter(lambda x: len(x.strip()) > 0, occurences))
        result_check = [1 if x == haystack else 0 for x in occurences]
        # if the haystack was originally a simple string like '1', occurences is
        # filled with the same value over and over because of the before-string
        # and after-string parts
        if len(result_check) == sum(result_check):
            occurences = [haystack]
            operator = ''
        return operator, occurences

    def parseRecursive(self, text):
        """
        parse a block of text
        text : string
        """
        # The original `assert(len(text) < 1, "text is empty")` asserted a
        # non-empty tuple, so it never fired. Empty text is valid here (the
        # recursion passes '' for the tail after ')') and yields an empty list.
        function_open = text.find('(')
        accumulated_params = []
        if function_open > -1:
            # there is another function nested
            text_prev_function = text[0:function_open]
            # find the last space, comma or equals sign to retrieve the function name
            last_space = -1
            for j in range(len(text_prev_function)-1, 0, -1):
                if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
                    last_space = j
                    break
            func_name = ''
            if last_space > -1:
                # there is something else before the function name
                func_name = text_prev_function[last_space+1:]
                # no parenthesis before, so the characters preceding the
                # function name are parameters
                text_prev_func_params = list(filter(lambda x: len(x.strip()) > 0, text_prev_function[:last_space+1].split(',')))
                text_prev_func_params = [x.strip() for x in text_prev_func_params]
                for itext_prev in text_prev_func_params:
                    operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN, itext_prev)
                    if operator == '':
                        accumulated_params.extend(text_prev_operator)
                    else:
                        text_prev_operator.append(operator)
                        accumulated_params.extend(text_prev_operator)
            else:
                # the function name is the start of the string
                func_name = text_prev_function[0:].strip()
            # find the closing parenthesis
            function_close = text.rfind(')')
            # parse the next function and extend the current list of parameters
            next_func = text[function_open+1:function_close]
            func_params = {func_name: self.parseRecursive(next_func)}
            accumulated_params.append(func_params)
            # parameters after the function
            new_text = text[function_close+1:]
            accumulated_params.extend(self.parseRecursive(new_text))
        else:
            # there is no other function nested
            split_text = text.split(',')
            current_func_params = list(filter(lambda x: len(x.strip()) > 0, split_text))
            current_func_params = [x.strip() for x in current_func_params]
            accumulated_params.extend(current_func_params)
        return accumulated_params


text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
print(obj.parseRecursive(text))</code></pre>
You can use pyparsing to deal with this case.
* pyparsing can be installed with pip install pyparsing
Code:
<pre><code>import pyparsing as pp

# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)


# A recursive function to transform a pyparsing result into your desired format
def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            # A nested list holds the arguments of the name just before it
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack


# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"

# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)

# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]

# Show the result
print(result)</code></pre>
Output:
<pre><code>[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]</code></pre>
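To see what transform does on its own, it can be run on a hand-written nested list, with no pyparsing involved. The sketch below assumes the input has the same shape the answer feeds into it, that is, a function name followed directly by a list of its arguments. The `nested` value is made up for illustration.

```python
def transform(elements):
    """Turn [name, [args...], ...] into [{name: [args...]}, ...] recursively."""
    stack = []
    for e in elements:
        if isinstance(e, list):
            # The item pushed just before a list is taken as the function name.
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack


# Hand-written input: a name followed by the list of its arguments.
nested = ['db', ["'1'", 'if', ['ATTRS', ["'ID'"], '=', "'3'"], '6']]
print(transform(nested))

# A list with no name in front of it leaves nothing to pop.
try:
    transform([['a']])
except IndexError as error:
    print('IndexError:', error)
```

The second call shows where an IndexError can come from: a nested list that is not preceded by a name hits `stack.pop()` on an empty stack.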
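Notes:
- If the parentheses are unbalanced (for example a(b(c), a(b)c), etc.), you will get an unexpected result or an IndexError will be raised, so be careful in such cases.
- At the moment there is only one sample available for building the parsing pattern. If you run into parse errors, please provide more examples in your question.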
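One way to guard against the unbalanced-parentheses case is to validate the input before parsing it. The function below is a minimal sketch, not part of pyparsing. The name `check_parentheses` is hypothetical, and it assumes strings are delimited by single quotes with no escaping, as in the sample.

```python
def check_parentheses(code):
    """Raise ValueError if parentheses outside of strings are unbalanced."""
    depth = 0
    in_string = False
    for pos, char in enumerate(code):
        if char == "'":
            # Assumes plain '...' strings without escaped quotes.
            in_string = not in_string
        elif in_string:
            continue  # parentheses inside a string do not count
        elif char == '(':
            depth += 1
        elif char == ')':
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched ')' at position {pos}")
    if in_string:
        raise ValueError("unterminated string")
    if depth > 0:
        raise ValueError(f"{depth} unclosed '(' at end of input")


check_parentheses("db('1', if(ATTRS('Dim (1', !x) = '3'), 6)")  # passes silently

for broken in ("a(b(c)", "a(b)c)"):
    try:
        check_parentheses(broken)
    except ValueError as error:
        print(broken, '->', error)
```

Calling it before `pattern.parse_string(...)` turns both problem inputs from the first note into an explicit error instead of a surprising result.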