Parsing a string containing code into a list / tree in Python

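As the title says, I am trying to parse a piece of code into a tree or a list. First of all, thank you for any contribution and time spent on this. So far my code does what I expect, but I am not sure it is the best or most generic way to do it.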

Problem

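1. I want a more generic solution, because I will need to analyze this syntax further in the future.
2. Right now I cannot split out operators such as "=" or ">=", as shown in the output I share below. In the future I may change the contents of the list/tree from strings to tuples, so that I can identify the type of operator (parameter, comparison such as = or >=, ...). That is not a real need right now, though.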

Research

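My first attempt was to parse the text character by character, but the code became so messy that it was almost unreadable, so I assumed I was doing something wrong (I no longer have that code to share here). I then looked at how other people do this and found some approaches that did not necessarily meet my requirements of simplicity and generality. I would share links to those sites, but I did not keep track of them.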

Syntax of the code

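The syntax is quite simple; after all, I am not interested in types or any further details, only in functions and parameters. Strings are written as 'my string', variables as !variable, and numbers as in any other language. Here is a code sample:

db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)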
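For what it is worth, the syntax described above could be split into typed tokens with a single regular expression. The following is only a minimal sketch: the token names (STRING, VARIABLE, ...) and the `tokenize` helper are made up for illustration, and it assumes that strings never contain a single quote and that variable names never contain parentheses, commas, or operator characters.

```python
import re

# Hypothetical token kinds; the names are illustrative, not from the original post.
TOKEN_SPEC = [
    ("STRING",   r"'[^']*'"),                 # 'my string'
    ("VARIABLE", r"![^(),=<>@]+"),            # !variable, may contain spaces
    ("NUMBER",   r"\d+(?:\.\d+)?"),           # 6 or 6.5
    ("OPERATOR", r"@<>|@=|<>|>=|<=|=|>|<"),   # longer operators listed first
    ("NAME",     r"[A-Za-z_]\w*"),            # function names such as db, if, ATTRS
    ("LPAREN",   r"\("),
    ("RPAREN",   r"\)"),
    ("COMMA",    r","),
    ("SKIP",     r"\s+"),                     # whitespace between tokens
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, value) tuples for the given source text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            # strip() removes the trailing space a variable may pick up
            tokens.append((match.lastgroup, match.group().strip()))
        pos = match.end()
    return tokens


tokens = tokenize("if(ATTRS('Dim 1', !Element Structure, 'ID') >= 6)")
```

Because every token carries its kind, an operator such as >= is already separated from its operands before any tree is built.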

My output

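My output is only partially correct, because I still cannot split off the "= '3'" part (I do have to split it, because in this case it is a comparison operator and not part of a string):

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]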

Desired output

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]

My code so far

The parseRecursive method is the entry point.

import re

class FileParser:
    #order is important to avoid miss splits
    COMPARATOR_SIGN = {
        '@='
        ,'@<>'
        ,'<>'
        ,'>='
        ,'<='
        ,'='
        ,'>'
        ,'<'
    }

    def __init__(self):
        pass

    def __charExistsInOccurences(self,current_needle, needles, text):
        """
        check if other needles are present in text
        current_needle : string -> the current needle being evaluated
        needles : list -> list of needles
        text : string/list<string> -> a string or a list of string to evaluate
        """
        #if text is a string convert it to list of strings
        text = text if isinstance(text, list) else [text]
        exists = False
        for t in text:
            #check if needle is inside text value
            for needle in needles:
                #dont check the same key
                if needle != current_needle:
                    regex_search_needle = split_regex = '\s*'+'\s*'.join(needle) + '\s*'
                    #list of 1's and 0's . 1 if another character is found in the string.
                    found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
                    if sum(found) > 0:
                        exists = True
                        break
        return exists

    def findOperator(self, needles, haystack):
        """
        split parameters from operators
        needles : list -> list of operators
        haystack : string
        """
        string_open = haystack.find("'")
        #if no string has been found set the index to 0
        if string_open < 0:
            string_open = 0
        occurences = []
        string_closure = haystack.rfind("'")
        operator = ''
        for needle in needles:
            #regex to ignore the possible spaces between characters of the needle
            split_regex = '\s*'+'\s*'.join(needle) + '\s*'
            #parse parameters before and after the string
            before_string = re.split(split_regex, haystack[0:string_open])
            after_string = re.split(split_regex, haystack[string_closure+1:])
            #check if any other needle exists in the results found
            before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
            after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)
            #if the operator has been found merge the results with the occurences and assign the operator
            if not before_string_exists and not after_string_exists:
                occurences.extend(before_string)
                occurences.extend([haystack[string_open:string_closure+1]])
                occurences.extend(after_string)
                operator = needle
        #filter blank spaces generated
        occurences = list(filter(lambda x: len(x.strip())>0,occurences))
        result_check = [1 if x==haystack else 0 for x in occurences]
        #if the haystack was originaly a simple string like '1' the occurences list is going to be filled with the same character over and over due to the before string an after string part
        if len(result_check) == sum(result_check):
            occurences= [haystack]
            operator = ''
        return operator, occurences

    def parseRecursive(self,text):
        """
        parse a block of text
        text : string
        """
        assert(len(text) < 1, "text is empty")
        function_open = text.find('(')
        accumulated_params = []
        if function_open > -1:
            #there is another function nested
            text_prev_function = text[0:function_open]
            #find last space coma or equal to retrieve the function name
            last_space = -1
            for j in range(len(text_prev_function)-1, 0 , -1):
                if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
                    last_space = j
                    break
            func_name = ''
            if last_space > -1:
                #there is something else behind the function name
                func_name = text_prev_function[last_space+1:]
                #no parentesis before so previous characters from function name are parameters
                text_prev_func_params = list(filter(lambda x: len(x.strip())>0,text_prev_function[:last_space+1].split(',')))
                text_prev_func_params = [x.strip() for x in text_prev_func_params]
                #debug here
                #accumulated_params.extend(text_prev_func_params)
                for itext_prev in text_prev_func_params:
                    operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN,itext_prev)
                    if operator == '':
                        accumulated_params.extend(text_prev_operator)
                    else:
                        text_prev_operator.append(operator)
                        accumulated_params.extend(text_prev_operator)
                    #accumulated_params.extend(text_prev_operator)
            else:
                #function name is the start of the string
                func_name = text_prev_function[0:].strip()
            #find the closure of parentesis
            function_close = text.rfind(')')
            #parse the next function and extend the current list of parameters
            next_func = text[function_open+1:function_close]
            func_params = {func_name : self.parseRecursive(next_func)}
            accumulated_params.append(func_params)
            #
            # parameters after the function
            #
            new_text = text[function_close+1:]
            accumulated_params.extend(self.parseRecursive(new_text))
        else:
            #there is no other function nested
            split_text = text.split(',')
            current_func_params = list(filter(lambda x: len(x.strip())>0,split_text))
            current_func_params = [x.strip() for x in current_func_params]
            accumulated_params.extend(current_func_params)
        #accumulated_params = list(filter(lambda x: len(x.strip())>0,accumulated_params))
        return accumulated_params

text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
print(obj.parseRecursive(text))

You can use pyparsing to handle this case.
* pyparsing can be installed with pip install pyparsing

Code:

import pyparsing as pp

# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)

# A recursive function to transform a pyparsing result into your desirable format
def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack

# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"

# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)

# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]

# Show the result
print(result)

Output:

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
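If the strings in that result later need to become tuples that record what each item is, a small post-processing pass over the nested structure might be enough. This is only a sketch: the `tag` function and the labels 'operator', 'string', 'variable', and 'number' are hypothetical, and it assumes that anything which is not an operator, a quoted string, or a !variable is a number.

```python
# Operator set taken from the question's COMPARATOR_SIGN
OPERATORS = {'@=', '@<>', '<>', '>=', '<=', '=', '>', '<'}


def tag(node):
    """Recursively replace each leaf string with a (kind, value) tuple."""
    if isinstance(node, dict):
        # a function call: keep the name, tag every argument
        return {name: [tag(arg) for arg in args] for name, args in node.items()}
    if node in OPERATORS:
        return ('operator', node)
    if node.startswith("'"):
        return ('string', node)
    if node.startswith('!'):
        return ('variable', node)
    # assumption: everything else is a number
    return ('number', node)


result = [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
tagged = [tag(node) for node in result]
```

The nesting stays the same; only the leaves change, so code that walks the tree by function name keeps working.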

Notes:

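  • If there are unbalanced parentheses inside () (e.g. a(b(c)a(b)c)), you will get unexpected results or an IndexError, so be careful in that case.
  • For now, only one sample is available for building the parsing pattern. If you run into parsing errors, please add more samples to your question.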
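Regarding the first note, one way to guard against unbalanced input is to check the parentheses before handing the string to the parser. The helper below is a hypothetical sketch, not part of pyparsing; it assumes that parentheses inside single-quoted strings should be ignored and that strings contain no escaped quotes.

```python
def parens_balanced(text):
    """Return True if every '(' outside a quoted string has a matching ')'."""
    depth = 0
    in_string = False
    for ch in text:
        if ch == "'":
            # toggle string state; escaped quotes are not handled (assumption)
            in_string = not in_string
        elif not in_string:
            if ch == '(':
                depth += 1
            elif ch == ')':
                depth -= 1
                if depth < 0:
                    # a ')' appeared before its '('
                    return False
    return depth == 0 and not in_string


sample = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
ok = parens_balanced(sample)
```

Calling this first and raising a clear error when it returns False avoids a less obvious failure further down.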