Parsing a string containing code into a list / tree in Python

Question

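As the title says, I am trying to parse a piece of code into a tree or a list. First of all, thank you for any contribution and time spent on this. So far my code does what I expect, but I am not sure it is the best or most generic way to do it.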

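Problem

1. I would like a more generic solution, because I will need to analyze this syntax further in the future.
2. I currently cannot split off operators such as "=" or ">=", as the output shared below shows. Later I may change the contents of the list/tree from strings to tuples so that I can identify the kind of each item (parameter, comparison such as = or >=, ...). That is not a real need right now, though.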

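Research

My first attempt was to parse the text character by character, but the code became so messy that it was almost unreadable, so I assumed I was doing something wrong (I no longer have that code to share). I then looked at how other people approach this and found some methods that do not necessarily meet my requirements of simplicity and generality. I would share links to those sites, but I did not keep track of them.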

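Syntax of the code

The syntax is very simple; I am not interested in types or any further detail, only in functions and parameters. Strings are written as 'my string', variables as !variable, and numbers as in any other language. Here is a code sample:


db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)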
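
One possible way to get the operators separated (point 2 above) is to tokenize the text before building any tree. The following is only a minimal sketch using the standard `re` module. The names `TOKEN_SPEC`, `TOKEN_RE`, `tokenize` and the token kinds are illustrative and not part of the original post. It assumes strings contain no escaped quotes, numbers are unsigned decimals, and variable names never contain operator characters.

```python
import re

# Token kinds and their patterns (all names here are illustrative assumptions).
# Multi-character operators are listed before their single-character prefixes
# so that '>=' is not split into '>' and '='.
TOKEN_SPEC = [
    ("STRING",   r"'[^']*'"),
    ("VARIABLE", r"![^(),=<>@']+"),      # may contain spaces, e.g. !Element Structure
    ("NUMBER",   r"\d+(?:\.\d+)?"),
    ("OPERATOR", r"@<>|@=|<>|>=|<=|=|>|<"),
    ("NAME",     r"[A-Za-z_]\w*"),
    ("LPAREN",   r"\("),
    ("RPAREN",   r"\)"),
    ("COMMA",    r","),
    ("SKIP",     r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (kind, text) tuples; raise ValueError on unknown input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = match.end()
        if match.lastgroup != "SKIP":
            # strip() removes the trailing space a VARIABLE may pick up
            tokens.append((match.lastgroup, match.group().strip()))
    return tokens


for token in tokenize("if(ATTRS('Dim 1', !Element Structure, 'ID') >= '3', 6)"):
    print(token)
```

Because every token already carries its kind, the later step of telling parameters from comparison operators becomes a lookup instead of another round of string splitting.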

My output

My output is only partially correct: I still cannot split off the "= '3'" part. It has to be split, because in this case the = is a comparison operator and not part of a string.

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "= '3'", "'4'", "'5'"]}, '6']}]

Desired output


[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, "=", "'3'", "'4'", "'5'"]}, '6']}]
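
For reference, a small recursive-descent parser is one possible way to produce exactly this structure with only the standard library. This is a sketch under assumptions, not a definitive implementation: `LEXEME_RE`, `parse` and `_parse_items` are hypothetical names, strings are assumed to contain no escaped quotes, and `re.findall` silently skips any character that matches no alternative (for example a lone unmatched quote).

```python
import re

# One alternative per lexeme: quoted string, !variable (may contain spaces),
# operators (longest first), punctuation, or a bare word/number.
LEXEME_RE = re.compile(
    r"'[^']*'|![^(),=<>@']+|@<>|@=|<>|>=|<=|[=<>]|[(),]|[^\s(),=<>@']+"
)


def parse(text):
    """Parse text into the nested list/dict layout shown in the desired output."""
    tokens = [t.strip() for t in LEXEME_RE.findall(text)]
    items, pos = _parse_items(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unbalanced ')'")
    return items


def _parse_items(tokens, pos):
    """Collect items until a ')' or the end of input; return (items, next_pos)."""
    items = []
    while pos < len(tokens) and tokens[pos] != ")":
        token = tokens[pos]
        if token == ",":
            pos += 1                      # commas only separate items
        elif token == "(":
            raise ValueError("'(' without a function name")
        elif pos + 1 < len(tokens) and tokens[pos + 1] == "(":
            # 'name(' starts a call: parse its arguments recursively
            args, pos = _parse_items(tokens, pos + 2)
            if pos >= len(tokens):
                raise ValueError(f"missing ')' for {token!r}")
            pos += 1                      # skip the closing ')'
            items.append({token: args})
        else:
            items.append(token)
            pos += 1
    return items, pos


sample = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
print(parse(sample))
```

Tracking the position in a flat token list means each closing parenthesis is paired with its own opening one, so sibling calls such as f(a), g(b) are handled without searching for the last ")" in the text.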

My code so far

The parseRecursive method is the entry point.


import re

class FileParser:

    #order is important to avoid bad splits, so an ordered list is used
    #(a set literal would not preserve this order)
    COMPARATOR_SIGN = [
        '@='
        ,'@<>'
        ,'<>'
        ,'>='
        ,'<='
        ,'='
        ,'>'
        ,'<'
    ]

    def __init__(self):
        pass

    def __charExistsInOccurences(self,current_needle, needles, text):
        """
        check if other needles are present in text
        current_needle : string -> the current needle being evaluated
        needles : list -> list of needles
        text : string/list<string> -> a string or a list of string to evaluate
        """
        #if text is a string convert it to list of strings
        text = text if isinstance(text, list) else [text]
        
        exists = False

        for t in text:
            #check if needle is inside text value
            for needle in needles:
                    #dont check the same key
                    if needle != current_needle:
                        regex_search_needle = r'\s*' + r'\s*'.join(needle) + r'\s*'
                        #list of 1's and 0's . 1 if another character is found in the string.
                        found = [1 if re.search(regex_search_needle, x) else 0 for x in t]
                        if sum(found) > 0:
                            exists = True
                            break

        return exists
        

    def findOperator(self, needles, haystack):
        """
        split parameters from operators
        needles : list -> list of operators
        haystack : string
        """
        string_open = haystack.find("'")
        
        #if no string has been found set the index to 0
        if string_open < 0:
            string_open = 0

        occurences = []

        string_closure = haystack.rfind("'")
        operator = ''
        for needle in needles:
            #regex to ignore the possible spaces between characters of the needle
            split_regex = r'\s*' + r'\s*'.join(needle) + r'\s*'
            
            #parse parameters before and after the string
            before_string = re.split(split_regex, haystack[0:string_open])
            after_string = re.split(split_regex, haystack[string_closure+1:])


            #check if any other needle exists in the results found
            before_string_exists = self.__charExistsInOccurences(needle, needles, before_string)
            after_string_exists = self.__charExistsInOccurences(needle, needles, after_string)

            #if the operator has been found merge the results with the occurences and assign the operator
            if not before_string_exists and not after_string_exists:
                occurences.extend(before_string)
                occurences.extend([haystack[string_open:string_closure+1]])
                occurences.extend(after_string)
                operator = needle
        
        #filter blank spaces generated
        occurences = list(filter(lambda x: len(x.strip())>0,occurences))
        result_check = [1 if x==haystack else 0 for x in occurences]
        #if the haystack was originally a simple string like '1', the occurences list is filled with the same value over and over by the before-string and after-string parts
        if len(result_check) == sum(result_check):
            occurences= [haystack]
            operator = ''

        return operator, occurences
 




    def parseRecursive(self,text):
        """
        parse a block of text
        text : string 
        """

        #empty text is allowed: the recursion passes on whatever follows the
        #last ')' (possibly nothing), and an empty string simply yields []

        function_open = text.find('(')
        accumulated_params = []
        if function_open > -1:
            #there is another function nested
            text_prev_function = text[0:function_open]
            
            #find last space, comma or equals sign to retrieve the function name
            last_space = -1
            for j in range(len(text_prev_function)-1, 0 , -1):
                if text_prev_function[j] == ' ' or text_prev_function[j] == ',' or text_prev_function[j] == '=':
                    last_space = j
                    break

            func_name = ''

            if last_space > -1:
                #there is something else behind the function name
                func_name = text_prev_function[last_space+1:]
                #no parenthesis before, so the characters preceding the function name are parameters
                text_prev_func_params = list(filter(lambda x: len(x.strip())>0,text_prev_function[:last_space+1].split(',')))
                text_prev_func_params = [x.strip() for x in text_prev_func_params]
                #debug here
                #accumulated_params.extend(text_prev_func_params)

                for itext_prev in text_prev_func_params:
                    operator, text_prev_operator = self.findOperator(self.COMPARATOR_SIGN,itext_prev)
                    if operator == '':
                        accumulated_params.extend(text_prev_operator)
                    else:
                        text_prev_operator.append(operator)
                        accumulated_params.extend(text_prev_operator)
                    
                #accumulated_params.extend(text_prev_operator)
            else:
                #function name is the start of the string
                func_name = text_prev_function[0:].strip()
            
            #find the closing parenthesis
            function_close = text.rfind(')')
            #parse the next function and extend the current list of parameters
            next_func = text[function_open+1:function_close]
            func_params = {func_name : self.parseRecursive(next_func)}
            accumulated_params.append(func_params)

            #
            # parameters after the function 
            #
            new_text = text[function_close+1:]
            accumulated_params.extend(self.parseRecursive(new_text))
        else:
            #there is no other function nested
            split_text = text.split(',')
            current_func_params = list(filter(lambda x: len(x.strip())>0,split_text))
            current_func_params = [x.strip() for x in current_func_params]
            accumulated_params.extend(current_func_params)
        
        #accumulated_params = list(filter(lambda x: len(x.strip())>0,accumulated_params))
        return accumulated_params

text = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"
obj = FileParser()
print(obj.parseRecursive(text))

Answer 1

You can use pyparsing for this case.
* pyparsing can be installed with pip install pyparsing

Code:

import pyparsing as pp

# A parsing pattern
w = pp.Regex(r'(?:![^(),]+)|[^(), ]+') ^ pp.Suppress(',')
pattern = w + pp.nested_expr('(', ')', content=w)

# A recursive function to transform a pyparsing result into your desirable format
def transform(elements):
    stack = []
    for e in elements:
        if isinstance(e, list):
            key = stack.pop()
            stack.append({key: transform(e)})
        else:
            stack.append(e)
    return stack

# A sample
string = "db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)"

# Operations to parse the sample string
elements = pattern.parse_string(string).as_list()
result = transform(elements)

# Assertion
assert result == [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]

# Show the result
print(result)

Output:

[{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]
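
Since the question mentions analyzing the syntax further later on, here is a hedged sketch of how the nested result could be walked. `iter_calls` is a hypothetical helper, not part of pyparsing, and the `result` literal simply repeats the output above so that the snippet runs on its own.

```python
def iter_calls(tree, depth=0):
    """Yield (depth, function_name, item_count) for every call in the tree."""
    for node in tree:
        if isinstance(node, dict):
            for name, items in node.items():
                # item_count includes operators such as '=', not only arguments
                yield depth, name, len(items)
                yield from iter_calls(items, depth + 1)


# Same structure as the output shown above
result = [{'db': ["'1'", "'2'", {'if': [{'ATTRS': ["'Dim 1'", '!Element Structure', "'ID'"]}, '=', "'3'", "'4'", "'5'"]}, '6']}]

for depth, name, count in iter_calls(result):
    print("  " * depth + f"{name} ({count} items)")
```

Each dict in the result has a single key (the function name), so a plain recursive generator is enough to visit every call together with its nesting level.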

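Note:

If the parentheses are unbalanced (e.g. a(b(c), a(b)c), etc.), you will get an unexpected result or an IndexError will be raised, so be careful in that case.
For now there is only one sample available for building the parsing pattern. If you run into a parse error, please add more samples to your question.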
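
To guard against the unbalanced-parentheses case mentioned above, the input could be checked before parsing. The following is a minimal sketch. `check_balanced` is a hypothetical helper, and it assumes single-quoted strings without escaped quotes, as in the sample.

```python
def check_balanced(text):
    """Raise ValueError if parentheses outside quoted strings are unbalanced."""
    depth = 0
    in_string = False
    for index, char in enumerate(text):
        if char == "'":
            in_string = not in_string      # toggle on every single quote
        elif in_string:
            continue                       # parentheses inside strings do not count
        elif char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unexpected ')' at position {index}")
    if in_string:
        raise ValueError("unterminated string")
    if depth > 0:
        raise ValueError(f"{depth} unclosed '('")


# Passes silently for the sample from the question
check_balanced("db('1', '2', if(ATTRS('Dim 1', !Element Structure, 'ID') = '3','4','5'), 6)")

# Reports the problem instead of producing an unexpected parse result
for bad in ("a(b(c)", "a(b)c)"):
    try:
        check_balanced(bad)
    except ValueError as error:
        print(f"{bad!r}: {error}")
```

Calling this before `pattern.parse_string(...)` turns a confusing downstream failure into an explicit error message that points at the problem.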
