Use regex recursively taking indentation level into account

Question

I am trying to parse a custom input file for a simulation code I am writing. It consists of nested "objects" with properties and values (see link).

Here is an example file and the regex I am currently using.

([^:#\n]*):?([^#\n]*)#?.*\n

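Each match is one line, with two capture groups: one for the property and one for its value. The character sets also exclude "#" and ":", because those are the comment delimiter and the property:value separator, respectively.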
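To make the behaviour of this pattern concrete, here is a small sketch in Python, whose `re` module accepts the same pattern unchanged. The sample lines are a hypothetical excerpt modelled on the example file. Note that group 1 keeps the leading whitespace, so the indentation of each line can still be recovered from the match.

```python
import re

# The pattern from the question: group 1 is the property, group 2 the value;
# anything after '#' is treated as a comment and discarded.
LINE_RE = re.compile(r"([^:#\n]*):?([^#\n]*)#?.*\n")

# Hypothetical sample lines, modelled on the example file.
sample = (
    "case    :\n"
    "    name    : tandem solar cell\n"
    "    options :   # a comment\n"
    "        verbose : true\n"
)

rows = []
for m in LINE_RE.finditer(sample):
    raw_key, raw_value = m.group(1), m.group(2)
    # Group 1 still contains the leading spaces, i.e. the indentation.
    indent = len(raw_key) - len(raw_key.lstrip(" "))
    rows.append((indent, raw_key.strip(), raw_value.strip()))

for row in rows:
    print(row)
```

Each tuple holds the indentation width, the property name, and the value (an empty string when the line only opens a nested object).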

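How can I modify the regex so that it matches the structure recursively? That is, if line n+1 has a higher indentation level than line n, it should be matched as a subgroup of the match for line n.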
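One way to get the nesting described here without making the regex itself recursive is to keep the per-line pattern and track indentation with a stack. The sketch below is in Python rather than Octave and is only an illustration: `parse_indented` is a hypothetical name, and it assumes that indentation uses spaces only and that the text ends with a newline.

```python
import re

LINE_RE = re.compile(r"([^:#\n]*):?([^#\n]*)#?.*\n")

def parse_indented(text):
    """Return a list of (key, value, children) tuples nested by indentation."""
    root = []
    # Stack of (indent, children_list); the -1 sentinel keeps root at the bottom.
    stack = [(-1, root)]
    for m in LINE_RE.finditer(text):
        raw_key, value = m.group(1), m.group(2).strip()
        key = raw_key.strip()
        if not key:  # blank or comment-only line
            continue
        indent = len(raw_key) - len(raw_key.lstrip(" "))
        # Pop back to the nearest less-indented line: that line is the parent.
        while stack[-1][0] >= indent:
            stack.pop()
        children = []
        stack[-1][1].append((key, value, children))
        stack.append((indent, children))
    return root

# Hypothetical excerpt with a repeated "layer" property.
sample = (
    "regions :\n"
    "    layer   : Glass\n"
    "        thick   : 80 nm\n"
    "    layer   : FTO  # second layer\n"
    "        thick   : 10 nm\n"
)
tree = parse_indented(sample)
```

Children are stored in lists rather than dictionaries so that repeated property names, such as several `layer` entries, are all preserved.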

I am working in Octave, which uses the PCRE regex format.

Answer 1

I asked whether you have control over the data format because the data is really easy to parse with YAML instead of regex.

The only problem is that the objects are not well formed:

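1) Take the regions object as an example. It has many properties, all named layer. I think your intention is to build a list of layers rather than many properties with the same name.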

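2) Now consider that each layer property already has a value of its own. Each layer is then followed by orphaned properties that I assume belong to that layer.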
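The two points above can be illustrated with plain Python containers. This is only a sketch of the underlying data-model issue, not of any particular parser, and the variable names are made up for the example.

```python
# A mapping can hold only one value per key, so repeated "layer" keys collide.
as_mapping = {}
for name in ["Glass", "FTO"]:
    as_mapping["layer"] = name  # the second assignment overwrites the first

# A list keeps every layer, with its own properties attached to it.
as_list = [
    {"layer": "Glass", "geometry": {"thick": "80 nm", "npoints": 10}},
    {"layer": "FTO", "geometry": {"thick": "10 nm", "npoints": 10}},
]

print(as_mapping)                           # only the last layer survives
print([item["layer"] for item in as_list])  # both layers are kept, in order
```

Making regions a list of layer objects resolves both points: no names collide, and each layer's geometry and optical properties sit inside the layer they describe.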

With these ideas in mind, if you form your objects according to YAML rules, parsing them becomes a breeze.

I know you are working in Octave, but consider the modifications I made to your data and how easy it then is to parse, in this case using Python.

The data you have now

case    : 
    name    : tandem solar cell
    options :
        verbose : true
        t_stamp : system
    units   :
        energy  : eV
        length  : nm
        time    : s
        tension : V
        temperature: K
        mqty    : mole
        light   : cd
    regions :
        layer   : Glass
            geometry:
                thick   : 80 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt
        layer   : FTO
            geometry:
                thick   : 10 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt

The data modified to conform to YAML syntax

case    : 
    name    : tandem solar cell
    options :
        verbose : true
        t_stamp : system # a sample comment
    units   :
        energy  : eV
        length  : nm
        time    : s
        tension : V
        temperature: K
        mqty    : mole
        light   : cd
    regions : 
        -   layer   : Glass # ADDED THE - TO MAKE IT A LIST OF LAYERS
            geometry :      # AND KEEP INDENTATION PROPERLY
                thick   : 80 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt
        -   layer   : FTO
            geometry:
                thick   : 10 nm
                npoints : 10
            optical :
                nk_file : vacuum.txt

With just these instructions, you can parse the object:

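import yaml  # PyYAML
# text is a string holding the YAML document shown above
data = yaml.safe_load(text)  # safe_load needs no explicit Loader argument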

""" your data would be parsed as:
{'case': {'name': 'tandem solar cell',
          'options': {'t_stamp': 'system', 'verbose': True},
          'regions': [{'geometry': {'npoints': 10, 'thick': '80 nm'},
                       'layer': 'Glass',
                       'optical': {'nk_file': 'vacuum.txt'}},
                      {'geometry': {'npoints': 10, 'thick': '10 nm'},
                       'layer': 'FTO',
                       'optical': {'nk_file': 'vacuum.txt'}}],
          'units': {'energy': 'eV',
                    'length': 'nm',
                    'light': 'cd',
                    'mqty': 'mole',
                    'temperature': 'K',
                    'tension': 'V',
                    'time': 's'}}}

"""
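Once the data is in this shape, walking over the layers is ordinary dictionary and list access. The sketch below writes out a subset of the parsed structure as a literal so that it runs without PyYAML. `split_quantity` is a hypothetical helper, and it assumes every thickness is written as a number, a space, and a unit.

```python
# Subset of the parsed structure shown above, written out as a literal.
data = {"case": {"regions": [
    {"layer": "Glass", "geometry": {"npoints": 10, "thick": "80 nm"}},
    {"layer": "FTO", "geometry": {"npoints": 10, "thick": "10 nm"}},
]}}

def split_quantity(text):
    """Hypothetical helper: split a string such as '80 nm' into (80.0, 'nm')."""
    number, unit = text.split()
    return float(number), unit

# Map each layer name to its thickness as a (value, unit) pair.
thicknesses = {
    layer["layer"]: split_quantity(layer["geometry"]["thick"])
    for layer in data["case"]["regions"]
}
print(thicknesses)
```

Values such as `80 nm` come back from the YAML parser as plain strings, so separating the number from the unit is left to the application.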
