Getting function content and function name in C with a regular expression in Python

Question

I am trying to get the content (body) of a function if the function's name matches a defined pattern.

What I have tried so far:

(Step 1) Use a recursive pattern to get all function bodies in a given C file: {(?:[^{}]+|(?R))*+}

(Step 2) Find all matches of the wanted function's name.

(Step 3) Combine both steps. This is where I am struggling.

Input

TASK(arg1)
{
    if (cond)
    {
      /* Comment */
      function_call();
      if(condIsTrue)
      {
         DoSomethingelse();
      }
    }
    if (cond1)
    {
      /* Comment */
      function_call1();
    }
}


void FunctionIDoNotWant(void)
{
    if (cond)
    {
      /* Comment */
      function_call();
    }
    if (cond1)
    {
      /* Comment */
      function_call1();
    }
}

I am looking for the function TASK. When I add a regex that matches TASK in front of "{(?:[^{}]+|(?R))*+}", nothing matches anymore.

(TASK\s*\(.*?\)\s)({((?>[^{}]+|(?R))*)})

Desired output

Group1:
   TASK(arg1)
Group2:
    if (cond)
    {
      /* Comment */
      function_call();
      if(condIsTrue)
      {
         DoSomethingelse();
      }
    }
    if (cond1)
    {
      /* Comment */
      function_call1();
    }

Answer 1

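This problem is a bit complicated and may depend on the input. It can be solved partly with a regular expression and partly with scripting. For instance, we could start with an expression that matches across newlines, such as: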

(TASK.+)\s*({[\s\S]*})\s*void
(TASK.+)\s*({[\w\W]*})\s*void
(TASK.+)\s*({[\d\D]*})\s*void

Here we have a start boundary, which captures our first desired output:

(TASK.+)

and a left and a right boundary around our second desired output:

\s*({[\s\S]*})\s*void

The right boundary may need to change, depending on what follows the function in the input:

\s*void
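
The three variants above differ only in the character class used to match across newlines. As a minimal sketch (the sample text and variable names are made up for illustration), Python's re.DOTALL flag should give the same effect, provided the header part is kept from running across lines:

```python
import re

# Hypothetical sample input, shortened from the question's example.
text = "TASK(arg1)\n{\n    call();\n}\n\nvoid Other(void)\n{\n}\n"

# [\s\S] matches any character, newlines included.
with_class = re.search(r"(TASK.+)\s*({[\s\S]*})\s*void", text)

# With re.DOTALL the dot also matches newlines, so the header
# uses [^\n]+ instead of .+ to stay on a single line.
with_flag = re.search(r"(TASK[^\n]+)\s*({.*})\s*void", text, re.DOTALL)

print(with_class.groups() == with_flag.groups())
```

Either form still depends on the word void following the function, which is the right boundary discussed above.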

RegEx

If this expression is not what you want and you wish to modify it, visit regex101.com.

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(TASK.+)\s*({[\s\S]*})\s*void"

test_str = ("TASK(arg1)\n"
    "{\n"
    "    if (cond)\n"
    "    {\n"
    "      /* Comment */\n"
    "      function_call();\n"
    "      if(condIsTrue)\n"
    "      {\n"
    "         DoSomethingelse();\n"
    "      }\n"
    "    }\n"
    "    if (cond1)\n"
    "    {\n"
    "      /* Comment */\n"
    "      function_call1();\n"
    "    }\n"
    "}\n\n\n"
    "void FunctionIDoNotWant(void)\n"
    "{\n"
    "    if (cond)\n"
    "    {\n"
    "      /* Comment */\n"
    "      function_call();\n"
    "    }\n"
    "    if (cond1)\n"
    "    {\n"
    "      /* Comment */\n"
    "      function_call1();\n"
    "    }\n"
    "}")

matches = re.finditer(regex, test_str, re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {} was found at {}-{}: {}".format(
        match_num, match.start(), match.end(), match.group()))

    # Groups are numbered from 1; group 0 is the whole match.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {} found at {}-{}: {}".format(
            group_num, match.start(group_num), match.end(group_num),
            match.group(group_num)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
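
Because the right boundary above is the literal word void, the expression stops working when TASK is followed by a function with another return type, or by nothing at all. One possible alternative, sketched here under the assumption that the function's own closing brace is at column 0 while all nested braces are indented (as in the question's input), is to use that brace as the right boundary. The pattern and the sample source are illustrative, not part of the original answer:

```python
import re

# Assumption: the function's closing brace sits at column 0,
# while every nested closing brace is indented.
PATTERN = re.compile(
    r"^(TASK\s*\([^)]*\))\s*\{\n(.*?)^\}",
    re.MULTILINE | re.DOTALL,
)

# Hypothetical sample input.
source = """TASK(arg1)
{
    if (cond)
    {
      function_call();
    }
}

int Other(void)
{
    return 0;
}
"""

match = PATTERN.search(source)
if match:
    header, body = match.group(1), match.group(2)
    print(header)
    print(body)
```

This trades one assumption (what follows the function) for another (how the file is indented), so it only suits consistently formatted sources.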

Answer 2

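This cannot be done with regular expressions alone: a regular expression cannot count opening and closing braces ({ }), at least not without some unusual extensions.

Try this code (assuming start is the index of the opening brace that follows the function header you are looking for):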

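import re

# test_str holds the C source; start is the index of the opening
# brace that follows the function header you are looking for.
i = start + 1
c = 1  # current brace nesting depth
r = re.compile(r'[{}]')
while c > 0:
    m = r.search(test_str, i)
    if not m:
        break  # ran out of input: braces are unbalanced
    if m.group(0) == '{':
        c += 1
    else:
        c -= 1
    i = m.end(0)  # m.end() already points just past the matched brace
if c == 0:
    print(test_str[start:i])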

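It starts iterating over your source code right after the function header you are looking for and counts the opening ({) and closing (}) braces. Note that macros can also introduce such braces. In that case you may have to make the compiler emit the source after macro substitution, and how to do that depends on the compiler.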
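
The brace counting described above can be wrapped into a self-contained function. The following is only a sketch: the function name extract_body and the header pattern used to locate the opening brace are assumptions made for illustration, and braces inside comments, string literals, or macros would be miscounted.

```python
import re

def extract_body(source, name):
    """Return (header, body) for the first function called `name`, or None."""
    # Hypothetical header pattern: name, parenthesised arguments, then '{'.
    header = re.search(r"\b" + re.escape(name) + r"\s*\([^)]*\)\s*\{", source)
    if header is None:
        return None
    depth = 1                  # we are just inside the opening brace
    pos = header.end()         # first character after the opening brace
    brace = re.compile(r"[{}]")
    while depth > 0:
        m = brace.search(source, pos)
        if m is None:
            return None        # unbalanced braces
        depth += 1 if m.group() == "{" else -1
        pos = m.end()
    header_text = source[header.start():header.end() - 1].strip()
    body = source[header.end():pos - 1]  # text between the outer braces
    return header_text, body
```

Calling extract_body(test_str, "TASK") on the question's input should return the TASK(arg1) header together with the text between its outer braces.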

Answer 3

You are recursing the whole pattern with (?R), which is the same as (?0), whereas you want to recurse (?2), the second group. The first group contains your (TASK...) part.

See this demo at regex101

(TASK\s*\(.*?\)\s)({((?>[^{}]+|(?2))*)})
                  ^ here starts the second group -> recursion with (?2)
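
Note that Python's built-in re module does not support recursion such as (?R) or (?2); the third-party regex package does. The sketch below assumes that package is installed (pip install regex). The helper name find_task is made up for illustration, and the function simply returns None when the package is missing:

```python
try:
    import regex  # third-party package, not the standard-library re
except ImportError:
    regex = None

# (?2) recurses into group 2 only, so nested braces need no TASK(...) prefix.
PATTERN = r"(TASK\s*\(.*?\)\s)({((?>[^{}]+|(?2))*)})"

def find_task(source):
    """Return (header, body) for the first TASK(...) block, or None."""
    if regex is None:
        return None  # the regex package is not installed
    match = regex.search(PATTERN, source)
    if match is None:
        return None
    # Group 1 is the header; group 3 is the text between the outer braces.
    return match.group(1).strip(), match.group(3)
```

Group 2 holds the body including its outer braces, so use match.group(2) instead if the braces should be kept.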
