Extract all variables from a string of Python code (regex or AST)

Question

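I want to find and extract all variables in a string that contains Python code. I only want the variables (and subscripted variables), not function calls.

For example, from the following string: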

code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'

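I want to extract: foo, bar[1], baz[1:10:var1[2+1]], var1[2+1], qux[[1,2,int(var2)]], var2, bob[len("foobar")], var3[0]. Note that some variables may be nested. For example, from baz[1:10:var1[2+1]] I want to extract both baz[1:10:var1[2+1]] and var1[2+1].

The first two ideas that came to mind were a regex or the AST. I tried both, without success.

With the regex, to keep things simple, I thought it would be a good idea to extract the top-level variables first and then recurse into the nested ones. Unfortunately, I could not even get that far.

This is what I have so far: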

import re

regex = r'[_a-zA-Z]\w*\s*(\[.*\])?'
for match in re.finditer(regex, code):
    print(match)

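Here is a demo: https://regex101.com/r/INPRdN/2

Another option is to use the AST: subclass ast.NodeVisitor and implement the visit_Name and visit_Subscript methods. However, that does not work either, because visit_Name is also called for the names of the functions being called.

I would appreciate it if someone could provide a solution to this problem (regex or AST).

Thanks.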

Answer 1

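Regex is not a powerful enough tool for this. If your nesting depth is bounded, there are some hacky workarounds that let you build a complicated regex to do what you are looking for, but I would not recommend it.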
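To make the nesting problem concrete, here is a minimal sketch. It runs the pattern from the question (and a lazy variant of it) on two short strings, then shows the kind of depth counting that a fixed regular expression cannot do for arbitrary nesting. The helper name balanced_subscript_end is made up for illustration, and the sketch assumes that no brackets occur inside string literals.

```python
import re

# Greedy `.*` runs on to the LAST `]`, so two separate subscripts are merged.
greedy = r'[_a-zA-Z]\w*\s*(\[.*\])?'
greedy_matches = [m.group(0) for m in re.finditer(greedy, 'bar[1] + baz[2]')]
# greedy_matches == ['bar[1] + baz[2]']

# Lazy `.*?` stops at the FIRST `]`, so a nested subscript is cut short.
lazy = r'[_a-zA-Z]\w*\s*(\[.*?\])?'
lazy_matches = [m.group(0) for m in re.finditer(lazy, 'baz[1:var1[2]]')]
# lazy_matches == ['baz[1:var1[2]']


def balanced_subscript_end(text, open_idx):
    """Return the index just past the `]` that closes the `[` at open_idx.

    Returns -1 if the bracket is never closed. Hypothetical helper:
    brackets inside string literals are NOT handled.
    """
    depth = 0
    for i in range(open_idx, len(text)):
        if text[i] == '[':
            depth += 1
        elif text[i] == ']':
            depth -= 1
            if depth == 0:
                return i + 1
    return -1


nested = 'baz[1:10:var1[2+1]] + qux'
end = balanced_subscript_end(nested, nested.index('['))
print(nested[:end])  # baz[1:10:var1[2+1]]
```

The counter has to remember how many brackets are still open, which is exactly the state a classic regular expression does not have.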

This question is asked a lot, and the linked response is famous for demonstrating the difficulty of what you are trying to do.

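If you really must parse a string as code, the AST would technically work, but I don't know of a library that helps with this. You are better off trying to build a recursive function to do the parsing.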
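For the AST route, the standard-library ast module may be enough on its own. The sketch below is one possible approach, not a definitive implementation: it overrides visit_Call so that the name being called is not recorded, and it uses ast.get_source_segment (available from Python 3.8) to recover the source text of each subscript. The class and function names (VariableCollector, extract_variables) are invented for this example.

```python
import ast


class VariableCollector(ast.NodeVisitor):
    """Collect bare names and subscript expressions, skipping called names."""

    def __init__(self, source):
        self.source = source
        self.found = []

    def visit_Call(self, node):
        # Skip the callee when it is a plain name (e.g. `func` in `func()`),
        # but still look inside the arguments for variables.
        if not isinstance(node.func, ast.Name):
            self.visit(node.func)
        for arg in node.args:
            self.visit(arg)
        for keyword in node.keywords:
            self.visit(keyword.value)

    def visit_Subscript(self, node):
        # Record the whole subscript exactly as it appears in the source.
        self.found.append(ast.get_source_segment(self.source, node))
        # The base name is already part of the subscript text, so only
        # descend into it when it is something more complex than a name.
        if not isinstance(node.value, ast.Name):
            self.visit(node.value)
        # Nested variables live in the index or slice.
        self.visit(node.slice)

    def visit_Name(self, node):
        self.found.append(node.id)


def extract_variables(source):
    """Return variables and subscripted variables in source order."""
    collector = VariableCollector(source)
    collector.visit(ast.parse(source))
    return collector.found


code = ('foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + '
        'bob[len("foobar")] + func() + func2 (var3[0])')
print(extract_variables(code))
```

On the string from the question this should print the eight expressions listed there, in the same order. Because the tree is built by Python's own parser, brackets inside string literals and arbitrary nesting depth need no special handling.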

Answer 2

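I found your question an interesting challenge, so here is some code that does what you want. Doing this with regex alone is not possible because of the nested expressions, so this solution combines regex with string manipulation to handle them: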

# -*- coding: utf-8 -*-
import re
RE_IDENTIFIER = r'\b[a-z]\w*\b(?!\s*[\[\("\'])'
RE_INDEX_ONLY = re.compile(r'(##)(\d+)(##)')
RE_INDEX = re.compile(r'##\d+##')


def extract_expression(string):
    """Extract all identifiers and getitem expressions, in order of appearance."""

    def remove_brackets(text):
        # 1. handle bare `[...]` expressions: replace them with #{#...#}#
        # so we don't confuse them with word[...]
        pattern = r'(?<!\w)(\s*)(\[)([^\[]+?)(\])'
        # keep replacing until no bare bracket expression is left
        while re.search(pattern, text):
            # keep the leading whitespace (\1) and the bracket content (\3)
            text = re.sub(pattern, r'\1#{#\3#}#', text)
        return text

    def get_ordered_subexp(exp):
        """ get index of nested expression."""
        index = int(exp.replace('#', ''))
        subexp = RE_INDEX.findall(expressions[index])
        if not subexp:
            return exp
        return exp + ''.join(get_ordered_subexp(i) for i in subexp)

    def replace_expression(match):
        """Save the expression in the list and replace it with a special key holding its index in the list."""
        match_exp = match.group(0)
        current_index = len(expressions)
        expressions.append(None)  # reserve the slot so the expression is stored before its inner identifiers
        # if the expression contains identifiers, extract them too.
        if re.search(RE_IDENTIFIER, match_exp) and '[' in match_exp:
            match_exp = re.sub(RE_IDENTIFIER, replace_expression, match_exp)
        expressions[current_index] = match_exp
        return '##{}##'.format(current_index)

    def fix_expression(match):
        """ replace the match by the corresponding expression using the index"""
        return expressions[int(match.group(2))]

    # list that will hold the extracted expressions
    expressions = []

    string = remove_brackets(string)

    # 2. extract all expressions and keep track of their place in the original code
    pattern = r'\w+\s*\[[^\[]+?\]|{}'.format(RE_IDENTIFIER)
    # keep extracting expressions until none are left
    while re.search(pattern, string):
        # every expression that is extracted is replaced by a special key
        string = re.sub(pattern, replace_expression, string)
        # sometimes the brackets themselves contain a getitem expression,
        # so after extracting it we handle the brackets again
        string = remove_brackets(string)

    # 3. build the correct result with extracted expressions
    result = [None] * len(expressions)
    for index, exp in enumerate(expressions):
        # keep replacing special keys with the correct expression
        while RE_INDEX_ONLY.search(exp):
            exp = RE_INDEX_ONLY.sub(fix_expression, exp)
        # finally we don't forget about the brackets
        result[index] = exp.replace('#{#', '[').replace('#}#', ']')

    # 4. Order the indexes that were extracted
    ordered_index = ''.join(get_ordered_subexp(exp) for exp in RE_INDEX.findall(string))
    # convert them to integers
    ordered_index = [int(index[1]) for index in RE_INDEX_ONLY.findall(ordered_index)]

    # 5. fix the order of expressions using the ordered indexes
    final_result = []
    for exp_index in ordered_index:
        final_result.append(result[exp_index])

    # for debug:
    # print('final string:', string)
    # print('expression :', expressions)
    # print('order_of_expresion: ', ordered_index)
    return final_result


code = 'foo + bar[1] + baz[1:10:var1[2+1]] + qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 (var3[0])'
code2 = 'baz[1:10:var1[2+1]]'
code3 = 'baz[[1]:10:var1[2+1]:[var3[3+1*x]]]'
print(extract_expression(code))
print(extract_expression(code2))
print(extract_expression(code3))

Output:

['foo', 'bar[1]', 'baz[1:10:var1[2+1]]', 'var1[2+1]', 'qux[[1,2,int(var2)]]', 'var2', 'bob[len("foobar")]', 'var3[0]']
['baz[1:10:var1[2+1]]', 'var1[2+1]']
['baz[[1]:10:var1[2+1]:[var3[3+1*x]]]', 'var1[2+1]', 'var3[3+1*x]', 'x']

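I tested this code against very complex examples and it works well. Note that the extraction order is the same as the one you asked for. I hope this is what you need.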

Answer 3

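This answer may come too late, but this can be done with the Python regex package (the third-party regex module, not the built-in re), which supports recursive patterns.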

import regex

code = '''foo + bar[1] + baz[1:10:var1[2+1]] + 
qux[[1,2,int(var2)]] + bob[len("foobar")] + func() + func2 
(var3[0])'''
p = r'(\b[a-z]\w*\b(?!\s*[\(\"])(\[(?:[^\[\]]|(?2))*\])?)'
# overlapped=True is needed to also capture matches that sit inside another match, such as 'var1[2+1]'
result = regex.findall(p, code, overlapped=True)
# each result is a tuple of two items because the pattern has two capture groups; keep only the first
print([x[0] for x in result])

Output:
['foo','bar[1]','baz[1:10:var1[2+1]]','var1[2+1]','qux[[1,2,int(var2)]]','var2','bob[len("foobar")]','var3[0]']

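Pattern explanation:

( # start of the first capture group
\b[a-z]\w*\b # variable name, e.g. 'bar'
(?!\s*[\(\"]) # negative lookahead: skip names followed by '(' or '"', such as function names or the foobar in "foobar"
(\[(?:[^\[\]]|(?2))*\]) # second capture group: captures nested groups in '[ ]',
# e.g. '[1:10:var1[2+1]]'.
# '(?2)' refers recursively to the second capture group
? # the second capture group is optional, so a bare name like 'foo' is also captured
) # end of the first group.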
