Recursively search for paths in a dictionary containing a sub-string

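I'm trying to determine the fastest way to search a nested dictionary with a regular expression and return the path to every occurrence of the matching string. I'm only interested in values that are strings, not in values of any other type. Recursion is not my strong suit. Here is a sample JSON; suppose I'm looking for all absolute paths whose value contains 'blah'.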
d = {'id': 'abcde',
 'key1': 'blah',
 'key2': 'blah blah',
 'nestedlist': [{'id': 'qwerty',
   'nestednestedlist': [{'id': 'xyz', 'keyA': 'blah blah blah'},
    {'id': 'fghi', 'keyZ': 'blah blah blah'}],
   'anothernestednestedlist': [{'id': 'asdf', 'keyQ': 'blah blah'},
    {'id': 'yuiop', 'keyW': 'blah'}]}]}

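I found the following snippet, but I couldn't get it to return the paths instead of just printing them. Beyond that, adding "if the value is a string and re.search() matches it, append the path to a list" shouldn't be too hard.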

def search_dict(v, prefix=''):
    
    if isinstance(v, dict):
        for k, v2 in v.items():
            p2 = "{}['{}']".format(prefix, k)
            search_dict(v2, p2)
    elif isinstance(v, list):
        for i, v2 in enumerate(v):
            p2 = "{}[{}]".format(prefix, i)
            search_dict(v2, p2)
    else:
        print('{} = {}'.format(prefix, repr(v)))

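You just need to initialize an output list, append the elements found in the current call, and extend it with the results returned by the recursive calls.

Try this: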

def search_dict(v, prefix=''):
    result = []
    if isinstance(v, dict):
        for k, v2 in v.items():
            p2 = "{}['{}']".format(prefix, k)
            result.extend(search_dict(v2, p2))
    elif isinstance(v, list):
        for i, v2 in enumerate(v):
            p2 = "{}[{}]".format(prefix, i)
            result.extend(search_dict(v2, p2))
    else:
        result.append('{} = {}'.format(prefix, repr(v)))
    return result

Adam.Er8's answer is correct; I just wanted to answer the question more explicitly:

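import re

def search_dict(v, re_term, prefix=''):
    # re.compile returns an already compiled pattern unchanged, so the
    # recursive calls below can pass the compiled pattern straight back in
    re_term = re.compile(re_term)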
    result = []
    if isinstance(v, dict):
        for k, v2 in v.items():
            p2 = "{}['{}']".format(prefix, k)
            result.extend(search_dict(v2, re_term, prefix = p2))
    elif isinstance(v, list):
        for i, v2 in enumerate(v):
            p2 = "{}[{}]".format(prefix, i)
            result.extend(search_dict(v2, re_term, prefix = p2))
    elif isinstance(v, str) and re.search(re_term,v):
        result.append(prefix)
    return result
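
For illustration, here is a minimal usage sketch of the same idea. The helper name `find_string_paths` and the small `sample` dictionary are made up for this example and are not part of the answer above; the only structural difference is that the pattern is compiled once in an outer function rather than on every recursive call.

```python
import re

def find_string_paths(data, pattern):
    """Return bracket-style paths of string values that match pattern."""
    regex = re.compile(pattern)  # compile once, outside the recursion

    def walk(node, prefix):
        found = []
        if isinstance(node, dict):
            for key, child in node.items():
                found.extend(walk(child, "{}['{}']".format(prefix, key)))
        elif isinstance(node, list):
            for index, child in enumerate(node):
                found.extend(walk(child, "{}[{}]".format(prefix, index)))
        elif isinstance(node, str) and regex.search(node):
            found.append(prefix)  # only matching string leaves are recorded
        return found

    return walk(data, '')

# Hypothetical sample data; the integer value is skipped because it is not a str
sample = {'id': 'abcde', 'key1': 'blah', 'nested': [{'keyA': 'blah blah', 'n': 3}]}
print(find_string_paths(sample, r'blah'))
# ["['key1']", "['nested'][0]['keyA']"]
```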

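Both answers here compute the results eagerly, exhausting the entire input dictionary before returning the first (if any) available result. We can use yield from to write a more Pythonic program: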

def search_substr(t = {}, q = ""):
  def loop(t, path):
    if isinstance(t, dict):
      for k, v in t.items():
        yield from loop(v, (*path, k))  # <- recur
    elif isinstance(t, list):
      for k, v in enumerate(t):
        yield from loop(v, (*path, k))  # <- recur
    elif isinstance(t, str):
      if q in t:
        yield path, t                   # <- output a match
  yield from loop(t, ())                # <- init

for (path, value) in search_substr(d, "blah"):
  print(path, value)

Result:

('key1',) blah
('key2',) blah blah
('nestedlist', 0, 'nestednestedlist', 0, 'keyA') blah blah blah
('nestedlist', 0, 'nestednestedlist', 1, 'keyZ') blah blah blah
('nestedlist', 0, 'anothernestednestedlist', 0, 'keyQ') blah blah
('nestedlist', 0, 'anothernestednestedlist', 1, 'keyW') blah
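
Because the search is a generator, a caller can stop after the first match or take only a few results without walking the rest of the structure. The sketch below shows this with `next` and `itertools.islice`; `iter_matches` and `sample` are hypothetical names used only for this example, and the function is a flattened variant of the generator shown above.

```python
from itertools import islice

def iter_matches(node, query, path=()):
    """Lazily yield (path, value) for each string value containing query."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_matches(child, query, (*path, key))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from iter_matches(child, query, (*path, index))
    elif isinstance(node, str) and query in node:
        yield path, node

# Hypothetical sample data
sample = {'key1': 'blah', 'items': [{'keyA': 'blah blah'}, {'keyB': 'nope'}]}

# Stop at the first match; the default None covers the no-match case
first = next(iter_matches(sample, 'blah'), None)
print(first)
# (('key1',), 'blah')

# Take at most two matches
print(list(islice(iter_matches(sample, 'blah'), 2)))
# [(('key1',), 'blah'), (('items', 0, 'keyA'), 'blah blah')]
```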

Note that we use q in t to test for the substring q in the target t. If you really want to use a regular expression for this:

from re import compile

def search_re(t = {}, q = ""):
  def loop(t, re, path):                      # <- add re
    if isinstance(t, dict):
      for k, v in t.items():
        yield from loop(v, re, (*path, k))    # <- carry re
    elif isinstance(t, list):
      for k, v in enumerate(t):
        yield from loop(v, re, (*path, k))    # <- carry re
    elif isinstance(t, str):
      if re.search(t):                        # <- re.search
        yield path, t
  yield from loop(t, compile(q), ())          # <- compile q

Now we can search using a regular expression:

for (path, value) in search_re(d, r"[abhl]{4}"):
  print(path, value)

Result:

('key1',) blah
('key2',) blah blah
('nestedlist', 0, 'nestednestedlist', 0, 'keyA') blah blah blah
('nestedlist', 0, 'nestednestedlist', 1, 'keyZ') blah blah blah
('nestedlist', 0, 'anothernestednestedlist', 0, 'keyQ') blah blah
('nestedlist', 0, 'anothernestednestedlist', 1, 'keyW') blah
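
If flags such as case-insensitive matching are needed, one option is to compile the pattern yourself and hand the compiled object to the search. The following is a sketch under that assumption; `search_pattern` and `sample` are hypothetical names, and the function expects an already compiled pattern rather than a string.

```python
import re

def search_pattern(node, pattern, path=()):
    """Yield (path, value) for each string value where pattern.search succeeds."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from search_pattern(child, pattern, (*path, key))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from search_pattern(child, pattern, (*path, index))
    elif isinstance(node, str) and pattern.search(node):
        yield path, node

# Hypothetical sample data; 42 is ignored because it is not a string
sample = {'key1': 'Blah', 'key2': 'other', 'items': ['BLAH blah', 42]}
pattern = re.compile(r'blah', re.IGNORECASE)  # flags are fixed at compile time

print(list(search_pattern(sample, pattern)))
# [(('key1',), 'Blah'), (('items', 0), 'BLAH blah')]
```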

Let's try another search with a different query:

for (path, value) in search_re(d, r"[dfs]{3}"):
  print(path, value)
# ('nestedlist', 0, 'anothernestednestedlist', 0, 'id') asdf

Finally, search_substr and search_re yield nothing when the query has no match:

print(list(search_re(d, r"zzz")))
# []
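
The tuple paths yielded by the generators can be used to look values up again, or rendered in the bracket notation the question asked for. This is a small sketch assuming two hypothetical helpers, `get_by_path` and `format_path`, built on `functools.reduce` and `operator.getitem`; the `sample` data is made up for the example.

```python
from functools import reduce
from operator import getitem

def get_by_path(data, path):
    """Follow a tuple of keys and indexes down into nested dicts and lists."""
    return reduce(getitem, path, data)

def format_path(path):
    """Render a tuple path in bracket notation, e.g. ['a'][0]['b']."""
    return ''.join('[{!r}]'.format(step) for step in path)

# Hypothetical sample data and path
sample = {'nestedlist': [{'id': 'qwerty', 'items': [{'keyA': 'blah'}]}]}
path = ('nestedlist', 0, 'items', 0, 'keyA')

print(get_by_path(sample, path))
# blah
print(format_path(path))
# ['nestedlist'][0]['items'][0]['keyA']
```

An empty path returns the input unchanged from `get_by_path` and an empty string from `format_path`.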