What's the most Pythonic way to deal with a list of nested dictionaries?

What is the most Pythonic way to identify the different types of nested dictionaries the API returns, so that the correct kind of parsing can be applied?

I'm making API calls to Reddit to get URLs, and I'm getting back nested dictionaries with different key names and different nested-dictionary structures.
I'm extracting the URLs I need, but I want a more Pythonic way to identify the different key names and the different structures of the nested dictionaries, because the if statements I tried in a single for loop run into errors: if a dictionary doesn't contain the key, I get a NoneType error just from the if statement "asking" whether the key is in the dictionary.

In the paragraphs below I describe the problem in more detail, but you can dig straight into the dictionary examples and my code and see the issue: I can't identify which of the three types of dictionary I have in a single pass. The nested dictionaries don't share the same structure, and my code is littered with trys and what I think are redundant for loops.

I have a function that handles the three types of nested dictionaries. topics_data (used below) is a Pandas dataframe, and vid is the name of the column in topics_data that contains the nested dictionaries. Sometimes the object in a vid cell is None, if the post I'm reading is not a video post.

There are only three main types of nested dictionaries the API returns (when the value isn't None). My biggest problem is identifying the first key name without getting a NoneType error: if I try to catch the nested dictionaries whose first key is reddit_video with an if statement, but the dictionary starts with another key (e.g. type) instead, the if statement blows up. Because of this problem, I iterate over the list of nested dictionaries three times, once for each of the three dictionary types. I'd like to be able to iterate over the list once and identify and handle each type of nested dictionary in a single pass.
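
To make the failure concrete, here is a minimal sketch of what goes wrong (the value of i is just an illustration):

i = {'type': 'gfycat.com'}   # a dict whose first key is not 'reddit_video'; i can also be None

# i['reddit_video'] raises a KeyError here (and a "'NoneType' object is not
# subscriptable" TypeError when i is None), so the if never gets to evaluate to False:
# if i['reddit_video']:
#     ...

# Checking for the key first avoids both errors:
if i is not None and 'reddit_video' in i:
    print(i['reddit_video']['fallback_url'])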

Below are examples of the three different types of nested dictionaries I get, along with the ugly code I currently have set up to handle them. My code works, but it's ugly. Please dig in and take a look.

The nested dictionaries...

Nested dictionary one

{'reddit_video': {'fallback_url': 'https://v.redd.it/te7wsphl85121/DASH_2_4_M?source=fallback',
  'height': 480,
  'width': 480,
  'scrubber_media_url': 'https://v.redd.it/te7wsphl85121/DASH_600_K',
  'dash_url': 'https://v.redd.it/te7wsphl85121/DASHPlaylist.mpd?a=1604490293%2CYmQzNDllMmQ4MDVhMGZhODMyYmIxNDc4NTZmYWNlNzE2Nzc3ZGJjMmMzZGJjMmYxMjRiMjJiNDU4NGEzYzI4Yg%3D%3D&v=1&f=sd',
  'duration': 17,
  'hls_url': 'https://v.redd.it/te7wsphl85121/HLSPlaylist.m3u8?a=1604490293%2COTg2YmIxZmVmZGNlYTVjMmFiYjhkMzk5NDRlNWI0ZTY4OGE1NzgxNzUyMDhkYjFiNWYzN2IxYWNkZjM3ZDU2YQ%3D%3D&v=1&f=sd',
  'is_gif': False,
  'transcoding_status': 'completed'}}

Nested dictionary two

{'type': 'gfycat.com',
 'oembed': {'provider_url': 'https://gfycat.com',
  'description': 'Hi! We use cookies and similar technologies ("cookies"), including third-party cookies, on this website to help operate and improve your experience on our site, monitor our site performance, and for advertising purposes. By clicking "Accept Cookies" below, you are giving us consent to use cookies (except consent is not required for cookies necessary to run our site).',
  'title': 'Protestors in Hong Kong are cutting down facial recognition towers.',
  'type': 'video',
  'author_name': 'Gfycat',
  'height': 600,
  'width': 600,
  'html': '<iframe class="embedly-embed" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgfycat.com%2Fifr%2Fedibleunrulyargentineruddyduck&display_name=Gfycat&url=https%3A%2F%2Fgfycat.com%2Fedibleunrulyargentineruddyduck-hong-kong-protest&image=https%3A%2F%2Fthumbs.gfycat.com%2FEdibleUnrulyArgentineruddyduck-size_restricted.gif&key=ed8fa8699ce04833838e66ce79ba05f1&type=text%2Fhtml&schema=gfycat" width="600" height="600" scrolling="no" title="Gfycat embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>',
  'thumbnail_width': 280,
  'version': '1.0',
  'provider_name': 'Gfycat',
  'thumbnail_url': 'https://thumbs.gfycat.com/EdibleUnrulyArgentineruddyduck-size_restricted.gif',
  'thumbnail_height': 280}}

Nested dictionary three

{'oembed': {'provider_url': 'https://gfycat.com',
  'description': 'Hi! We use cookies and similar technologies ("cookies"), including third-party cookies, on this website to help operate and improve your experience on our site, monitor our site performance, and for advertising purposes. By clicking "Accept Cookies" below, you are giving us consent to use cookies (except consent is not required for cookies necessary to run our site).',
  'title': 'STRAYA! Ski-roos.   Stephan Grenfell for Australian Geographic',
  'author_name': 'Gfycat',
  'height': 338,
  'width': 600,
  'html': '<iframe class="embedly-embed" src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fgfycat.com%2Fifr%2Fhairyvibrantamericanratsnake&display_name=Gfycat&url=https%3A%2F%2Fgfycat.com%2Fhairyvibrantamericanratsnake-snow-kangaroos&image=https%3A%2F%2Fthumbs.gfycat.com%2FHairyVibrantAmericanratsnake-size_restricted.gif&key=ed8fa8699ce04833838e66ce79ba05f1&type=text%2Fhtml&schema=gfycat" width="600" height="338" scrolling="no" title="Gfycat embed" frameborder="0" allow="autoplay; fullscreen" allowfullscreen="true"></iframe>',
  'thumbnail_width': 444,
  'version': '1.0',
  'provider_name': 'Gfycat',
  'thumbnail_url': 'https://thumbs.gfycat.com/HairyVibrantAmericanratsnake-size_restricted.gif',
  'type': 'video',
  'thumbnail_height': 250},
 'type': 'gfycat.com'}  

My function that handles these three kinds of nested dictionaries. topics_data is a Pandas dataframe, and vid is the column in topics_data that contains the nested dictionaries (or None).

def download_vid(topics_data, ydl_opts):
    for i in topics_data['vid']:
        try:
            if i['reddit_video']:
                B = i['reddit_video']['fallback_url']
                with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                    ydl.download([B])

                print(B)
        except:
            pass
    for n, i in enumerate(topics_data['vid']):
        try:
            if i['type'] == 'gfycat.com':
                C = topics_data.loc[n]['vid']['oembed']['thumbnail_url'].split('/')[-1:][0].split('-')[0]
                C = 'https://giant.gfycat.com/'+ C +'.mp4'
                sub = str(topics_data.loc[n]['subreddit']).lower()
                urllib.request.urlretrieve(C,
                                           '/media/iii/Q2/tor/Reddit/Subs/'+sub+'/'+C.split('/')[-1:][0])

                print(C)
        except:
            pass
    for n, i in enumerate(topics_data['vid']):
        try:
            if i['oembed']['thumbnail_url']:
                D = topics_data.loc[n]['vid']['oembed']['thumbnail_url'].split('/')[-1:][0].split('-')[0]
                D = 'https://giant.gfycat.com/'+ D +'.mp4'
                sub = str(topics_data.loc[n]['subreddit']).lower()
                urllib.request.urlretrieve(D, '/media/iii/Q2/tor/Reddit/Subs/'+sub+'/'+D.split('/')[-1:][0])
                print(D)
        except:
            pass

After writing this code, I realised the if statements are redundant, because each try block will simply try to parse topics_data.loc[n]['vid']['oembed'] and either succeed or fail anyway.
Don't get hung up on how the nested dictionaries are parsed, because that isn't my question. My question is mainly how to identify which type of nested dictionary the iterator is holding. I assume this can all be handled in one for loop instead of three.
One last issue is that there is occasionally a fourth, fifth or sixth type of dictionary that I'm not interested in parsing, because they are too rare.
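
One way to fold "skip the rare shapes" into a single pass would be a small classifier along these lines (just a sketch; classify_vid is a made-up name, not part of my code):

def classify_vid(i):
    """Label the known nested-dict shapes, or return None for anything to skip."""
    if i is None:
        return None
    if 'reddit_video' in i:
        return 'reddit_video'
    if i.get('type') == 'gfycat.com':
        return 'gfycat'
    if 'oembed' in i:
        return 'oembed'
    return None   # rare fourth/fifth/sixth shapes are ignored

for i in topics_data['vid']:
    kind = classify_vid(i)
    if kind is None:
        continue
    # ...parse according to kind...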

This last bit of code probably isn't necessary, but I'm adding it just to make the question complete. My function that identifies and parses the dictionaries also takes arguments for youtube-dl.

def my_hook(d):
    if d['status'] == 'finished':
        print('Done downloading, now converting ...')

def yt_dl_opts(topics_data):
    ydl_opts = {
        'format': 'bestvideo+bestaudio/37/22/18/best',
        'merge': 'mp4',
        'noplaylist' : True,        
        'progress_hooks': [my_hook],
        'outtmpl' : '/media/iii/Q2/tor/Reddit/Subs/'+ str(topics_data.loc[0]['subreddit']).lower()+'/%(id)s'
    }
    return ydl_opts  
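
For completeness, the two functions are called roughly like this (the exact call site isn't shown; this is just a sketch):

ydl_opts = yt_dl_opts(topics_data)    # build the youtube-dl options for this subreddit
download_vid(topics_data, ydl_opts)   # identify each nested dict and download its video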

UPDATE
Here is the answer to the question, worked out with Neil's help. Just to make the Q&A clearer for posterity.
Everything is still wrapped in try: except: pass, because the API still occasionally returns random, new dict structures. I also wrote a loop that counts the video results that are not None and counts all the successfully downloaded videos using os.walk (a sketch of that counting loop follows the function below).

def download_vid(topics_data, ydl_opts):
    y_base = 'https://www.youtube.com/watch?v='
    for n, i in enumerate(topics_data['vid']):
        try:
            if 'type' in i:
                if 'youtube.com' in i['type']:
                    print('This is a Youtube Video')
                    A = i['oembed']['html'].split('embed/')[1].split('?')[0]
                    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                        ydl.download([A])
                    print(y_base+A)

            if 'reddit_video' in i:
                print('This is a reddit_video Video')
                B = i['reddit_video']['fallback_url']
                with youtube_dl.YoutubeDL(ydl_opts) as ydl:
                    ydl.download([B])
                print(B)

            if 'type' in i:
                if 'gfycat.com' in i['type']:
                    print('This is a type, gfycat Video')
                    C = topics_data.loc[n]['vid']['oembed']['thumbnail_url'].split('/')[-1:][0].split('-')[0]
                    C = 'https://giant.gfycat.com/'+ C +'.mp4'
                    sub = str(topics_data.loc[n]['subreddit']).lower()
                    urllib.request.urlretrieve(C,
                                       '/media/iii/Q2/tor/Reddit/Subs/'+sub+'/'+C.split('/')[-1:][0])
                    print(C)

            if 'oembed' in i:
                print('This is a oembed, gfycat Video')
                D = topics_data.loc[n]['vid']['oembed']['thumbnail_url'].split('/')[-1:][0].split('-')[0]
                D = 'https://giant.gfycat.com/'+ D +'.mp4'
                sub = str(topics_data.loc[n]['subreddit']).lower()
                urllib.request.urlretrieve(D, '/media/iii/Q2/tor/Reddit/Subs/'+sub+'/'+D.split('/')[-1:][0])
                print(D)
        except:
            pass
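
The counting loop itself isn't shown here, but a minimal sketch of the idea (count_vids and sub_dir are made-up names) could look like:

import os

def count_vids(topics_data, sub_dir):
    # posts that actually carried a nested video dict
    expected = sum(1 for i in topics_data['vid'] if i is not None)
    # files that actually landed on disk under the subreddit's download directory
    downloaded = sum(len(files) for _, _, files in os.walk(sub_dir))
    print(downloaded, 'of', expected, 'videos downloaded')
    return expected, downloaded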

UPDATE: I realised the OP's text is dealing with non-unique lookups. I've added a paragraph describing how to do that.

If you find yourself looping over a list of dictionaries multiple times to perform lookups, restructure the list into a dictionary so that the lookup becomes a key access. For example:

a = [{"id": 1, "value": "foo"}, {"id": 2, "value": "bar"}]
for item in a:
    if item["id"] == 1:
        print(item["value"])

can be turned into this:

a = [{"id": 1, "value": "foo"}, {"id": 2, "value": "bar"}]
a = {item["id"]: item for item in a} # index by lookup field

print(a[1]["value"]) # no loop
... # Now we can continue to look up by id, e.g. a[2], without a loop

If it's a non-unique lookup, you can do something similar:

indexed = {}
a = [{"category": 1, "value": "foo"}, {"category": 2, "value": "bar"}, {"category": 1, "value": "baz"}]
for item in a: # This loop only has to be executed once
    if indexed.get(item["category"], None) is not None:
        indexed[item["category"]].append(item)
    else:
        indexed[item["category"]] = [item]

# Now we can do:
all_category_1_data = indexed[1]
all_category_2_data = indexed[2]
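
If you prefer, the same grouping can be written with collections.defaultdict, which saves the explicit key check:

from collections import defaultdict

a = [{"category": 1, "value": "foo"}, {"category": 2, "value": "bar"}, {"category": 1, "value": "baz"}]

indexed = defaultdict(list)
for item in a:                              # still only one pass
    indexed[item["category"]].append(item)  # missing keys start out as empty lists

all_category_1_data = indexed[1]
all_category_2_data = indexed[2]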

If a key might be missing, it's easier to handle by indexing the dictionary with a default:

if a.get(1, None) is not None:
    print(a[1]["value"])
else:
    print("1 was not in the dictionary")

IMO there's nothing particularly "Pythonic" about any of this, but if the API is returning a list that you need to loop over, it may just be a badly designed API.

UPDATE: OK, I'll have a go at fixing your code:

def download_vid(topics_data, ydl_opts):
    indexed_data = {'reddit': [], 'gfycat': [], 'thumbnail': []}

    for item in topics_data['vid']:
        if item.get('reddit_video', None) is not None:
            indexed_data['reddit'].append(item)
        elif item.get('type', None) == "gfycat.com":
            indexed_data['gfycat'].append(item)
        elif item.get('oembed', None) is not None:
            if item['oembed'].get('thumbnail_url', None) is not None:
                indexed_data['thumbnail'].append(item)

    for k, items in indexed_data.items():
        assert k in ('reddit', 'gfycat', 'thumbnail')
        for v in items:
            if k == 'reddit':
                B = v['reddit_video']['fallback_url']
                ...
            elif k == 'gfycat':
                C = v['oembed']['thumbnail_url']
                ...
            elif k == 'thumbnail':
                D = v['oembed']['thumbnail_url']
                ...

In case it isn't clear why this is better:

  • The OP looped over topics_data['vid'] three times. I do it twice.

  • More importantly, if more dictionary types are added, I still only loop twice; the OP would have to loop over the data yet again (see the sketch after this list).

  • No exception handling.

  • Each group of objects is now indexed. So the OP can do, for example, indexed_data['gfycat'] to get all of those objects if needed, and that's a hash table lookup, so it's fast.
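
To illustrate the second point: if the API started returning a hypothetical fourth shape (say one keyed by secure_media, used here purely as an example), only the indexing pass grows by one branch, and the data is still visited the same number of times:

indexed_data = {'reddit': [], 'gfycat': [], 'thumbnail': [], 'secure': []}

for item in topics_data['vid']:
    if item.get('reddit_video') is not None:
        indexed_data['reddit'].append(item)
    elif item.get('type') == "gfycat.com":
        indexed_data['gfycat'].append(item)
    elif item.get('oembed', {}).get('thumbnail_url') is not None:
        indexed_data['thumbnail'].append(item)
    elif item.get('secure_media') is not None:   # hypothetical new shape
        indexed_data['secure'].append(item)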