Check if a dataset exists, using a regex, without first reading the paths of all datasets

How can I check whether a dataset exists, using something like a regular expression, without first reading the paths of all datasets?

For example, I want to check whether the dataset 'completed' exists in a file that may (or may not) contain

/123/completed

(Assume I do not know the full path in advance; I only want to check the dataset name. So this answer does not work in my case.)

Custom recursion

No regular expression is needed. You can build a set of dataset names by recursively traversing the groups in the HDF5 file:

import h5py

def traverse_datasets(hdf_file):

    """Traverse all datasets across all groups in HDF5 file."""

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = '{}/{}'.format(prefix, key)
            if isinstance(item, h5py.Dataset): # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group): # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    with h5py.File(hdf_file, 'r') as f:
        for (path, dset) in h5py_dataset_iterator(f):
            yield path.split('/')[-1]

all_datasets = set(traverse_datasets('file.h5'))

Then simply test for membership: 'completed' in all_datasets
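If a pattern is needed rather than an exact name, the same traversal can be combined with `re` and `any()` so that the search stops at the first match instead of collecting every name first. The following is only a minimal sketch: the helper names `iter_dataset_paths` and `dataset_name_exists` are hypothetical, and it assumes the file has no circular hard links and no dangling soft links.

```python
import re
import h5py

def iter_dataset_paths(group, prefix=''):
    """Lazily yield the full path of every dataset below `group`."""
    for key in group.keys():
        item = group[key]
        path = '{}/{}'.format(prefix, key)
        if isinstance(item, h5py.Dataset):
            yield path
        elif isinstance(item, h5py.Group):
            yield from iter_dataset_paths(item, path)

def dataset_name_exists(hdf_file, pattern):
    """Return True if any dataset name (last path component) fully matches `pattern`."""
    regex = re.compile(pattern)
    with h5py.File(hdf_file, 'r') as f:
        # any() consumes the generator lazily and stops at the first hit.
        return any(regex.fullmatch(path.rsplit('/', 1)[-1])
                   for path in iter_dataset_paths(f))
```

For instance, `dataset_name_exists('file.h5', r'completed')` would test for the exact name, while a pattern such as `r'comp.*'` would accept any dataset name starting with "comp".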

Group.visit

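Alternatively, you can use Group.visit. Note that the search function must return None so that the iteration continues over all groups; returning any other value stops the visit immediately.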

res = []

def searcher(name, k='completed'):
    """ Find all objects with k anywhere in the name """
    if k in name:
        res.append(name)
    return None  # always return None so that visit() keeps going

with h5py.File('file.h5', 'r') as f:
    group = f['/']
    group.visit(searcher)

print(res)  # names of all objects (groups and datasets) whose path contains k

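Both approaches are O(n) in the number of objects in the file: the name of each one has to be tested, but nothing more. If you need a lazy solution, the first option may be preferable.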
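Because a non-None return value stops the traversal, Group.visititems can also serve as a short-circuiting existence check. The sketch below is one possible way to do it, not the only one; the helper name `find_first_dataset` is hypothetical. It matches the regular expression against the last path component only and ignores groups.

```python
import re
import h5py

def find_first_dataset(hdf_file, pattern):
    """Return the path of the first dataset whose name matches `pattern`, else None."""
    regex = re.compile(pattern)

    def callback(name, obj):
        # `name` is the path relative to the group being visited.
        if isinstance(obj, h5py.Dataset) and regex.fullmatch(name.rsplit('/', 1)[-1]):
            return name  # any non-None value stops the traversal
        return None      # keep visiting

    with h5py.File(hdf_file, 'r') as f:
        # visititems returns whatever the callback returned last, or None.
        return f.visititems(callback)
```

A call such as `find_first_dataset('file.h5', r'completed') is not None` then answers the original question without building the full list of names.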

Recursively find all valid paths to datasets

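The following code uses recursion to find the valid data paths to all datasets. Once I have the valid paths (likely circular references are terminated after 3 repeats), I can apply a regular expression to the returned list (not shown).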

import numpy as np
import h5py
import collections
import warnings


def visit_data_sets(group, max_len_check=20, max_repeats=3):
    # print(group.name)
    # print(list(group.items()))

    if len(group.name) > max_len_check:
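        # This section terminates a likely circular reference once the current
        # group name has appeared at least max_repeats times in the path, with
        # equal spacing between its last occurrences. However, it will incorrectly
        # terminate a tree in which identical repetitive sequences of names are
        # genuinely used.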
        name_list = group.name.split('/')
        current_name = name_list[-1]
        res_list = [i for i in range(len(name_list)) if name_list[i] == current_name]
        res_deq = collections.deque(res_list)
        res_deq.rotate(1)
        res_deq2 = collections.deque(res_list)
        diff = [res_deq2[i] - res_deq[i] for i in range(0, len(res_deq))]

        if len(diff) >= max_repeats:
            if diff[-1] == diff[-2]:
                message = 'Terminating likely circular reference "{}"'.format(group.name)
                warnings.warn(message, UserWarning)
                print()
                return []

    dataset_list = list()
    for key, value in group.items():
        if isinstance(value, h5py.Dataset):
            current_path = group.name + '/{}'.format(key)
            dataset_list.append(current_path)
        elif isinstance(value, h5py.Group):
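            # pass the limits down so that non-default values apply to subgroups too
            dataset_list += visit_data_sets(value, max_len_check, max_repeats)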

        else:
            print('Unhandled class name {}'.format(value.__class__.__name__))

    return dataset_list

def visit_callback(name, obj):
    # "obj" rather than "object", to avoid shadowing the builtin
    print('Visiting name = "{}", object name = "{}"'.format(name, obj.name))
    return None

hdf_fptr = h5py.File('link_test.hdf5', mode='w')

group1 = hdf_fptr.require_group('/junk/group1')
group1a = hdf_fptr.require_group('/junk/group1/group1a')
# group1a1 = hdf_fptr.require_group('/junk/group1/group1a/group1ai')
group2 = hdf_fptr.require_group('/junk/group2')
group3 = hdf_fptr.require_group('/junk/group3')

# create a circular reference
group1ai = group1a['group1ai'] = group1


avect = np.arange(0,12.3, 1.0)

dset = group1.create_dataset('avect', data=avect)

group2['alias'] = dset
group3['alias3'] = h5py.SoftLink(dset.name)


print('\nThis demonstrates  "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"')
print('Visiting Root - {}'.format(hdf_fptr.name))
hdf_fptr.visititems(visit_callback)

print('\nThis demonstrates  "h5py visititems" visiting "group2" with a Hard Link to "avect"')
print('Visiting Group - {}'.format(group2.name))
group2.visititems(visit_callback)
print('\nThis demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"')
print('Visiting Group - {}'.format(group3.name))
group3.visititems(visit_callback)


print('\n\nNow demonstrate recursive visit of Root looking for datasets')
print('using the function "visit_data_sets" in this code snippet.\n')
data_paths = visit_data_sets(hdf_fptr)

for data_path in data_paths:
    print('Data Path = "{}"'.format(data_path))

hdf_fptr.close()

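The following output shows how "visititems" works, or, for my purposes, fails, when it comes to determining all valid paths, whereas the recursion meets my needs and possibly yours.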

This demonstrates  "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"
Visiting Root - /
Visiting name = "junk", object name = "/junk"
Visiting name = "junk/group1", object name = "/junk/group1"
Visiting name = "junk/group1/avect", object name = "/junk/group1/avect"
Visiting name = "junk/group1/group1a", object name = "/junk/group1/group1a"
Visiting name = "junk/group2", object name = "/junk/group2"
Visiting name = "junk/group3", object name = "/junk/group3"

This demonstrates  "h5py visititems" visiting "group2" with a Hard Link to "avect"
Visiting Group - /junk/group2
Visiting name = "alias", object name = "/junk/group2/alias"

This demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"
Visiting Group - /junk/group3


Now demonstrate recursive visit of Root looking for datasets
using the function "visit_data_sets" in this code snippet.

link_ref_test.py:26: UserWarning: Terminating likely circular reference "/junk/group1/group1a/group1ai/group1a/group1ai/group1a"

  warnings.warn(message, UserWarning)
Data Path = "/junk/group1/avect"
Data Path = "/junk/group1/group1a/group1ai/avect"
Data Path = "/junk/group1/group1a/group1ai/group1a/group1ai/avect"
Data Path = "/junk/group2/alias"
Data Path = "/junk/group3/alias3"

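The first "Data Path" result is the original dataset. The second and third are references to the original dataset caused by the circular reference. The fourth result is a hard link, and the fifth is a soft link, to the original dataset.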
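The regular-expression step that was left out above might look roughly like the following. This is only a sketch: the helper name `match_dataset_names` and the pattern are hypothetical, and the list simply repeats the "Data Path" lines from the output so that the example runs on its own.

```python
import re

# Paths copied from the "Data Path" output above.
data_paths = [
    '/junk/group1/avect',
    '/junk/group1/group1a/group1ai/avect',
    '/junk/group1/group1a/group1ai/group1a/group1ai/avect',
    '/junk/group2/alias',
    '/junk/group3/alias3',
]

def match_dataset_names(paths, pattern):
    """Return the paths whose last component fully matches `pattern`."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.fullmatch(p.rsplit('/', 1)[-1])]

print(match_dataset_names(data_paths, r'alias\d*'))
# ['/junk/group2/alias', '/junk/group3/alias3']
```

Matching only the last path component keeps group names such as "group1ai" from producing false positives; use `regex.search(p)` instead if the whole path should be considered.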