Multiple errors during HDF5 to CSV conversion

Question

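I have a huge h5 file and need to extract each dataset into a separate CSV file. The schema is something like /Genotypes/GroupN/SubGroupN/calls, with 'N' groups and 'N' subgroups. I created a sample h5 file with the same structure as the main file and tested the code on it, where it works correctly, but when I apply the code to my main h5 file it runs into various errors. The schema of the HDF5 file: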

/Genotypes
    /genotype a
        /genotype a_1 #one subgroup for each genotype group
            /calls #data that I need to extract to csv file
            depth #data
    /genotype b
        /genotype b_1 #one subgroup for each genotype group
            /calls #data
            depth #data
    .
    .
    .
    /genotype n #1500 genotypes are listed as groups
        /genotype n_1
            /calls 
            depth

/Positions
    /allel #data 
    chromo #data#
/Taxa 
    /genotype a
        /genotype a_1
    /genotype b
        /genotype b_1 #one subgroup for each genotype group
    .
    .
    .
    /genotype n #1500 genotypes are listed as groups
        /genotype n_1

/_Data-Types_
    Enum_Boolean
    String_VariableLength

This is the code that creates the sample h5 file:

import h5py
import numpy as np

ngrps = 2    # number of top-level groups
nsgrps = 3   # number of subgroups per group
nds = 4      # number of datasets per subgroup
nrows = 10
ncols = 2

i_arr_dtype = ( [ ('col1', int), ('col2', int) ] )
with h5py.File('d:/Path/sample_file.h5', 'w') as h5w :
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_'+str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_'+str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1,100, (nrows,ncols) )
                ds = grp2.create_dataset('calls_'+str(dcnt), data=i_arr)

I used h5py and numpy as follows:

import h5py
import numpy as np

def dump_calls2csv(name, node):    

    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
       print ('visiting object:', node.name, ', exporting data to CSV')
       csvfname = node.name[1:].replace('/','_') +'.csv'
       arr = node[:]
       np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################    

with h5py.File('d:/Path/sample_file.h5', 'r') as h5r :        
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!

I also used PyTables as follows:

import tables as tb
import numpy as np

with tb.File('sample_file.h5', 'r') as h5r :     
    for node in h5r.walk_nodes('/',classname='Leaf') :         
       print ('visiting object:', node._v_pathname, 'export data to CSV')
       csvfname = node._v_pathname[1:].replace('/','_') +'.csv'
       np.savetxt(csvfname, node.read(), fmt='%5d', delimiter=',')

However, each approach fails with the errors shown below:

 C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
visiting object: /Genotypes/Genotype a/genotye a_1/calls , exporting data to CSV
.
.
.
some of the datasets
.
.
.
Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 31, in <module>
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 564, in proxy
    return func(name, self[name])
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 10, in dump_calls2csv
    np.savetxt(csv_name, arr, fmt='%5d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1377, in savetxt
    open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_Genotype_Name-Genotype_Name2_calls.csv'

Process finished with exit code 1

The error for the second code is:

C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'locked' in node 'Genotypes'. Offending HDF5 class: 8
  value = self._g_getattr(self._v_node, name)
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'retainRareAlleles' in node 'Genotypes'. Offending HDF5 class: 8
  value = self._g_getattr(self._v_node, name)
visiting object: /Genotypes/AlleleStates export data to CSV
Traceback (most recent call last):
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1447, in savetxt
    v = format % tuple(row) + newline
TypeError: %d format: a number is required, not numpy.bytes_

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 40, in <module>
    np.savetxt(csvfname, node.read(), fmt= '%d', delimiter=',')
  File "<__array_function__ internals>", line 6, in savetxt
  File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1451, in savetxt
    % (str(X.dtype), format))
TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d')

Process finished with exit code 1

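Can anyone help me with this issue? Please mention the exact changes I need to apply to the code and provide the complete code, as my background is not in coding. Further explanation would be even better.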

Answer 1

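This is not a complete answer yet. I am using it to format my questions about your comments above.
Do the group/dataset names contain spaces?
If so, I think that is a problem in my simple example. I create each CSV file name from the group/dataset path, replacing each '/' with '_'. You will need to do the same for spaces: add .replace(' ','-') to replace each ' ' with '-'. Print the csvfname variable to confirm that it works as expected and yields a valid file name.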

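If that is not enough to solve your problem, read on.
As I understand it, /Genotypes/genotype a/genotype a-1/calls is the dataset you want to write to CSV (one for each genotype x/genotype x-i/calls dataset). If so, you may have a mismatch between the data in the dataset and the format used to write it. First, print the dtype in dump_calls2csv() like this: print(arr.dtype), and comment out the np.savetxt() line until that works. From the error message, I expect you will get "|S1" rather than an integer type, which is a problem because my example writes with an integer format: fmt='%d'. Ideally, you get the dtype of the dataset/array and then build the fmt= string to match.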
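One way to build the fmt= string from the dtype is sketched below. The helper names fmt_for_dtype and decode_if_bytes are hypothetical (they are not part of NumPy or h5py), and the mapping from dtype kind to format is an assumption that may need adjusting for your data:

```python
import numpy as np

def fmt_for_dtype(dtype):
    """Pick a np.savetxt format string from the dtype kind (hypothetical helper)."""
    kind = np.dtype(dtype).kind
    if kind in 'iu':       # signed or unsigned integers
        return '%d'
    if kind == 'f':        # floating point
        return '%g'
    return '%s'            # bytes ('S'), str ('U') and anything else

def decode_if_bytes(arr):
    """Convert a bytes array such as '|S1' to str so cells are not written as b'A'.

    Assumes the bytes are ASCII; non-ASCII bytes would raise UnicodeDecodeError.
    """
    if arr.dtype.kind == 'S':
        return arr.astype(str)
    return arr
```

Inside dump_calls2csv() this could be used as arr = decode_if_bytes(node[:]) followed by np.savetxt(csvfname, arr, fmt=fmt_for_dtype(arr.dtype), delimiter=',').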

Hope that helps. If not, please update your question with the new information.

Answer 2

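I downloaded the example from your comment. This is a new answer based on what I found. If all calls datasets hold integer data, the fmt='%d' format should work. The only problem I found was invalid characters in the file names created from the group/dataset path. For example, : and ? are used in some group names. I modified dump_calls2csv() to replace : with - and ? with #. Run it like this and all calls datasets should be written to CSV files. See the new code below: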

def dump_calls2csv(name, node):         
    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
       csvfname = node.name[1:] +'.csv'
       csvfname = csvfname.replace('/','_') # create csv file name from path
       csvfname = csvfname.replace(':','-') # modify invalid character
       csvfname = csvfname.replace('?','#') # modify invalid character
       print ('export data to CSV:', csvfname)
       np.savetxt(csvfname, node[:], fmt='%d', delimiter=',')

I print csvfname to confirm that the character replacement works as expected. It also helps identify the problem dataset if a name is wrong.
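Because the full export takes a long time, it may be more convenient to record a failing dataset and keep going than to stop at the first bad name. The following is a minimal sketch of that idea; try_export and the failures list are hypothetical names introduced here, not part of the code above:

```python
import numpy as np

def try_export(csvfname, arr, failures):
    """Write arr to csvfname; on failure, record the name and error instead of stopping."""
    try:
        np.savetxt(csvfname, arr, fmt='%d', delimiter=',')
        return True
    except (OSError, TypeError) as err:
        # OSError: invalid file name or path; TypeError: dtype does not match fmt
        failures.append((csvfname, repr(err)))
        return False
```

Calling try_export(csvfname, node[:], failures) in place of np.savetxt() and printing failures after visititems() returns would list every dataset that still needs attention.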

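Hope that helps. Be patient when you run this. When I tested, about half of the CSV files were written in 45 minutes.
At this point, I think the only problem is the characters in the file names, and it has nothing to do with HDF5, h5py, or np.savetxt(). For the general case (with arbitrary group/dataset names), a test should be added to check for any invalid file name characters.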
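For the general case, a single regular expression can replace every character that Windows rejects in file names. This is only a sketch: path_to_csv_name is a hypothetical helper, and using one replacement character for everything is an assumption made for simplicity:

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def path_to_csv_name(h5_path, replacement='_'):
    """Build a CSV file name from an HDF5 path such as '/Genotypes/a:1/calls'."""
    name = _INVALID_CHARS.sub(replacement, h5_path.strip('/'))
    name = name.replace(' ', replacement)   # optional: avoid spaces as well
    return name + '.csv'
```

In dump_calls2csv() the chain of .replace() calls could then become csvfname = path_to_csv_name(node.name). Note that two different paths can map to the same file name (for example 'a:b' and 'a?b'), so it is worth checking for collisions if the group names differ only in such characters.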
