Multiple Errors During HDF5 to CSV conversion
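I have a huge h5 file and need to extract each dataset to a separate CSV file. The schema looks like /Genotypes/GroupN/SubGroupN/calls, with N groups and N subgroups. I created a sample h5 file with the same structure as the main file and tested code that works on it, but when I run the same code on my main h5 file it hits various errors.
Schema of the HDF5 file: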
/Genotypes
    /genotype a
        /genotype a_1    #one subgroup for each genotype group
            /calls       #data that I need to extract to csv file
            depth        #data
    /genotype b
        /genotype b_1    #one subgroup for each genotype group
            /calls       #data
            depth        #data
    .
    .
    .
    /genotype n          #1500 genotypes are listed as groups
        /genotype n_1
            /calls
            depth
/Positions
    /allel               #data
    chromo               #data
/Taxa
    /genotype a
        /genotype a_1
    /genotype b
        /genotype b_1    #one subgroup for each genotype group
    .
    .
    .
    /genotype n          #1500 genotypes are listed as groups
        /genotype n_1
/_Data-Types_
    Enum_Boolean
    String_VariableLength
Here is the code that creates the sample h5 file:
import h5py
import numpy as np

ngrps = 2
nsgrps = 3
nds = 4
nrows = 10
ncols = 2

i_arr_dtype = ( [ ('col1', int), ('col2', int) ] )
with h5py.File('d:/Path/sample_file.h5', 'w') as h5w :
    for gcnt in range(ngrps):
        grp1 = h5w.create_group('Group_'+str(gcnt))
        for scnt in range(nsgrps):
            grp2 = grp1.create_group('SubGroup_'+str(scnt))
            for dcnt in range(nds):
                i_arr = np.random.randint(1,100, (nrows,ncols) )
                ds = grp2.create_dataset('calls_'+str(dcnt), data=i_arr)
I used h5py with numpy as follows:
import h5py
import numpy as np

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
        print ('visiting object:', node.name, ', exporting data to CSV')
        csvfname = node.name[1:].replace('/','_') +'.csv'
        arr = node[:]
        np.savetxt(csvfname, arr, fmt='%5d', delimiter=',')

##########################

with h5py.File('d:/Path/sample_file.h5', 'r') as h5r :
    h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
I also used PyTables as follows:
import tables as tb
import numpy as np

with tb.File('sample_file.h5', 'r') as h5r :
    for node in h5r.walk_nodes('/',classname='Leaf') :
        print ('visiting object:', node._v_pathname, 'export data to CSV')
        csvfname = node._v_pathname[1:].replace('/','_') +'.csv'
        np.savetxt(csvfname, node.read(), fmt='%5d', delimiter=',')
But each method fails with the errors shown below. The error for the first code is:
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
visiting object: /Genotypes/Genotype a/genotye a_1/calls , exporting data to CSV
.
.
.
some of the datasets
.
.
.
Traceback (most recent call last):
File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 31, in <module>
h5r.visititems(dump_calls2csv) #NOTE: function name is NOT a string!
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 565, in visititems
return h5o.visit(self.id, proxy)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5o.pyx", line 355, in h5py.h5o.visit
File "h5py\defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
File "h5py\h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\h5py\_hl\group.py", line 564, in proxy
return func(name, self[name])
File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 10, in dump_calls2csv
np.savetxt(csv_name, arr, fmt='%5d', delimiter=',')
File "<__array_function__ internals>", line 6, in savetxt
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1377, in savetxt
open(fname, 'wt').close()
OSError: [Errno 22] Invalid argument: 'Genotypes_Genotype_Name-Genotype_Name2_calls.csv'
Process finished with exit code 1
The error for the second code is:
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\python.exe C:\Users\...\PycharmProjects\DLLearn\datapreparation.py
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'locked' in node 'Genotypes'. Offending HDF5 class: 8
value = self._g_getattr(self._v_node, name)
C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\tables\attributeset.py:308: DataTypeWarning: Unsupported type for attribute 'retainRareAlleles' in node 'Genotypes'. Offending HDF5 class: 8
value = self._g_getattr(self._v_node, name)
visiting object: /Genotypes/AlleleStates export data to CSV
Traceback (most recent call last):
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1447, in savetxt
v = format % tuple(row) + newline
TypeError: %d format: a number is required, not numpy.bytes_
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\...\PycharmProjects\DLLearn\datapreparation.py", line 40, in <module>
np.savetxt(csvfname, node.read(), fmt= '%d', delimiter=',')
File "<__array_function__ internals>", line 6, in savetxt
File "C:\Users\...\AppData\Local\Continuum\anaconda3\envs\DLLearn\lib\site-packages\numpy\lib\npyio.py", line 1451, in savetxt
% (str(X.dtype), format))
TypeError: Mismatch between array dtype ('|S1') and format specifier ('%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d,%d')
Process finished with exit code 1
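Can anyone help me solve this?
Please state the exact changes I need to make to the code and provide the complete code, since I do not have a coding background. Any further explanation would be even better.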
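This is not a complete answer yet. I am using it to format the questions from my comment above.
Do the group/dataset names contain spaces?
If so, I think that is a problem with my simple example. I create each CSV file name from the group/dataset path, replacing each '/' with '_'. You need to do the same for spaces: add .replace(' ','-') to replace each ' ' with '-'. Print the csvfname variable to confirm it works as expected (and produces a valid file name).
If that is not enough to solve your problem, read on.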
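As I understand it, /Genotypes/genotype a/genotype a-1/calls is the dataset you want to write to CSV (one file per genotype x/genotype x-i/calls dataset). If so, there may be a mismatch between the data in the dataset and the format used to write it. Start by printing the dtype in dump_calls2csv(), like this: print(arr.dtype). Comment out the np.savetxt() line until that works. From the error message, I expect you will get "|S1" instead of an integer type, which is a problem because my example writes with an integer format: fmt='%d'. Ideally, you get the dtype of the dataset/array and then build a fmt= string to match.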
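To make the "build a fmt= string to match the dtype" idea concrete, here is a minimal sketch. The helper names fmt_for_dtype, to_text_array and array_to_csv_text are my own (hypothetical, not part of h5py or NumPy), and the mapping from dtype kind to format is an assumption that covers only integers, floats and strings. Byte strings such as '|S1' are decoded first, because writing them with '%s' would otherwise put b'...' literals into the CSV.

```python
import io
import numpy as np

def fmt_for_dtype(dtype):
    """Pick an np.savetxt format string from the dtype kind (assumed mapping)."""
    kind = np.dtype(dtype).kind
    if kind in 'iu':        # signed or unsigned integers
        return '%d'
    if kind == 'f':         # floating point
        return '%.6g'
    if kind in 'SU':        # byte strings or unicode strings
        return '%s'
    raise TypeError(f'no CSV format defined for dtype {dtype}')

def to_text_array(arr):
    """Decode byte strings (e.g. '|S1') so the CSV holds plain text."""
    if arr.dtype.kind == 'S':
        return np.char.decode(arr, 'ascii')   # assumes ASCII content
    return arr

def array_to_csv_text(arr):
    """Write an array to an in-memory buffer with a dtype-matched format."""
    arr = to_text_array(np.asarray(arr))
    buf = io.StringIO()
    np.savetxt(buf, arr, fmt=fmt_for_dtype(arr.dtype), delimiter=',')
    return buf.getvalue()
```

Inside dump_calls2csv() the same two helpers could be applied to node[:] before the np.savetxt() call, passing csvfname instead of the in-memory buffer.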
Hope that helps. If not, please update your question with the new information.
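I downloaded the example from your comment. This is a new answer based on what I found. If all the calls datasets hold integer data, the fmt='%d' format should work.
The only problem I found was invalid characters in the file names created from the group/dataset paths. For example, : and ? are used in some group names. I modified dump_calls2csv() to replace : with - and ? with #.
Run it this way and all the calls datasets should be written to CSV files. See the new code below: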
import h5py
import numpy as np

def dump_calls2csv(name, node):
    if isinstance(node, h5py.Dataset) and 'calls' in node.name :
        csvfname = node.name[1:] +'.csv'
        csvfname = csvfname.replace('/','_') # create csv file name from path
        csvfname = csvfname.replace(':','-') # modify invalid character
        csvfname = csvfname.replace('?','#') # modify invalid character
        print ('export data to CSV:', csvfname)
        np.savetxt(csvfname, node[:], fmt='%d', delimiter=',')

# same driver as in the question; change the path to your own file
with h5py.File('d:/Path/sample_file.h5', 'r') as h5r :
    h5r.visititems(dump_calls2csv)
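I print csvfname to confirm that the character replacement works as expected. It also helps identify the problem dataset if a name is invalid.
Hope that helps. Be patient when you run this. When I tested it, about half of the CSV files were written in 45 minutes.
At this point, I think the only problem is the characters in the file names; it has nothing to do with HDF5, h5py, or np.savetxt(). For the general case (with arbitrary group/dataset names), you should add a check for any invalid file name characters.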
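For that general case, one possible approach is a single regular expression instead of one .replace() call per character. This is only a sketch: path_to_csv_name is a hypothetical helper name, the character set is the one Windows reserves in file names (plus ASCII control characters), and mapping everything to '_' is my assumption, so two different paths could end up with the same file name.

```python
import re

# Characters Windows rejects in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

def path_to_csv_name(h5_path, replacement='_'):
    """Build a CSV file name from an HDF5 object path such as '/A/B/calls'."""
    name = h5_path.strip('/')                     # drop leading/trailing '/'
    name = _INVALID_CHARS.sub(replacement, name)  # replace every invalid character
    name = name.rstrip(' .')                      # Windows also rejects trailing spaces and dots
    return name + '.csv'
```

In dump_calls2csv() this would stand in for the chain of .replace() calls, for example csvfname = path_to_csv_name(node.name). Printing csvfname is still worthwhile so that a name collision or an unexpected character is easy to spot.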