How to store this type of numpy array in HDF5, where each row holds an int and a numpy array of ints whose length varies from row to row
My data looks like this:
array([[0, array([ 4928722, 3922609, 14413953, 10103423, 8948498])],
[1,
array([12557217, 5572869, 13415223, 2532000, 14609022, 9830632,
9800679, 7504595, 10752682])],
[2,
array([10458710, 7176517, 10268240, 4173086, 8617671, 4674075,
12580461, 2434641, 3694004, 9734870, 1314108, 8879955,
6468499, 12092464, 2962425, 13680848, 10590392, 10203584,
12816205, 7484678, 7985600, 12896218, 14882024, 6783345,
969850, 10709191, 4541728, 4312270, 6174902, 530425,
4843145, 4838613, 11404068, 9900162, 10578750, 12955180,
4602929, 4097386, 8870275, 7518195, 11849786, 2947773,
11653892, 7599644, 5895991, 1381764, 5853764, 11048535,
14128229, 11490202, 954680, 11998906, 9196156, 4506953,
6597761, 7034485, 3008940, 9816877, 1748801, 10159466,
2745090, 14842579, 788308, 5984365])],
...,
[62711, array([ 6159359, 5003282, 11818909, 11760670])],
[62712,
array([ 4363069, 8566447, 9547966, 14554871, 2108131, 12207856,
14840255, 13087558])],
[62713,
array([11252023, 8710787, 4233645, 11415316, 13888594, 7410770,
2298432, 9330913, 13715351, 8284109, 9142809, 3099529,
12366159, 10968492, 11123026, 1814941, 11209771, 10860521,
1798095, 4389487, 4461271, 10070622, 3689125, 880863,
13672430, 6677251, 10431890, 3447966, 12675925, 729773])]],
dtype=object)
Each row has an int followed by a numpy array of ints; the size of the second array varies between 2 and 200 integers.
I am trying to figure out how to save this to HDF5.
I tried this:
import h5py
h5f = h5py.File('data.h5', 'w')
h5f.create_dataset('dataset_1', data=sampleDF, compression='gzip', compression_opts=9)
But I got this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-6667d439c206> in <module>()
1 import h5py
2 h5f = h5py.File('data.h5', 'w')
----> 3 h5f.create_dataset('dataset_1', data=sampleDF, compression='gzip', compression_opts=9)
1 frames
/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
114 """
115 with phil:
--> 116 dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
117 dset = dataset.Dataset(dsid)
118 if name is not None:
/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
98 else:
99 dtype = numpy.dtype(dtype)
--> 100 tid = h5t.py_create(dtype, logical=1)
101
102 # Legacy
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
This appears to be caused by the second arrays having different lengths, so the rows end up stored with dtype 'object', which HDF5 cannot handle.
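A quick way to confirm this diagnosis (a sketch, assuming sampleDF is loaded as shown below):

print(sampleDF.dtype)        # -> object
print(type(sampleDF[0, 1]))  # -> <class 'numpy.ndarray'>
print(len(sampleDF[0, 1]), len(sampleDF[1, 1]))  # ragged: 5 vs. 9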
Is there a way to store this type of data in HDF5?
Here is code that reproduces the problem. It downloads and opens a small sample of my data. I have also provided a Colab notebook so anyone can run the code quickly without downloading anything to their system.
https://colab.research.google.com/drive/1kaaYk5_xbzQcXTr_DhjuWQT_3S4E-rML
Full code:
import requests
import pickle
import numpy as np
import pandas as pd

def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"
    session = requests.Session()
    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)
    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)
    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768
    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

download_file_from_google_drive('1-V6iSeGFlpiouerNDLYtG3BI4d5ZLMfu', 'sample.npy')
sampleDF = np.load('sample.npy', allow_pickle=True)

import h5py
h5f = h5py.File('data2.h5', 'w')
h5f.create_dataset('dataset_1', data=sampleDF, compression='gzip', compression_opts=9)
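For readers who cannot fetch the Google Drive file, a synthetic array with the same structure can stand in for sample.npy. This is a sketch with made-up values, using sizes in the 2-200 range described above:

import numpy as np

np.random.seed(0)
# each row: [int label, ragged int array of length 2-200]
rows = [[i, np.random.randint(0, 15000000, size=np.random.randint(2, 201))]
        for i in range(1000)]
sampleDF = np.array(rows, dtype=object)  # shape (1000, 2), dtype object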
As pointed out in the comments, h5py has 'vlen' for handling ragged arrays.
http://docs.h5py.org/en/stable/special.html#arbitrary-vlen-data
However, I could not figure out how to apply it. Here is my attempt:
h5f = h5py.File('data.h5', 'w')
dt = h5py.special_dtype(vlen=np.dtype('int32'))
h5f.create_dataset('dataset_1', data=sampleDF, dtype=dt, compression='gzip', compression_opts=9)
This is the result:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
ValueError: Cannot return member number (operation not supported for type class)
Exception ignored in: 'h5py._proxy.make_reduced_type'
ValueError: Cannot return member number (operation not supported for type class)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-4256da5cbf76> in <module>()
2 h5f = h5py.File('data2.h5', 'w')
3 dt = h5py.special_dtype(vlen=np.dtype('int32'))
----> 4 h5f.create_dataset('dataset_1', data=new_array, dtype=dt, compression='gzip', compression_opts=9)
1 frames
/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
114 """
115 with phil:
--> 116 dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
117 dset = dataset.Dataset(dsid)
118 if name is not None:
/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times)
141
142 if (data is not None) and (not isinstance(data, Empty)):
--> 143 dset_id.write(h5s.ALL, h5s.ALL, data)
144
145 return dset_id
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5d.pyx in h5py.h5d.DatasetID.write()
h5py/_proxy.pyx in h5py._proxy.dset_rw()
h5py/_proxy.pyx in h5py._proxy.needs_proxy()
ValueError: Not a datatype (not a datatype)
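The failure seems to come from passing the entire 2-D object array, integer column included, to create_dataset: the vlen dtype applies to a 1-D sequence of variable-length arrays. A minimal sketch on toy data (values made up), following the pattern from the h5py docs linked above:

import numpy as np
import h5py

ragged = [np.array([1, 2, 3], dtype=np.int32),
          np.array([4, 5], dtype=np.int32)]

dt = h5py.special_dtype(vlen=np.dtype('int32'))
with h5py.File('toy.h5', 'w') as f:
    dset = f.create_dataset('ragged', (len(ragged),), dtype=dt)
    for i, row in enumerate(ragged):
        dset[i] = row  # each element is a variable-length int32 array

with h5py.File('toy.h5', 'r') as f:
    print(f['ragged'][0])  # -> [1 2 3]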
As @kcw78 pointed out, store the columns separately.
Store:
h5f = h5py.File('data.h5', 'w')
dt = h5py.special_dtype(vlen=np.dtype('int32'))
# ragged column: one variable-length int32 array per row
h5f.create_dataset('batch', data=sampleDF[:, 1], dtype=dt, compression='gzip', compression_opts=9)
# label column is plain int32, so it does not need the vlen dtype
h5f.create_dataset('labels', data=sampleDF[:, 0].astype(np.int32), compression='gzip', compression_opts=9)
h5f.close()
Read back:
h5f2 = h5py.File('data.h5','r')
resurrectedDF = np.column_stack(( h5f2['labels'][:] , h5f2['batch'][:] ))
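A quick sanity check of the round trip, assuming the original sampleDF is still in memory (a sketch, not part of the original answer):

# labels come back as int32 scalars, the ragged column as int32 arrays
assert (resurrectedDF[:, 0] == sampleDF[:, 0].astype(np.int32)).all()
for restored, original in zip(resurrectedDF[:, 1], sampleDF[:, 1]):
    assert (restored == original.astype(np.int32)).all()
h5f2.close()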