HDFStore append to dataset - works for smaller subset, breaks for larger subset

Question

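I am loading a high-dimensional dataset (90 × 80000) into a pandas DataFrame in chunks, and I want to write it to an .hdf5 file using HDFStore. I split the dataset into two matrices: 90 × 6 and 90 × (the remaining columns).

I am using the approach outlined in the cookbook and have tried various solutions I found on the internet, but to no avail. I think the problem may be that the header of the second partition is too large (given the 64 KB limit). However, I thought I was passing only the matrix to the .append command, not the whole DataFrame.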
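A quick way to sanity-check the header-size suspicion is to estimate how many bytes the column labels alone would occupy once serialized. The sketch below is only a rough estimate under stated assumptions: the column names are hypothetical placeholders (the real ones come from `header`), and it assumes the labels are pickled into a single attribute, which is consistent with the error below failing on an attribute named 'non_index_axes' but is not confirmed here.

```python
import pickle

N_COLUMNS = 80000 - 6        # width of the large partition described above
HEADER_LIMIT = 64 * 1024     # the 64 KB limit mentioned above, in bytes

# Hypothetical column labels; the real names come from `header`.
column_names = ["col_{0}".format(k) for k in range(N_COLUMNS)]

# Assumption: all labels are serialized together into one attribute.
payload_bytes = len(pickle.dumps(column_names))

print("serialized labels: {0} bytes".format(payload_bytes))
print("exceeds the 64 KB limit: {0}".format(payload_bytes > HEADER_LIMIT))
```

If the estimate is far above the limit, as it is here, the labels of the wide partition would not fit in one header message regardless of how few rows each chunk contains.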

Here is my code:

i = 0
reader = pd.read_csv(dataFile, delim_whitespace=True, chunksize=10, names=header, skiprows=1)
for chunk in reader:
    # if i==0:
    #     print chunk.ix[:,6:].values
    #     store['df'] = chunk.ix[:,6:]
    #     print type(store['df'])
    #     print store['df'].shape
    # else:
    #     store.append('df', pd.DataFrame(chunk.ix[:,6:]))
    #     print store['df'].shape
    store.append('a',chunk.ix[:,6:])
    store.append('ID', chunk.ix[:,:6])         # the only command that works
    chunk.ix[:,6:].to_hdf(store, 'df', format="table", append=True)
    store.append('df', chunk.ix[:,6:].values)
    i += 1

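These are a bunch of options that I tried individually; none of them work, except for the small part of the data. The commented-out version writes the first chunk to the dataset, but then complains that it can only append to 'tables'.

General error message: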

Traceback (most recent call last):
  File "D:/OneDrive/Research/2016 Research Project/python files/rawToHdf5TutChunked.py", line 109, in <module>
    store.append('a',chunk.ix[:,6:])
  File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 919, in append
    **kwargs)
  File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 1264, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 3801, in write
    self.set_attrs()
  File "C:\Python27\lib\site-packages\pandas\io\pytables.py", line 3052, in set_attrs
    self.attrs.non_index_axes = self.non_index_axes
  File "C:\Python27\lib\site-packages\tables\attributeset.py", line 461, in __setattr__
    self._g__setattr(name, value)
  File "C:\Python27\lib\site-packages\tables\attributeset.py", line 403, in _g__setattr
    self._g_setattr(self._v_node, name, stvalue)
  File "tables\hdf5extension.pyx", line 715, in tables.hdf5extension.AttributeSet._g_setattr (tables\hdf5extension.c:7315)
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "J:\dev\src\hdf5_1_8_cmake\src\H5A.c", line 259, in H5Acreate2
    unable to create attribute
  File "J:\dev\src\hdf5_1_8_cmake\src\H5Aint.c", line 275, in H5A_create
    unable to create attribute in object header
  File "J:\dev\src\hdf5_1_8_cmake\src\H5Oattribute.c", line 347, in H5O_attr_create
    unable to create new attribute in header
  File "J:\dev\src\hdf5_1_8_cmake\src\H5Omessage.c", line 224, in H5O_msg_append_real
    unable to create new message
  File "J:\dev\src\hdf5_1_8_cmake\src\H5Omessage.c", line 1945, in H5O_msg_alloc
    unable to allocate space for message
  File "J:\dev\src\hdf5_1_8_cmake\src\H5Oalloc.c", line 1142, in H5O_alloc
    object header message is too large

End of HDF5 error back trace

Can't set attribute 'non_index_axes' in node:
 /a (Group) ''.
Closing remaining open files:t.hdf5...done

I am not used to working with large datasets, so any input is welcome.

Answer 1

This workaround runs fine now. However, I gave up on using HDFStore. Basically, I create a dataset with h5py, over-allocate it, and then trim the empty rows afterwards:

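# Assumed to be defined earlier in the script: pd (pandas), datafile, header,
# args, f_hdf5 (an open h5py file) and j (a counter from the enclosing code).
reader = pd.read_csv(datafile, delim_whitespace=True, chunksize=args.chunksize, names=header, skiprows=1)
max_rows_est = 880000   # estimate of the maximum number of rows per file
i = 0                   # number of chunks processed so far
n_rows = 0              # number of rows written so far

for chunk in reader:
    big_chunk = chunk.iloc[:, 6:]   # positional slice; .ix is gone from recent pandas
    big_chunk_shape = big_chunk.shape
    if i == 0:
        print('creating dataset {0}'.format(j + 1))
        # Over-allocate the rows; maxshape=(None, ...) keeps the dataset resizable.
        dset2_data = f_hdf5.create_dataset('{0}'.format(j + 1), (max_rows_est, big_chunk_shape[1]), maxshape=(None, big_chunk_shape[1]), dtype='<i4')
    # Offset by the running row count so a shorter final chunk lands in the right place.
    dset2_data[n_rows:n_rows + big_chunk_shape[0], :] = big_chunk.values
    n_rows += big_chunk_shape[0]
    i += 1

# Trim the unused, over-allocated rows.
dset2_data.resize((n_rows, big_chunk_shape[1]))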
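One detail worth checking in a loop like this is how the destination row range is computed. If the number of rows in the file is not a multiple of the chunk size, the final chunk is shorter, and a slice derived from the loop counter times the current chunk's row count no longer lines up with the rows already written. Tracking a running row count avoids that. The following is a minimal pure-Python sketch of the bookkeeping only; the function names and chunk lengths are hypothetical and no HDF5 file is involved.

```python
def running_offsets(chunk_lengths):
    """Return (start, stop) row ranges based on a running row count."""
    ranges = []
    n_rows = 0
    for rows in chunk_lengths:
        ranges.append((n_rows, n_rows + rows))
        n_rows += rows
    return ranges


def counter_offsets(chunk_lengths):
    """Return (start, stop) row ranges computed as i * rows .. (i + 1) * rows."""
    return [(i * rows, (i + 1) * rows) for i, rows in enumerate(chunk_lengths)]


# Hypothetical example: 34 rows read with chunksize=10, so the last chunk has 4 rows.
chunk_lengths = [10, 10, 10, 4]

running = running_offsets(chunk_lengths)
by_counter = counter_offsets(chunk_lengths)

print(running[-1])     # (30, 34): appended after the rows already written
print(by_counter[-1])  # (12, 16): overlaps rows written by the second chunk
```

Both approaches agree as long as every chunk has the same length; they diverge only on a ragged final chunk.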
