Store ndarray in a PyTable (and how to define the Col()-type)

Question

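TL;DR: I have a PyTable with a float32 Col and get an error when writing a numpy float32 array into it. (How) can I store a numpy array (float32) in a column of a PyTables table?

I'm new to PyTables. Following the recommendation of TFtables (a library for using HDF5 in Tensorflow), I'm using it to store all of my HDF5 data (currently spread in batches over several files, with three datasets each) in a table within a single HDF5 file. The datasets are: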

'data' : (n_elements, 1024, 1024, 4)@float32
'label' : (n_elements, 1024, 1024, 1)@uint8
'weights' : (n_elements, 1024, 1024, 1)@float32

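where the n_elements are spread over several files, which I now want to merge into one file (to allow unordered access).

So when I construct the table, I figured that each dataset represents a column. I built everything in a generic way that allows doing this for an arbitrary number of datasets: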

# gets dtypes (and shapes) of the dsets (accessed by dset_keys = ['data', 'label', 'weights'])
dtypes, shapes = _determine_shape(hdf5_files, dset_keys)

# to dynamically generate a table, I'm using a dict (not a class as in the PyTables tutorials)
# the dict is (conform with the doc): { 'col_name' : Col()-class-descendent }
table_description = {dset_keys[i]: tables.Col.from_dtype(dtypes[i]) for i in range(len(dset_keys))}

# create a file, a group-node and attach a table to it
h5file = tables.open_file(destination_file, mode="w", title="merged")
group = h5file.create_group("/", 'main', 'Node for data table')
table = h5file.create_table(group, 'data_table', table_description, "Collected data with %s" % (str(dset_keys)))

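The dtypes I get for each dataset (read with h5py) are apparently the dtypes of the numpy arrays (ndarray) that reading the dataset returns: float32 or uint8. So the Col() types are Float32Col and UInt8Col. I naively assumed that I could now write a float32 array into this column, but filling in data with: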

dummy_data = np.zeros([1024, 1024, 3], np.float32)  # normally data read from other files

sample = table.row
sample['data'] = dummy_data

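results in TypeError: invalid type (<class 'numpy.ndarray'>) for column ``data``. So now I feel stupid for assuming I could write an array in there. However, there is no "ArrayCol()" type on offer, and the PyTables docs give no hint as to whether or how an array can be written into a column. How do I do this?

The Col() class and its descendants take "shape" arguments, so it should be possible; otherwise, what are these for?!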

Answer 1

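Edit: I just saw that tables.Col.from_type(type, shape) allows using the type with its precision (float32 instead of just float). The rest stays the same (it takes a string and a shape).

The factory function tables.Col.from_kind(kind, shape) can be used to construct a Col type that supports ndarrays. What a "kind" is and how to use it is not documented anywhere I looked; however, by trial and error I found that the allowed kinds are strings naming the basic data types, i.e. 'float', 'uint', ... without the precision (not 'float64').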
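
To make the difference between the two factory functions concrete, here is a minimal sketch. The shape (2, 3) is a made-up placeholder, and the itemsize keyword of from_kind is used as I understand it from the PyTables API reference, so treat that call as an assumption to verify against your installed version.

```python
import tables

shape = (2, 3)  # hypothetical per-row array shape, chosen only for illustration

# "kind" carries no precision, so the column gets the kind's default item size.
by_kind = tables.Col.from_kind("float", shape=shape)

# Assumption: passing itemsize (in bytes) selects the precision explicitly.
by_kind_sized = tables.Col.from_kind("float", itemsize=4, shape=shape)

# "type" includes the precision, as noted in the edit above.
by_type = tables.Col.from_type("float32", shape=shape)

for col in (by_kind, by_kind_sized, by_type):
    print(col.type, col.shape, col.itemsize)
```

If the first line printed reports a wider type than float32, that is the precision silently changing when only the kind is given, which is the reason to prefer from_type for float32 data.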

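Since I get numpy dtypes from reading the datasets with h5py (dset.dtype), these have to be cast to str and the precision needs to be removed. In the end, the relevant lines look like this: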

# get key, dtype and shapes of elements per dataset from the datasource files
val_keys, dtypes, element_shapes = _get_dtypes(datasources, element_axis=element_axis)

# for storing arrays in columns apparently one has to use "kind"
# "kind" cannot be created with dtype but only a string representing 
# the dtype w/o precision, e.g. 'float' or 'uint' 
dtypes_kind = [''.join(i for i in str(dtype) if not i.isdigit()) for dtype in dtypes]

# create table description as dictionary
description = {val_keys[i]: tables.Col.from_kind(dtypes_kind[i], shape=element_shapes[i]) for i in range(len(val_keys))}
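
The digit-stripping step in the snippet above can be tried on its own. This small sketch uses plain strings in place of the numpy dtypes; dtype_to_kind is a hypothetical helper name, not part of PyTables.

```python
def dtype_to_kind(dtype_name: str) -> str:
    """Drop the precision digits from a dtype name, e.g. 'float32' -> 'float'."""
    return "".join(ch for ch in dtype_name if not ch.isdigit())


# Names as str(dtype) would give them for the three datasets in the question.
kinds = [dtype_to_kind(name) for name in ("float32", "uint8", "float32")]
print(kinds)  # ['float', 'uint', 'float']
```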

Writing the data to the table then finally works as suggested:

sample = table.row
sample[key] = my_array
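
Putting the pieces together, a self-contained round trip might look like the sketch below. It is not the original merge script: the element shapes are shrunk to 8x8 so it runs quickly, the file goes to a temporary directory, and the columns are built with from_type (from the edit note) so that float32 and uint8 are preserved.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical, deliberately small stand-ins for the real datasets.
dset_keys = ["data", "label", "weights"]
dtypes = [np.dtype("float32"), np.dtype("uint8"), np.dtype("float32")]
element_shapes = [(8, 8, 4), (8, 8, 1), (8, 8, 1)]

# One array-valued column per dataset; dtype.name is e.g. 'float32'.
description = {
    key: tables.Col.from_type(dtype.name, shape=shape)
    for key, dtype, shape in zip(dset_keys, dtypes, element_shapes)
}

path = os.path.join(tempfile.mkdtemp(), "merged_demo.h5")
with tables.open_file(path, mode="w", title="merged") as h5file:
    group = h5file.create_group("/", "main", "Node for data table")
    table = h5file.create_table(group, "data_table", description)

    # Each row holds one element of every dataset.
    sample = table.row
    for i in range(3):
        for key, dtype, shape in zip(dset_keys, dtypes, element_shapes):
            sample[key] = np.full(shape, i, dtype=dtype)
        sample.append()
    table.flush()

# Read everything back as a NumPy structured array.
with tables.open_file(path, mode="r") as h5file:
    stored = h5file.root.main.data_table.read()

print(stored["data"].shape, stored["data"].dtype)
```

Reading a single field of the structured array returns all rows stacked along a new leading axis, which mirrors the (n_elements, ...) layout of the source datasets.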

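Because all of this feels a bit "hacky" and is not well documented, I still wonder whether this is not the intended use of PyTables, so I'll leave this question open to see whether someone knows more about this...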

Answer 2

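I know it's a bit late, but I think the answer to your question lies in the shape argument of Float32Col.

Here is how it is used in the documentation: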


from tables import *
from numpy import *

# Describe a particle record
class Particle(IsDescription):
    name        = StringCol(itemsize=16)  # 16-character string
    lati        = Int32Col()              # integer
    longi       = Int32Col()              # integer
    pressure    = Float32Col(shape=(2,3)) # array of floats (single-precision)
    temperature = Float64Col(shape=(2,3)) # array of doubles (double-precision)

# Open a file in "w"rite mode
fileh = open_file("tutorial2.h5", mode = "w")

# Get the HDF5 root group
root = fileh.root

# Create the groups:
for groupname in ("Particles", "Events"):
    group = fileh.create_group(root, groupname)

# Now, create and fill the tables in Particles group
gparticles = root.Particles

# Create 3 new tables
for tablename in ("TParticle1", "TParticle2", "TParticle3"):
    # Create a table
    table = fileh.create_table("/Particles", tablename, Particle, "Particles: "+tablename)

    # Get the record object associated with the table:
    particle = table.row

    # Fill the table with 257 particles
    for i in range(257):
        # First, assign the values to the Particle record
        particle['name'] = 'Particle: %6d' % (i)
        particle['lati'] = i
        particle['longi'] = 10 - i

        ########### Detectable errors start here. Play with them!
        #particle['pressure'] = array(i*arange(2*3)).reshape((2,4)) # Incorrect
        particle['pressure'] = array(i*arange(2*3)).reshape((2,3))  # Correct
        ########### End of errors

        particle['temperature'] = (i**2)     # Broadcasting

        # This injects the Record values
        particle.append()

    # Flush the table buffers
    table.flush()

# Close the file once all tables have been written
fileh.close()

Here is a link to the part of the documentation I am referring to: https://www.pytables.org/usersguide/tutorials.html
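
To tie this back to the question, the following sketch compares the column the question's code would build with a shaped column. It only inspects the column descriptions and writes no file; the shape (2, 3) is borrowed from the tutorial excerpt.

```python
import numpy as np
import tables

# What the question's code built: a scalar column, one float32 per row.
scalar_col = tables.Col.from_dtype(np.dtype("float32"))

# What this answer points at: the same type with an explicit per-row shape.
array_col = tables.Float32Col(shape=(2, 3))

print("scalar column shape:", scalar_col.shape)
print("array column shape: ", array_col.shape)
print("array column dtype: ", array_col.dtype)
```

A column with an empty shape only accepts a single value per row, which would explain the TypeError when an ndarray is assigned to it.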
