Python/PyTables: Is it possible to have different data types for different columns of an array?

Question

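I created an extendable EArray with Nx4 columns. Some columns require the float64 data type, while the others can be managed with int32. Is it possible to vary the data type between columns? Right now I use a single type (float64, below) for everything, but it takes a huge amount of disk space for (>10 GB) files.

For example, how can I ensure that the elements of columns 1-2 are int32 and the elements of columns 3-4 are float64?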

import tables
f1 = tables.open_file("table.h5", "w")
# one atom (float64) for all four columns
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))

Here is a simplified version of how I append using the EArray:

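import numpy as np

# counter, s, length, left, right and chunk2 are defined elsewhere (not shown)
Matrix = np.ones(shape=(10**6, 4))  # in-memory buffer of 10**6 rows

if counter <= 10**6:  # keep filling Matrix until it holds 10**6 rows
    Matrix[s:s+length, 0:4] = chunk2[left:right]  # chunk2 is the input np.ndarray
    s += length

# save to disk once rows = 10**6, then start a fresh buffer
if counter > 10**6:
    a.append(Matrix[:s])
    del Matrix
    Matrix = np.ones(shape=(10**6, 4))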

What are the drawbacks of the following approach?

import tables as tb
import numpy as np

filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))

# array containing ints..in reality it will be 10**6x2
arr1 = np.array([[1, 1],
                [2, 2],
                [3, 3]], dtype=np.int32)

# array containing floats..in reality it will be 10**6x2
arr2 = np.array([[1.1,1.2],
                 [1.1,1.2],
                 [1.1,1.2]], dtype=np.float64)

for i in range(3):
    int_app.append(arr1)
    float_app.append(arr2)

f.close()

print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()  # all int32 rows
chunk2 = chunks2.read()  # all float64 rows
print(chunk1)
print(chunk2)
c.close()

Answer 1

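No and yes. All PyTables array types (Array, CArray, EArray, VLArray) are for homogeneous data types (similar to a NumPy ndarray). If you want to mix data types, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.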
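To get a rough feel for what a mixed row layout could save, here is a minimal sketch that compares raw row sizes with NumPy only. The column names are hypothetical, and the figures cover raw data only; they ignore HDF5 overhead and any compression.

```python
import numpy as np

# Hypothetical column names; the layout mirrors the question (2 ints, 2 floats).
mixed_dt = np.dtype([('int1', np.int32), ('int2', np.int32),
                     ('float1', np.float64), ('float2', np.float64)])

n_rows = 10**6
# Four float64 values per row, as in the homogeneous EArray.
uniform_bytes = n_rows * 4 * np.dtype(np.float64).itemsize
# Two int32 plus two float64 values per row, as in a mixed-type Table row.
mixed_bytes = n_rows * mixed_dt.itemsize

print(uniform_bytes, mixed_bytes)
```

Under these assumptions a row shrinks from 32 to 24 bytes, so the raw saving is about 25%.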

The creation process is similar to this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the datatypes for a row. You don't define the shape or the number of rows; that is implied as you add data to the table. If you already have your data in a NumPy recarray, you can reference it with the description= entry, and the Table will use its dtype for the table and populate it with the data. More info here: PyTables Tables Class

Your code would look like this:
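As a small illustration of the description= shortcut mentioned above, the sketch below passes an existing structured array directly to create_table(). The sample data, column names, and temporary file path are made up for the example.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical sample data: a small structured array with mixed column types.
data = np.array([(1, 10, 0.5, 1.5),
                 (2, 20, 2.5, 3.5),
                 (3, 30, 4.5, 5.5)],
                dtype=[('int1', np.int32), ('int2', np.int32),
                       ('float1', np.float64), ('float2', np.float64)])

path = os.path.join(tempfile.mkdtemp(), 'from_recarray.h5')  # throwaway file
with tb.open_file(path, 'w') as h5f:
    # Passing the array itself as description= supplies both the dtype and the rows.
    table = h5f.create_table('/', 'dataset_1', description=data)
    nrows = table.nrows
    col_types = dict(table.coltypes)  # column name -> PyTables type name
    first_row = table[0]

print(nrows, col_types)
```

No separate .append() call is needed here, because the table is populated at creation time.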

import tables as tb
import numpy as np

# Row layout: two int32 columns followed by two float64 columns.
# Explicit NumPy types are used because plain int/float map to the
# platform defaults (int is often int64, which would not save any space).
table_dt = np.dtype(
           {'names': ['int1', 'int2', 'float1', 'float2'],
            'formats': [np.int32, np.int32, np.float64, np.float64]})
# Create some random data:
i1 = np.random.randint(0, 1000, (10**6,), dtype=np.int32)
i2 = np.random.randint(0, 1000, (10**6,), dtype=np.int32)
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)

with tb.open_file('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)

    # Method 1: create an empty recarray 'Matrix', then add data:
    Matrix = np.recarray((10**6,), dtype=table_dt)
    Matrix['int1'] = i1
    Matrix['int2'] = i2
    Matrix['float1'] = f1
    Matrix['float2'] = f2
    # Append Matrix to the table
    a.append(Matrix)

    # Method 2: create recarray 'Matrix' with data in 1 step:
    Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
    # Append Matrix to the table (the table now holds 2 * 10**6 rows)
    a.append(Matrix)
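The question appends data in large chunks rather than all at once. A minimal sketch of the same pattern with a Table might look like the following; fake_chunks() is a hypothetical stand-in for the real input, and the chunk sizes and file path are arbitrary.

```python
import os
import tempfile

import numpy as np
import tables as tb

table_dt = np.dtype([('int1', np.int32), ('int2', np.int32),
                     ('float1', np.float64), ('float2', np.float64)])

def fake_chunks(n_chunks, rows_per_chunk):
    """Hypothetical stand-in for the real input: yields (rows, 4) float64 arrays."""
    rng = np.random.default_rng(0)
    for _ in range(n_chunks):
        chunk = rng.random((rows_per_chunk, 4))
        chunk[:, :2] = np.floor(chunk[:, :2] * 1000)  # integer-valued first two columns
        yield chunk

path = os.path.join(tempfile.mkdtemp(), 'chunked.h5')  # throwaway file
with tb.open_file(path, 'w') as h5f:
    table = h5f.create_table('/', 'dataset_1', description=table_dt)
    for chunk in fake_chunks(n_chunks=5, rows_per_chunk=1000):
        # Repack the homogeneous chunk into a structured buffer, column by column.
        buf = np.empty(len(chunk), dtype=table_dt)
        buf['int1'] = chunk[:, 0]    # float -> int32 cast happens on assignment
        buf['int2'] = chunk[:, 1]
        buf['float1'] = chunk[:, 2]
        buf['float2'] = chunk[:, 3]
        table.append(buf)
    table.flush()
    total_rows = table.nrows

print(total_rows)
```

Only one chunk is held in memory at a time, so the buffer size can be tuned independently of the final table size.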

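You mentioned creating a very large file, but did not say how many rows (obviously far more than 10**6). Here are some additional thoughts based on comments in another thread.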

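The .create_table() method has an optional parameter: expectedrows=. This parameter is used "to optimize the HDF5 B-Tree and amount of memory used". The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it is only 10000 in my installation). If you are creating 10**6 (or more) rows, I highly suggest you set this to a larger value.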
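A minimal sketch of passing expectedrows= is shown below. The value 10**7, the node names, and the file path are assumptions for illustration; the chunk shapes PyTables derives from the hint may differ between versions, so they are only printed, not relied upon.

```python
import os
import tempfile

import numpy as np
import tables as tb

table_dt = np.dtype([('int1', np.int32), ('int2', np.int32),
                     ('float1', np.float64), ('float2', np.float64)])

path = os.path.join(tempfile.mkdtemp(), 'expected_rows.h5')  # throwaway file
with tb.open_file(path, 'w') as h5f:
    # Table created with the library default for expectedrows.
    default_table = h5f.create_table('/', 'default_rows', description=table_dt)
    # Table created with a hint that roughly 10**7 rows will be stored.
    big_table = h5f.create_table('/', 'many_rows', description=table_dt,
                                 expectedrows=10**7)
    default_chunkshape = default_table.chunkshape
    big_chunkshape = big_table.chunkshape

print('default:', default_chunkshape, 'with hint:', big_chunkshape)
```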

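Also, you should consider file compression. There is a trade-off: compression reduces the file size, but reduces I/O performance (it increases access time). There are a few options: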

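Enable compression when you create the file (add the filters= parameter when creating the file). Start with tb.Filters(complevel=1).
Use the HDF Group utility h5repack - run it against an HDF5 file to create a new file (useful for going from uncompressed to compressed or vice-versa).
Use the PyTables utility ptrepack - works similarly to h5repack and is delivered with PyTables.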
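The first option can be sketched as follows. The zlib library choice, the sample data, and the file paths are assumptions for the example; the actual size reduction depends heavily on the data, so the sizes are printed rather than promised.

```python
import os
import tempfile

import numpy as np
import tables as tb

table_dt = np.dtype([('int1', np.int32), ('int2', np.int32),
                     ('float1', np.float64), ('float2', np.float64)])

# Hypothetical sample data: 100000 rows of zeros with a counter in 'int1'.
data = np.zeros(100_000, dtype=table_dt)
data['int1'] = np.arange(100_000) % 1000

tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, 'plain.h5')
packed_path = os.path.join(tmpdir, 'packed.h5')

# Uncompressed reference file.
with tb.open_file(plain_path, 'w') as h5f:
    h5f.create_table('/', 'dataset_1', description=table_dt).append(data)

# Same data, with light zlib compression set as the file-wide default.
filters = tb.Filters(complevel=1, complib='zlib')
with tb.open_file(packed_path, 'w', filters=filters) as h5f:
    table = h5f.create_table('/', 'dataset_1', description=table_dt)
    table.append(data)
    used_complevel = table.filters.complevel
    used_complib = table.filters.complib

print(os.path.getsize(plain_path), os.path.getsize(packed_path))
```

Setting filters= on the file makes it the default for nodes created under the root group; it can also be passed to create_table() for a single table.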

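I tend to keep files that I work with often uncompressed, for the best I/O performance. Then, when I am done, I convert them to a compressed format for long-term archiving.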
