Memory-efficient lookup table of DataFrames in Python

Question

When I asked an earlier question, one respondent suggested that I organize my data as a DataFrame of DataFrames.

import pandas as pd

df = pd.DataFrame({'Form': {0:'SUV', 1:'Truck', 2:'SUV', 3:'Sedan', 4:'SUV', 5:'Truck'},
                   'Make': {0:'Ford', 1:'Toyota', 2:'Honda', 3:'Ford', 4:'Honda', 5:'Toyota'},
                   'Color': {0:'White', 1:'Black', 2:'Gray', 3:'White', 4:'White', 5:'Black'},
                   'Driver age': {0:25, 1:37, 2:21, 3:54, 4:50, 5:67},
                   'Data': {0: pd.DataFrame([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]]), 
                            1: pd.DataFrame([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]]), 
                            2: pd.DataFrame([[0, 0], [0.24, 1.2], [1.3, 1.6], [4.1, 3.9]]), 
                            3: pd.DataFrame([[0, 0], [0.45, 1.6], [1.8, 1.8], [4.2, 4.6]]), 
                            4: pd.DataFrame([[0, 0], [0.85, 1.9], [1.5, 1.7], [4.5, 4.3]]), 
                            5: pd.DataFrame([[0, 0], [0.35, 1.8], [1.5, 1.8], [4.6, 4.1]])} })

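This DataFrame of DataFrames lets me conditionally select groups of data, e.g. df[(df['Form'] == 'SUV') & (df['Driver age'] <= 40)]['Data']. The problem is that when the data in each row is itself a large .csv, it becomes hard to load everything into memory.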

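I am looking for a module like h5py that can "stream"/read a specific part of the data (allowing a key to be specified, e.g. df = pd.read_hdf('large_data.hdf', 'SUV-Ford-White-25')), except that instead of a nested dictionary I would prefer a table that allows filtering, e.g. df = module.read(large_data.some_ext, form == 'SUV', 20 <= age <= 40). Does xarray or pandas have something built in for this?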

Answer 1

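Like h5py, PyTables (aka tables) can create and read HDF5 files. Pandas uses PyTables "under the hood" to create and read HDF5 files. PyTables has some useful search capabilities that do exactly what you want. For completeness, I have included a short summary at the end of this answer comparing the two packages.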

Here is an example I created to demonstrate the search behavior using the data from your DataFrame (dictionary).

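Create the HDF5 file:
Note: most of the "work" in creating the HDF5 file is (re)organizing your dictionary data into a NumPy recarray. If you can modify the data structure (move the dictionary key/value levels), the process can be simplified, assuming the structure is not already fixed.
Summary of steps: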

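1. Create a np.dtype to define the fields (columns) of the data.
2. Determine the number of recarray rows by counting the dictionary items associated with each primary key.
3. Create a recarray of zeros using the results of steps 1 and 2.
4. Loop over the dictionary and map the keys and values to the appropriate row and field (column) names.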

Code below:

import tables as tb
import numpy as np

data_dict = {'Form': {0:'SUV', 1:'Truck', 2:'SUV', 3:'Sedan', 4:'SUV', 5:'Truck'},
                   'Make': {0:'Ford', 1:'Toyota', 2:'Honda', 3:'Ford', 4:'Honda', 5:'Toyota'},
                   'Color': {0:'White', 1:'Black', 2:'Gray', 3:'White', 4:'White', 5:'Black'},
                   'Driver_age': {0:25, 1:37, 2:21, 3:54, 4:50, 5:67},
                   'Data': {0: np.array([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]]), 
                            1: np.array([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]]), 
                            2: np.array([[0, 0], [0.24, 1.2], [1.3, 1.6], [4.1, 3.9]]), 
                            3: np.array([[0, 0], [0.45, 1.6], [1.8, 1.8], [4.2, 4.6]]), 
                            4: np.array([[0, 0], [0.85, 1.9], [1.5, 1.7], [4.5, 4.3]]), 
                            5: np.array([[0, 0], [0.35, 1.8], [1.5, 1.8], [4.6, 4.1]])} }

recarr_dt = np.dtype( [ ('Form','S10'), ('Make','S10') , ('Color','S10'),
                        ('Driver_age',int), ('Data',float, (4,2)) ] )
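# Number of rows = largest number of items under any top-level key
nrows = 0
for k, d in data_dict.items():
    nrows = max(nrows, len(d))

# Allocate a zero-filled structured array with one record per row
recarr = np.zeros(shape=(nrows,), dtype=recarr_dt)

# k1 is the field (column) name, k2 is the row index, v2 is the value
for k1, v1 in data_dict.items():
    for k2, v2 in v1.items():
        recarr[k2][k1] = v2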
        
with tb.File('SO_71388372.h5','w') as h5w:
    h5w.create_table('/', 'test', obj=recarr)
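The note above says the recarray construction gets simpler if the dictionary levels are rearranged. As a minimal sketch of that idea, assume the data can be held row-first, as one tuple per record in the same order as the dtype fields. The `rows` layout here is hypothetical and not part of the original data. NumPy can then build the structured array in a single call, with no row counting and no nested loop:

```python
import numpy as np

# Same field layout as recarr_dt in the code above.
recarr_dt = np.dtype([('Form', 'S10'), ('Make', 'S10'), ('Color', 'S10'),
                      ('Driver_age', int), ('Data', float, (4, 2))])

# Hypothetical row-oriented layout: one tuple per record, in dtype field order.
rows = [
    (b'SUV', b'Ford', b'White', 25,
     np.array([[0, 0], [0.25, 1.7], [1.2, 1.8], [4.5, 4.0]])),
    (b'Truck', b'Toyota', b'Black', 37,
     np.array([[0, 0], [0.15, 1.3], [1.6, 1.3], [4.2, 4.1]])),
]

# A list of tuples plus a structured dtype yields one record per tuple.
recarr = np.array(rows, dtype=recarr_dt)
```

The resulting `recarr` can be passed to `create_table(..., obj=recarr)` exactly as in the code above.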

Open and search the HDF5 file:
This example demonstrates 2 searches using the Table.read_where(condition) method. It shows the syntax for multiple search conditions. Some notes:

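Multiple conditions require parentheses.
Chained (compound) conditions are not supported: (20 <= Driver_age <= 40) must be written as 2 conditions.
Strings are entered as b"text" (because these HDF5 strings are stored as bytes, not Unicode).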

Code below:

import tables as tb
with tb.File('SO_71388372.h5','r') as h5r:
    data_tbl = h5r.root.test
    
    condition = '(Form == b"SUV") & (20 <= Driver_age) & (Driver_age <= 40)'
    data_arr = data_tbl.read_where(condition)
    print(f'\nFor search condition: {condition}')
    print(f'# of rows found: {data_arr.shape}')
    for row in data_arr:
        print(row)
        
    condition = '(Form == b"SUV") & (Make == b"Honda")'
    data_arr = data_tbl.read_where(condition)
    print(f'\nFor search condition: {condition}')
    print(f'# of rows found: {data_arr.shape}')
    for row in data_arr:
        print(row)
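Since the question is about memory use, two variations on the search above may help. This is a sketch under assumptions, not the only way to do it: the file name `demo_lookup.h5` and the 3-row stand-in table are made up so the block runs on its own. `read_where()` accepts a `field` argument to return a single column, and `condvars` to pass Python values into the condition. `Table.where()` returns an iterator, so matching rows are visited one at a time instead of being collected into one array.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Small stand-in table with the same fields as the example above.
dt = np.dtype([('Form', 'S10'), ('Make', 'S10'), ('Color', 'S10'),
               ('Driver_age', int), ('Data', float, (4, 2))])
recarr = np.zeros(3, dtype=dt)
recarr['Form'] = [b'SUV', b'Truck', b'SUV']
recarr['Make'] = [b'Ford', b'Toyota', b'Honda']
recarr['Color'] = [b'White', b'Black', b'Gray']
recarr['Driver_age'] = [25, 37, 21]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_lookup.h5')  # hypothetical file name
    with tb.open_file(path, 'w') as h5w:
        h5w.create_table('/', 'test', obj=recarr)

    with tb.open_file(path, 'r') as h5r:
        tbl = h5r.root.test
        # lo and hi are supplied through condvars instead of being hard-coded.
        condition = '(Form == b"SUV") & (lo <= Driver_age) & (Driver_age <= hi)'
        condvars = {'lo': 20, 'hi': 40}

        # Return only the 'Data' field of the matching rows.
        data_only = tbl.read_where(condition, condvars=condvars, field='Data')

        # Iterate over matches lazily; copy values out while the row is current.
        makes = [row['Make'] for row in tbl.where(condition, condvars=condvars)]
```

Here `data_only` is a plain NumPy array holding one (4, 2) block per matching row, and `makes` is a list of byte strings.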

Here is a summary of each package, extracted from their respective FAQ pages.

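PyTables (from the PyTables FAQ):
Builds an additional abstraction layer on top of HDF5 and NumPy. It has an engine that supports complex queries, efficient computational kernels, and advanced indexing capabilities. It has a custom system to represent data types that are available in the HDF5 library but not in NumPy.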

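h5py (from the h5py FAQ):
Attempts to map the HDF5 feature set to NumPy as closely as possible. It also provides access to nearly all of the HDF5 C API. The high-level type system uses NumPy dtype objects exclusively, and method and attribute naming follows Python and NumPy conventions for dictionary and array access.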
