Save pandas DataFrame using h5py for interoperability with other hdf5 readers
Here is a sample data frame:
import pandas as pd
NaN = float('nan')
ID = [1, 2, 3, 4, 5, 6, 7]
A = [NaN, NaN, NaN, 0.1, 0.1, 0.1, 0.1]
B = [0.2, NaN, 0.2, 0.2, 0.2, NaN, NaN]
C = [NaN, 0.5, 0.5, NaN, 0.5, 0.5, NaN]
columns = {'A':A, 'B':B, 'C':C}
df = pd.DataFrame(columns, index=ID)
df.index.name = 'ID'
print(df)
      A    B    C
ID
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN
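I am aware that pandas has the pytables-based HDFStore, which is an easy way to efficiently serialize/deserialize a data frame. But those datasets are not very easy to load directly using the h5py reader or matlab. How can I save a data frame using h5py, so that I can easily load it back using another hdf5 reader?
Here is my approach to solving this problem. I am hoping that either someone else has a better solution or that my approach is helpful to others.
First, define a function to make a numpy structured array (not a record array) from a pandas DataFrame.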
import numpy as np

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    Similar in effect to df.to_records(index=False), but returns a
    plain structured array rather than a record array.

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    v = df.values
    cols = df.columns
    # Field names must be str (not bytes) on Python 3, so use str(k) rather than k.encode()
    types = [(str(k), df[k].dtype.type) for k in cols]
    dtype = np.dtype(types)
    z = np.zeros(v.shape[0], dtype)
    # Copy the data column by column into the matching field
    for (i, k) in enumerate(z.dtype.names):
        z[k] = v[:, i]
    return z
Use reset_index to make a new data frame with the index as part of its data. Convert that data frame to a structured array.
sa = df_to_sarray(df.reset_index())
sa
array([(1L, nan, 0.2, nan), (2L, nan, nan, 0.5), (3L, nan, 0.2, 0.5),
       (4L, 0.1, 0.2, nan), (5L, 0.1, 0.2, 0.5), (6L, 0.1, nan, 0.5),
       (7L, 0.1, nan, nan)],
      dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
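Save that structured array to an hdf5 file.
import h5py
with h5py.File('mydata.h5', 'w') as hf:
    hf['df'] = sa
Load the h5 dataset
# Open read-only explicitly; the default mode differs between h5py versions
with h5py.File('mydata.h5', 'r') as hf:
    sa2 = hf['df'][:]
Extract the ID column and delete it from sa2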
import numpy.lib.recfunctions as nprec
ID = sa2['ID']
sa2 = nprec.drop_fields(sa2, 'ID')
Make a data frame with index ID using sa2
df2 = pd.DataFrame(sa2, index=ID)
df2.index.name = 'ID'
print(df2)
      A    B    C
ID
1   NaN  0.2  NaN
2   NaN  NaN  0.5
3   NaN  0.2  0.5
4   0.1  0.2  NaN
5   0.1  0.2  0.5
6   0.1  NaN  0.5
7   0.1  NaN  NaN
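As a possible shortcut for the last two steps, `pd.DataFrame.from_records` accepts a field name as its `index` argument, so the ID field can become the index without `drop_fields`. The sketch below uses a small made-up structured array as a stand-in for `sa2` (the values are illustrative, not the data above).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the structured array read back from the HDF5 file.
sa2 = np.array(
    [(1, np.nan, 0.2), (2, 0.1, np.nan), (3, 0.1, 0.2)],
    dtype=[('ID', '<i8'), ('A', '<f8'), ('B', '<f8')],
)

# The 'ID' field is used as the index and removed from the columns.
df2 = pd.DataFrame.from_records(sa2, index='ID')
print(df2)
```

This keeps the index name, so the separate `df2.index.name = 'ID'` assignment should not be needed with this variant.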
The pandas HDFStore format is standard HDF5, with just a convention for how to interpret the metadata. It is described in the pandas documentation.
In [54]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)
In [55]: h = h5py.File('test.h5')
In [56]: h['df']['table']
Out[56]: <HDF5 dataset "table": shape (7,), type "|V32">
In [64]: h['df']['table'][:]
Out[64]:
array([(1, nan, 0.2, nan), (2, nan, nan, 0.5), (3, nan, 0.2, 0.5),
       (4, 0.1, 0.2, nan), (5, 0.1, 0.2, 0.5), (6, 0.1, nan, 0.5),
       (7, 0.1, nan, nan)],
      dtype=[('index', '<i8'), ('A', '<f8'), ('B', '<f8'), ('C', '<f8')])
In [57]: h['df']['table'].attrs.items()
Out[57]:
[(u'CLASS', 'TABLE'),
 (u'VERSION', '2.7'),
 (u'TITLE', ''),
 (u'FIELD_0_NAME', 'index'),
 (u'FIELD_1_NAME', 'A'),
 (u'FIELD_2_NAME', 'B'),
 (u'FIELD_3_NAME', 'C'),
 (u'FIELD_0_FILL', 0),
 (u'FIELD_1_FILL', 0.0),
 (u'FIELD_2_FILL', 0.0),
 (u'FIELD_3_FILL', 0.0),
 (u'index_kind', 'integer'),
 (u'A_kind', "(lp1\nS'A'\na."),
 (u'A_meta', 'N.'),
 (u'A_dtype', 'float64'),
 (u'B_kind', "(lp1\nS'B'\na."),
 (u'B_meta', 'N.'),
 (u'B_dtype', 'float64'),
 (u'C_kind', "(lp1\nS'C'\na."),
 (u'C_meta', 'N.'),
 (u'C_dtype', 'float64'),
 (u'NROWS', 7)]
In [58]: h.close()
The data will be completely readable in any HDF5 reader. Some of the metadata is pickled, so care must be taken.
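To illustrate what "pickled" means here, the `A_kind` and `A_meta` values shown in the session are ordinary Python pickles and can be decoded with the standard `pickle` module. This is only a sketch: it assumes the attribute values are available as bytes (depending on the h5py version they may come back as `str` and need encoding first), and the variable names are made up for the example.

```python
import pickle

# Attribute values copied from the session above, written as bytes literals.
a_kind_raw = b"(lp1\nS'A'\na."
a_meta_raw = b"N."

# Only unpickle data you trust: pickle.loads can execute arbitrary code.
a_kind = pickle.loads(a_kind_raw)
a_meta = pickle.loads(a_meta_raw)

print(a_kind)  # ['A']
print(a_meta)  # None
```

A reader written in another language cannot decode these attributes easily, which is one reason to treat them as pandas-internal and rely on the table data itself.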
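In case it is helpful for anyone, I took the approach from Guillaume and Phil and changed it a bit for my needs with the help of ankostis. We read the pandas DataFrame from a CSV file.
Mainly I adapted it for strings, because you cannot store an object in an HDF5 file (I believe). First check which column types are numpy objects. Then find the longest length in each such column, and fix that column to be a string of that length. The rest is quite similar to the other post.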
import numpy as np

def df_to_sarray(df):
    """
    Convert a pandas DataFrame object to a numpy structured array.
    Also, for every column of a str type, convert it into
    a 'bytes' str literal of length = max(len(col)).

    :param df: the data frame to convert
    :return: a numpy structured array representation of df
    """
    def make_col_type(col_type, col):
        try:
            if 'numpy.object_' in str(col_type.type):
                # Object column: size the field to the longest string in it
                maxlens = col.dropna().str.len()
                if maxlens.any():
                    maxlen = int(maxlens.max())
                    # Plain 'S<n>' string; the ('S<n>', 1) tuple form is deprecated in numpy
                    col_type = 'S%d' % maxlen
                else:
                    col_type = 'f2'
            return col.name, col_type
        except Exception:
            print(col.name, col_type, col_type.type, type(col))
            raise

    v = df.values
    types = df.dtypes
    numpy_struct_types = [make_col_type(types[col], df.loc[:, col]) for col in df.columns]
    dtype = np.dtype(numpy_struct_types)
    z = np.zeros(v.shape[0], dtype)
    for (i, k) in enumerate(z.dtype.names):
        # This is in case you have problems with the encoding, remove the if branch if not
        try:
            if dtype[i].str.startswith('|S'):
                z[k] = df[k].str.encode('latin').astype('S')
            else:
                z[k] = v[:, i]
        except Exception:
            print(k, v[:, i])
            raise

    return z, dtype
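The string handling inside the function can be looked at in isolation. The following is a minimal sketch, assuming a column of short Latin-1 encodable strings with no missing values; the series `names` and its contents are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical object (string) column.
names = pd.Series(['ab', 'cdef', 'g'], name='name')

# Longest string decides the width of the fixed-length bytes field.
maxlen = int(names.str.len().max())

# Encode each string to bytes, then cast to a fixed-width 'S<n>' dtype.
encoded = names.str.encode('latin-1').to_numpy().astype('S%d' % maxlen)

print(encoded.dtype)  # |S4
print(encoded)
```

Latin-1 uses one byte per character, so the character count equals the byte count; with a multi-byte encoding such as UTF-8 the field width would have to be computed from the encoded bytes instead.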
So the workflow would be:
import h5py
import pandas as pd
# Read a CSV file
# Here we assume col_dtypes is a dictionary that contains the dtypes of the columns
df = pd.read_table('./data.csv', sep='\t', dtype=col_dtypes)
# Transform the DataFrame into a structured numpy array and get the dtype
sa, saType = df_to_sarray(df)
# Open/create the HDF5 file
f = h5py.File('test.hdf5', 'a')
# Save the structured array
f.create_dataset('someData', data=sa, dtype=saType)
# Retrieve it and check it is ok when you transform it into a pandas DataFrame
sa2 = f['someData'][:]
df2 = pd.DataFrame(sa2)
print(df2.head())
f.close()
Also, in this way you are able to view it from HDFView even when using gzip compression.
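For reference, gzip compression can be requested through the `compression` argument of h5py's `create_dataset`. The sketch below is an assumption about how it might be combined with a structured array like the one above; the file path, dataset contents, and compression level are made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical structured array with an integer, a fixed-length string and a float field.
sa = np.zeros(1000, dtype=[('ID', '<i8'), ('name', 'S8'), ('A', '<f8')])
sa['ID'] = np.arange(1000)
sa['name'] = b'sample'

path = os.path.join(tempfile.mkdtemp(), 'compressed.h5')

# Write the array with gzip compression (level 4 chosen arbitrarily).
with h5py.File(path, 'w') as f:
    f.create_dataset('someData', data=sa, compression='gzip', compression_opts=4)

# Read it back; decompression is transparent to the reader.
with h5py.File(path, 'r') as f:
    dset = f['someData']
    compression = dset.compression
    restored = dset[:]

print(compression, restored.dtype.names)
```

Because gzip is one of the filters shipped with the HDF5 library itself, other readers generally do not need extra plugins to open such a file.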