将 Pandas DataFrames 保存为 HDF5 存储,各种错误
Saving Pandas DataFrames as a HDF5 store, various errors
只想在 HDF5 存储(.h5 文件)中归档一些 Pandas 数据帧。下面是我正在使用的代码。
# Fake data over N runs
Data_N = []
for n in range(5):
Data_N.append(np.random.randn(5000,15,125))
# Create HDFStore object
store = pd.HDFStore('test.h5')
# For each run:
for n in range(len(Data_N)):
Data = Data_N[n]
# Pandas DataFrame for "flattened" fake data
Data_subDFs = []
nanbuff = np.nan*np.zeros((1,len(Data[0,0])))
for i in range(len(Data)):
Data_i = np.vstack((nanbuff,Data[i,:,:]))
Data_subDFs.append(pd.DataFrame(data = Data_i))
Data_DF = pd.concat(Data_subDFs)
# Row and column labels for the DataFrame
Data_rows = []
for i in range(len(Data)):
Data_rows.append(['Layer %d:' % (i+1)] + range(1,len(Data[0])+1))
Data_DF.index = sum(Data_rows,[])
Data_DF.columns = range(1,len(Data[0,0])+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (n+1), Data_DF)
#store.put('Data_DF_%d' % (n+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (n+1), Data_DF, format='table', data_columns=True)
# Save the HDF5 file
store.close()
这给出了以下输出:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis1] [items->None]
如果我使用 put 的第二个版本,它给出:
TypeError: Passing an incorrect value to a table column. Expected a Col (or subc
lass) instance and got: "ObjectAtom()". Please make use of the Col(), or descend
ant, constructor to properly initialize columns.
如果我使用第三个版本的 put,它给出:
ValueError: cannot have non-object label DataIndexableCol
有人可以解释一下不同的版本吗,以及为什么我无法在没有酸洗的情况下将我认为有效的 Pandas DataFrame 保存在 HDF5 中?
如果有帮助,我认为我不需要能够附加 DataFrame/store。我只想要使用 Pandas HDF5 接口保存 DF 的最佳性能方式。
谢谢!
编辑 1:
我把"For each run:"之后的代码更新成这个
# For each run:
for run in range(len(Data_N)):
Data = Data_N[run]
l = len(Data)
m = len(Data[0])
n = len(Data[0,0])
# Pandas DataFrame for "flattened" fake data
Data_subDFs = []
for i in range(len(Data)):
Data_i = Data[i,:,:]
Data_subDFs.append(pd.DataFrame(data = Data_i))
Data_DF = pd.concat(Data_subDFs)
# Row and column labels for the DataFrame
L1 = np.zeros((l*m,1), dtype=object) # Layer number
L2 = np.zeros((l*m,1), dtype=object) # Row number
for i in range(l):
for j in range(m):
L1[i*m + j,0] = 'Layer %d' % (i+1)
L2[i*m + j,0] = '%d' % (j+1)
Data_DF.index = np.hstack((L1,L2))
Data_DF.columns = range(1,n+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (run+1), Data_DF)
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)
但是对于每个 put 行,这会给出相同的警告或错误。
编辑 2(有效!):
# For each run:
for run in range(len(Data_N)):
Data = Data_N[run]
l = len(Data)
m = len(Data[0])
n = len(Data[0,0])
# Pandas DataFrame for "flattened" fake data
Data_DF = pd.DataFrame(Data.reshape(l*m,n))
# Layer and row labels
layers = np.arange(1,l+1)
rows = np.arange(1,m+1)
# Pandas multi-index
mindex = pd.MultiIndex.from_product([layers,rows], names=['Layer','Row'])
# DataFrame multi-index and column labels
Data_DF.index = mindex
Data_DF.columns = range(1,n+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (run+1), Data_DF)
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)
第三行仍然给出同样的错误,但由于第二行有效,我假设第三行在这种情况下只是一个无效命令。
第二条线也比第一条线快很多,而且都比酸洗路线快得多。谢谢!
更新:
这是一个小演示:
设置:
data = np.random.randn(5,10,5)
index = pd.MultiIndex.from_product([np.arange(1, len(data)+1),
np.arange(1,len(data[0])+1)], names=['Layer','No'])
df = pd.DataFrame(data.reshape(data.shape[0] * data.shape[1], data.shape[2]),
index=index)
数据:
In [82]: df
Out[82]:
0 1 2 3 4
Layer No
1 1 1.167144 0.640303 0.059197 -1.637180 0.667196
2 2.150872 -0.825325 -0.332458 -1.307043 1.361330
3 -0.931299 -0.931882 0.153943 -0.446289 0.651594
4 -0.131500 -0.489745 1.264029 0.889779 1.081613
5 -0.479022 -1.516204 0.616170 0.126860 0.125559
6 1.114287 -0.939504 0.058869 0.321159 0.340881
7 -0.527516 -0.362337 -0.590430 -0.609017 1.835716
8 0.063372 0.000792 0.855485 -0.113592 0.890687
9 -0.160041 1.978954 0.778428 1.988354 2.095665
10 0.687911 0.115918 -0.653885 0.486365 -0.775659
2 1 -0.123350 0.674359 -0.120634 -1.350044 -0.176252
2 -1.986077 -0.846584 0.895982 0.236790 0.240023
3 0.878597 0.241594 0.405382 1.785109 1.228188
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
5 -1.424249 -0.183216 -0.044330 0.324894 -0.271179
6 -0.345720 -0.942421 0.538227 -0.558793 -1.075346
7 1.327952 -2.335520 -0.164645 1.489798 -0.876896
8 1.043723 0.770489 -1.052739 -0.830190 1.005406
9 0.789100 -0.706633 -1.014431 -1.164513 -0.266424
10 2.061175 0.933526 -1.601836 -1.542535 -1.220943
3 1 -0.061520 -0.932599 0.103480 -0.318529 -0.311965
2 -0.401409 -0.308739 -1.399233 -1.172032 -0.550774
3 0.670272 1.215724 0.711328 2.332297 -1.326704
4 0.377469 0.752313 -1.223832 0.431555 -0.901796
5 -2.386383 0.053921 -1.175427 -0.794099 -0.469374
6 0.951571 -2.220609 0.208136 -2.141828 0.010316
7 1.047133 0.924568 0.282091 1.367981 -0.617389
8 1.083008 -1.519416 0.535690 0.196885 -0.022692
9 1.307252 1.099716 0.766976 -0.466699 1.113605
10 -0.614214 0.702395 -0.131248 1.773092 0.241553
4 1 -1.280026 0.278248 -0.518560 -0.395394 0.434473
2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
3 -0.539773 -0.755483 -1.002526 0.198792 -0.120656
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
5 -0.777678 1.734322 -1.270129 0.160094 0.355290
6 -1.037775 -0.542944 -0.913428 0.885965 -0.155220
7 -0.855498 -0.330268 -1.903738 0.098101 -0.670830
8 0.786258 0.599100 -0.426781 0.425572 0.132932
9 -0.430497 -1.414292 -0.997637 0.696176 -0.480886
10 1.211665 -1.233842 0.137176 1.520013 -1.052884
5 1 -0.267698 -1.013917 -1.324896 -1.189835 -0.192396
2 1.047264 -0.454829 1.051039 1.565423 0.749844
3 0.159177 0.481088 0.711499 -1.217079 0.444402
4 0.254420 -0.114102 0.620231 1.890822 1.269808
5 0.673696 -0.321638 -0.887355 0.426549 -0.935591
6 -1.836808 0.450332 1.187512 -0.215318 -1.142346
7 -1.496568 0.633886 0.625143 0.295385 1.445084
8 -0.473427 -0.608318 -0.602080 0.134105 0.704027
9 2.319899 0.763272 0.861798 1.464612 -0.708869
10 -0.199555 0.721122 0.099777 -0.466488 0.923112
In [84]: df.index.levels
Out[84]: FrozenList([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
现在你可以像下面这样切片:
In [85]: idx = pd.IndexSlice
In [86]: df.loc[idx[[2,4], 2:5], :]
Out[86]:
0 1 2 3 4
Layer No
2 2 -1.986077 -0.846584 0.895982 0.236790 0.240023
3 0.878597 0.241594 0.405382 1.785109 1.228188
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
5 -1.424249 -0.183216 -0.044330 0.324894 -0.271179
4 2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
3 -0.539773 -0.755483 -1.002526 0.198792 -0.120656
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
5 -0.777678 1.734322 -1.270129 0.160094 0.355290
保存到 HDF 存储并从中选择:
In [88]: store = pd.HDFStore('d:/temp/test.h5')
In [89]: store.append('test', df, complib='blosc', complevel=5)
In [90]: store.close()
In [91]: store = pd.HDFStore('d:/temp/test.h5')
In [92]: store.select('test', where="Layer in [2,4] and No in [2,4,6]")
Out[92]:
0 1 2 3 4
Layer No
2 2 -1.986077 -0.846584 0.895982 0.236790 0.240023
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
6 -0.345720 -0.942421 0.538227 -0.558793 -1.075346
4 2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
6 -1.037775 -0.542944 -0.913428 0.885965 -0.155220
MultiIndex documentation(有两个级别:Layer
、No
)。
只想在 HDF5 存储(.h5 文件)中归档一些 Pandas 数据帧。下面是我正在使用的代码。
# Fake data over N runs
Data_N = []
for n in range(5):
Data_N.append(np.random.randn(5000,15,125))
# Create HDFStore object
store = pd.HDFStore('test.h5')
# For each run:
for n in range(len(Data_N)):
Data = Data_N[n]
# Pandas DataFrame for "flattened" fake data
Data_subDFs = []
nanbuff = np.nan*np.zeros((1,len(Data[0,0])))
for i in range(len(Data)):
Data_i = np.vstack((nanbuff,Data[i,:,:]))
Data_subDFs.append(pd.DataFrame(data = Data_i))
Data_DF = pd.concat(Data_subDFs)
# Row and column labels for the DataFrame
Data_rows = []
for i in range(len(Data)):
Data_rows.append(['Layer %d:' % (i+1)] + range(1,len(Data[0])+1))
Data_DF.index = sum(Data_rows,[])
Data_DF.columns = range(1,len(Data[0,0])+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (n+1), Data_DF)
#store.put('Data_DF_%d' % (n+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (n+1), Data_DF, format='table', data_columns=True)
# Save the HDF5 file
store.close()
这给出了以下输出:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->axis1] [items->None]
如果我使用 put 的第二个版本,它给出:
TypeError: Passing an incorrect value to a table column. Expected a Col (or subc
lass) instance and got: "ObjectAtom()". Please make use of the Col(), or descend
ant, constructor to properly initialize columns.
如果我使用第三个版本的 put,它给出:
ValueError: cannot have non-object label DataIndexableCol
有人可以解释一下不同的版本吗,以及为什么我无法在没有酸洗的情况下将我认为有效的 Pandas DataFrame 保存在 HDF5 中?
如果有帮助,我认为我不需要能够附加 DataFrame/store。我只想要使用 Pandas HDF5 接口保存 DF 的最佳性能方式。
谢谢!
编辑 1:
我把"For each run:"之后的代码更新成这个
# For each run:
for run in range(len(Data_N)):
Data = Data_N[run]
l = len(Data)
m = len(Data[0])
n = len(Data[0,0])
# Pandas DataFrame for "flattened" fake data
Data_subDFs = []
for i in range(len(Data)):
Data_i = Data[i,:,:]
Data_subDFs.append(pd.DataFrame(data = Data_i))
Data_DF = pd.concat(Data_subDFs)
# Row and column labels for the DataFrame
L1 = np.zeros((l*m,1), dtype=object) # Layer number
L2 = np.zeros((l*m,1), dtype=object) # Row number
for i in range(l):
for j in range(m):
L1[i*m + j,0] = 'Layer %d' % (i+1)
L2[i*m + j,0] = '%d' % (j+1)
Data_DF.index = np.hstack((L1,L2))
Data_DF.columns = range(1,n+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (run+1), Data_DF)
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)
但是对于每个 put 行,这会给出相同的警告或错误。
编辑 2(有效!):
# For each run:
for run in range(len(Data_N)):
Data = Data_N[run]
l = len(Data)
m = len(Data[0])
n = len(Data[0,0])
# Pandas DataFrame for "flattened" fake data
Data_DF = pd.DataFrame(Data.reshape(l*m,n))
# Layer and row labels
layers = np.arange(1,l+1)
rows = np.arange(1,m+1)
# Pandas multi-index
mindex = pd.MultiIndex.from_product([layers,rows], names=['Layer','Row'])
# DataFrame multi-index and column labels
Data_DF.index = mindex
Data_DF.columns = range(1,n+1)
# Put Pandas DataFrame into store
store.put('Data_DF_%d' % (run+1), Data_DF)
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table')
#store.put('Data_DF_%d' % (run+1), Data_DF, format='table', data_columns=True)
第三行仍然给出同样的错误,但由于第二行有效,我假设第三行在这种情况下只是一个无效命令。
第二条线也比第一条线快很多,而且都比酸洗路线快得多。谢谢!
更新:
这是一个小演示:
设置:
data = np.random.randn(5,10,5)
index = pd.MultiIndex.from_product([np.arange(1, len(data)+1),
np.arange(1,len(data[0])+1)], names=['Layer','No'])
df = pd.DataFrame(data.reshape(data.shape[0] * data.shape[1], data.shape[2]),
index=index)
数据:
In [82]: df
Out[82]:
0 1 2 3 4
Layer No
1 1 1.167144 0.640303 0.059197 -1.637180 0.667196
2 2.150872 -0.825325 -0.332458 -1.307043 1.361330
3 -0.931299 -0.931882 0.153943 -0.446289 0.651594
4 -0.131500 -0.489745 1.264029 0.889779 1.081613
5 -0.479022 -1.516204 0.616170 0.126860 0.125559
6 1.114287 -0.939504 0.058869 0.321159 0.340881
7 -0.527516 -0.362337 -0.590430 -0.609017 1.835716
8 0.063372 0.000792 0.855485 -0.113592 0.890687
9 -0.160041 1.978954 0.778428 1.988354 2.095665
10 0.687911 0.115918 -0.653885 0.486365 -0.775659
2 1 -0.123350 0.674359 -0.120634 -1.350044 -0.176252
2 -1.986077 -0.846584 0.895982 0.236790 0.240023
3 0.878597 0.241594 0.405382 1.785109 1.228188
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
5 -1.424249 -0.183216 -0.044330 0.324894 -0.271179
6 -0.345720 -0.942421 0.538227 -0.558793 -1.075346
7 1.327952 -2.335520 -0.164645 1.489798 -0.876896
8 1.043723 0.770489 -1.052739 -0.830190 1.005406
9 0.789100 -0.706633 -1.014431 -1.164513 -0.266424
10 2.061175 0.933526 -1.601836 -1.542535 -1.220943
3 1 -0.061520 -0.932599 0.103480 -0.318529 -0.311965
2 -0.401409 -0.308739 -1.399233 -1.172032 -0.550774
3 0.670272 1.215724 0.711328 2.332297 -1.326704
4 0.377469 0.752313 -1.223832 0.431555 -0.901796
5 -2.386383 0.053921 -1.175427 -0.794099 -0.469374
6 0.951571 -2.220609 0.208136 -2.141828 0.010316
7 1.047133 0.924568 0.282091 1.367981 -0.617389
8 1.083008 -1.519416 0.535690 0.196885 -0.022692
9 1.307252 1.099716 0.766976 -0.466699 1.113605
10 -0.614214 0.702395 -0.131248 1.773092 0.241553
4 1 -1.280026 0.278248 -0.518560 -0.395394 0.434473
2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
3 -0.539773 -0.755483 -1.002526 0.198792 -0.120656
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
5 -0.777678 1.734322 -1.270129 0.160094 0.355290
6 -1.037775 -0.542944 -0.913428 0.885965 -0.155220
7 -0.855498 -0.330268 -1.903738 0.098101 -0.670830
8 0.786258 0.599100 -0.426781 0.425572 0.132932
9 -0.430497 -1.414292 -0.997637 0.696176 -0.480886
10 1.211665 -1.233842 0.137176 1.520013 -1.052884
5 1 -0.267698 -1.013917 -1.324896 -1.189835 -0.192396
2 1.047264 -0.454829 1.051039 1.565423 0.749844
3 0.159177 0.481088 0.711499 -1.217079 0.444402
4 0.254420 -0.114102 0.620231 1.890822 1.269808
5 0.673696 -0.321638 -0.887355 0.426549 -0.935591
6 -1.836808 0.450332 1.187512 -0.215318 -1.142346
7 -1.496568 0.633886 0.625143 0.295385 1.445084
8 -0.473427 -0.608318 -0.602080 0.134105 0.704027
9 2.319899 0.763272 0.861798 1.464612 -0.708869
10 -0.199555 0.721122 0.099777 -0.466488 0.923112
In [84]: df.index.levels
Out[84]: FrozenList([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]])
现在你可以像下面这样切片:
In [85]: idx = pd.IndexSlice
In [86]: df.loc[idx[[2,4], 2:5], :]
Out[86]:
0 1 2 3 4
Layer No
2 2 -1.986077 -0.846584 0.895982 0.236790 0.240023
3 0.878597 0.241594 0.405382 1.785109 1.228188
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
5 -1.424249 -0.183216 -0.044330 0.324894 -0.271179
4 2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
3 -0.539773 -0.755483 -1.002526 0.198792 -0.120656
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
5 -0.777678 1.734322 -1.270129 0.160094 0.355290
保存到 HDF 存储并从中选择:
In [88]: store = pd.HDFStore('d:/temp/test.h5')
In [89]: store.append('test', df, complib='blosc', complevel=5)
In [90]: store.close()
In [91]: store = pd.HDFStore('d:/temp/test.h5')
In [92]: store.select('test', where="Layer in [2,4] and No in [2,4,6]")
Out[92]:
0 1 2 3 4
Layer No
2 2 -1.986077 -0.846584 0.895982 0.236790 0.240023
4 -1.510238 -0.303274 0.247082 1.841996 -0.864595
6 -0.345720 -0.942421 0.538227 -0.558793 -1.075346
4 2 1.498882 -1.359542 0.012312 -0.231728 -2.643232
4 0.056788 1.289477 -0.440122 -1.454418 -0.043193
6 -1.037775 -0.542944 -0.913428 0.885965 -0.155220
MultiIndex documentation(有两个级别:Layer
、No
)。