从 HDFStore 检索多索引 Pandas DataFrame 时遇到问题(Table 格式)
Trouble retrieving multi-index Pandas DataFrame from HDFStore (in Table format)
我有一些代码,简化为下面包含的示例,它获取一些原始数据,从中创建一个数据透视表 table,然后将其与另一个数据帧合并,最后将结果存储在 HDFStore 对象中.如果它以固定格式存储,则可以很好地检索它。但是,如果以 Table 格式存储,则会产生错误。我需要 table 格式,以便我可以一次提取块(总数据集是数千万行)。
知道问题出在哪里吗?
代码示例:
import pandas as pd
def createFrame():
data = {
'colA':[1,1,1,2,2,2,3,3,3,4,4,4,4],
'colB':[5,5,5,5,5,5,5,5,5,5,5,5,5],
'colC':['a','b','c','a','b','c','a','b','c','d','e','f','g'],
'colD':['ap','ap','ap','bp','bp','bp','cp','cp','cp','dp','dp','dp','dp']
}
frame = pd.DataFrame(data)
return frame
def createOtherFrame():
data = {
'colD':['ap','bp','cp','dp'],
'colE':[100,200,300,400]
}
frame = pd.DataFrame(data)
return frame
if __name__ == '__main__':
pd.set_option('display.width', 120) # default is 80
pd.set_option('io.hdf.default_format','table')
pivotStore = pd.HDFStore('test.h5')
frame = createFrame()
otherFrame = createOtherFrame()
pivoted = frame.pivot_table(['colB'],
index=['colA'],
columns='colC',
aggfunc='sum'
)
print(pivoted)
print('')
# print(pivoted.info(),end='\n\n')
mergedFrameA = pd.merge(frame[['colA','colD']].drop_duplicates(),
otherFrame,
how = 'left',
on='colD'
).set_index('colA')
# print(mergedFrameA.info())
print(mergedFrameA)
mergedFrameB = pd.merge(pivoted,mergedFrameA,how='left',left_index=True,right_index=True)
# print(mergedFrameB.info())
print(mergedFrameB)
pivotStore['bob'] = mergedFrameB
pivotStore.close()
store = pd.HDFStore('test.h5')
extracted = store.select('bob',start=0,stop=5)
print(extracted)
store.close()
产生的输出(有错误):
colB
colC a b c d e f g
colA
1 5 5 5 NaN NaN NaN NaN
2 5 5 5 NaN NaN NaN NaN
3 5 5 5 NaN NaN NaN NaN
4 NaN NaN NaN 5 5 5 5
colD colE
colA
1 ap 100
2 bp 200
3 cp 300
4 dp 400
(colB, a) (colB, b) (colB, c) (colB, d) (colB, e) (colB, f) (colB, g) colD colE
colA
1 5 5 5 NaN NaN NaN NaN ap 100
2 5 5 5 NaN NaN NaN NaN bp 200
3 5 5 5 NaN NaN NaN NaN cp 300
4 NaN NaN NaN 5 5 5 5 dp 400
Traceback (most recent call last):
File "C:\multiindextest.py", line 52, in <module>
extracted = store.select('bob',start=0,stop=5)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 3999, in read
cols.set_names(names, inplace=True)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\core\index.py", line 529, in set_names
idx._set_names(names, level=level)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\core\index.py", line 3274, in _set_names
'Length of names must match number of levels in MultiIndex.')
ValueError: Length of names must match number of levels in MultiIndex.
Closing remaining open files:test.h5...done
你不能存储这样的索引,它部分是元组部分是字符串。它不是真正的 MultiIndex(也没有任何用处)。
我会简单地使用常规索引。您也可以使用 MultiIndex 执行此操作,但每一列都必须是它的一部分。
In [67]: pivoted = frame.pivot_table('colB',index='colA',columns='colC',aggfunc='sum')
In [68]: pivoted
Out[68]:
colC a b c d e f g
colA
1 5 5 5 NaN NaN NaN NaN
2 5 5 5 NaN NaN NaN NaN
3 5 5 5 NaN NaN NaN NaN
4 NaN NaN NaN 5 5 5 5
In [69]: df = pd.concat([pivoted,mergedFrameA],axis=1)
In [70]: df
Out[70]:
colC a b c d e f g colD colE
colA
1 5 5 5 NaN NaN NaN NaN ap 100
2 5 5 5 NaN NaN NaN NaN bp 200
3 5 5 5 NaN NaN NaN NaN cp 300
4 NaN NaN NaN 5 5 5 5 dp 400
我有一些代码,简化为下面包含的示例,它获取一些原始数据,从中创建一个数据透视表 table,然后将其与另一个数据帧合并,最后将结果存储在 HDFStore 对象中.如果它以固定格式存储,则可以很好地检索它。但是,如果以 Table 格式存储,则会产生错误。我需要 table 格式,以便我可以一次提取块(总数据集是数千万行)。
知道问题出在哪里吗?
代码示例:
import pandas as pd
def createFrame():
data = {
'colA':[1,1,1,2,2,2,3,3,3,4,4,4,4],
'colB':[5,5,5,5,5,5,5,5,5,5,5,5,5],
'colC':['a','b','c','a','b','c','a','b','c','d','e','f','g'],
'colD':['ap','ap','ap','bp','bp','bp','cp','cp','cp','dp','dp','dp','dp']
}
frame = pd.DataFrame(data)
return frame
def createOtherFrame():
data = {
'colD':['ap','bp','cp','dp'],
'colE':[100,200,300,400]
}
frame = pd.DataFrame(data)
return frame
if __name__ == '__main__':
pd.set_option('display.width', 120) # default is 80
pd.set_option('io.hdf.default_format','table')
pivotStore = pd.HDFStore('test.h5')
frame = createFrame()
otherFrame = createOtherFrame()
pivoted = frame.pivot_table(['colB'],
index=['colA'],
columns='colC',
aggfunc='sum'
)
print(pivoted)
print('')
# print(pivoted.info(),end='\n\n')
mergedFrameA = pd.merge(frame[['colA','colD']].drop_duplicates(),
otherFrame,
how = 'left',
on='colD'
).set_index('colA')
# print(mergedFrameA.info())
print(mergedFrameA)
mergedFrameB = pd.merge(pivoted,mergedFrameA,how='left',left_index=True,right_index=True)
# print(mergedFrameB.info())
print(mergedFrameB)
pivotStore['bob'] = mergedFrameB
pivotStore.close()
store = pd.HDFStore('test.h5')
extracted = store.select('bob',start=0,stop=5)
print(extracted)
store.close()
产生的输出(有错误):
colB
colC a b c d e f g
colA
1 5 5 5 NaN NaN NaN NaN
2 5 5 5 NaN NaN NaN NaN
3 5 5 5 NaN NaN NaN NaN
4 NaN NaN NaN 5 5 5 5
colD colE
colA
1 ap 100
2 bp 200
3 cp 300
4 dp 400
(colB, a) (colB, b) (colB, c) (colB, d) (colB, e) (colB, f) (colB, g) colD colE
colA
1 5 5 5 NaN NaN NaN NaN ap 100
2 5 5 5 NaN NaN NaN NaN bp 200
3 5 5 5 NaN NaN NaN NaN cp 300
4 NaN NaN NaN 5 5 5 5 dp 400
Traceback (most recent call last):
File "C:\multiindextest.py", line 52, in <module>
extracted = store.select('bob',start=0,stop=5)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 665, in select
return it.get_result()
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 1359, in get_result
results = self.func(self.start, self.stop, where)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 658, in func
columns=columns, **kwargs)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\io\pytables.py", line 3999, in read
cols.set_names(names, inplace=True)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\core\index.py", line 529, in set_names
idx._set_names(names, level=level)
File "C:\Anaconda3\envs\py34\lib\site-packages\pandas\core\index.py", line 3274, in _set_names
'Length of names must match number of levels in MultiIndex.')
ValueError: Length of names must match number of levels in MultiIndex.
Closing remaining open files:test.h5...done
你不能存储这样的索引,它部分是元组部分是字符串。它不是真正的 MultiIndex(也没有任何用处)。
我会简单地使用常规索引。您也可以使用 MultiIndex 执行此操作,但每一列都必须是它的一部分。
In [67]: pivoted = frame.pivot_table('colB',index='colA',columns='colC',aggfunc='sum')
In [68]: pivoted
Out[68]:
colC a b c d e f g
colA
1 5 5 5 NaN NaN NaN NaN
2 5 5 5 NaN NaN NaN NaN
3 5 5 5 NaN NaN NaN NaN
4 NaN NaN NaN 5 5 5 5
In [69]: df = pd.concat([pivoted,mergedFrameA],axis=1)
In [70]: df
Out[70]:
colC a b c d e f g colD colE
colA
1 5 5 5 NaN NaN NaN NaN ap 100
2 5 5 5 NaN NaN NaN NaN bp 200
3 5 5 5 NaN NaN NaN NaN cp 300
4 NaN NaN NaN 5 5 5 5 dp 400