How to create a pivot table on extremely large dataframes in Pandas
I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of around 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.

How can I pivot data this large with a limited amount of RAM?

EDIT: adding sample code

The following code includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will falsely appear to produce correct output. The main issue is that if a shipmentid entry isn't in every chunk that sum(wawa) looks at, it doesn't show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format','table')

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0

    frame = frame.append(span)

    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)

print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('$$$')

store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.

Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):

df = store['df']

You can also query, to get only subsections of the DataFrame.
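For example, a sketch of such a query (the tiny frame and the shipmentid/qty column names are made up to mirror the question; note the column must be declared in data_columns for the on-disk query to work):

```python
import os
import pandas as pd

# a toy stand-in for the real data
df = pd.DataFrame({'shipmentid': [1, 2, 3, 1], 'qty': [5, 2, 7, 1]})

store = pd.HDFStore('store.h5', mode='w')
# data_columns makes 'shipmentid' queryable on disk
store.append('df', df, data_columns=['shipmentid'])

# read back only the rows for shipmentid == 1, without loading the whole table
subset = store.select('df', where='shipmentid == 1')
print(subset)

store.close()
os.remove('store.h5')
```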
Aside: you should also buy more RAM, it's cheap.
Edit: you can groupby/sum from the store iteratively, since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))
# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above actually doesn't work in pandas 0.16 (I think it did in 0.15.2); instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))
In python 3 you must import reduce from functools.
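On Python 3 that looks like the following (a toy sketch: the two small frames stand in for per-chunk groupby results, since the real chunks would come from store.select):

```python
from functools import reduce

import pandas as pd

# two per-chunk groupby results; shipmentid 2 appears in both,
# shipmentids 1 and 3 each appear in only one chunk
chunks = [
    pd.DataFrame({'qty': [5, 2]}, index=pd.Index([1, 2], name='shipmentid')),
    pd.DataFrame({'qty': [7, 1]}, index=pd.Index([2, 3], name='shipmentid')),
]

# add with fill_value=0 keeps groups that only appear in some chunks,
# instead of turning them into NaN
total = reduce(lambda x, y: x.add(y, fill_value=0), chunks)
print(total)
```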
It's perhaps more pythonic/readable to write it as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks) # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor / if there are a large number of new groups, it may be preferable to start res as zeros of the correct size (by getting the unique group keys, e.g. by looping through the chunks), and then add in place.
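A minimal sketch of that pre-allocation idea (the group keys and column names are invented for illustration; a real implementation would iterate over the store twice, once to collect keys and once to accumulate):

```python
import pandas as pd

# per-chunk partial sums, as they would come back from store.select(..., chunksize=...)
chunks = [
    pd.DataFrame({'qty': [5, 2]}, index=pd.Index([1, 2], name='shipmentid')),
    pd.DataFrame({'qty': [7, 1]}, index=pd.Index([2, 3], name='shipmentid')),
]

# first pass: collect every group key that appears in any chunk
all_keys = sorted(set().union(*(c.index for c in chunks)))

# start from a correctly-sized frame of zeros, then accumulate each chunk
res = pd.DataFrame(0, index=pd.Index(all_keys, name='shipmentid'), columns=['qty'])
for c in chunks:
    res = res.add(c, fill_value=0)
print(res)
```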