Create a dictionary of dataframes using chunks

I have a dataframe df that looks like this:

        permno       date time_avail_m  ...  OperProfRD_q  _merge       ret
100000   11167 1989-01-31       1989m1  ...           NaN    both -0.170732
100001   11167 1989-02-28       1989m2  ...           NaN    both -0.088235
100002   11167 1989-03-31       1989m3  ...           NaN    both -0.064516
100003   11167 1989-05-31       1989m5  ...           NaN    both  0.181818
100004   11167 1989-06-30       1989m6  ...           NaN    both  0.179487

The output of df.info() is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 10000 to 19999
Columns: 320 entries, permno to ret
dtypes: datetime64[ns](1), float64(304), int64(13), object(2)
memory usage: 24.4+ MB
None

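This output was obtained by running df.head() while looping over my dataframe df in chunks. I need to create a dictionary of dataframes in which the keys are the values in the date column and the values are dataframes indexed by permno, with the remaining columns of df as columns. Is there an efficient way to do this? I want to do it in chunks because df is a very large dataset.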

Below is an example of how to implement a groupby operation on out-of-memory data by reading the data in chunks.

Sample data

import pandas as pd

file = 'C:/users/ricar/downloads/mushrooms.csv' # downloaded from kaggle

# df = pd.read_csv(file, nrows=2)
# df.info()
# Data columns (total 23 columns):
 # #   Column                    Non-Null Count  Dtype
# ---  ------                    --------------  -----
 # 0   class                     2 non-null      object
 # 1   cap-shape                 2 non-null      object
 # 2   cap-surface               2 non-null      object
 # 3   cap-color                 2 non-null      object
 # 4   bruises                   2 non-null      object
 # 5   odor                      2 non-null      object
 # 6   gill-attachment           2 non-null      object
 # 7   gill-spacing              2 non-null      object
 # 8   gill-size                 2 non-null      object
 # 9   gill-color                2 non-null      object
 # 10  stalk-shape               2 non-null      object
 # 11  stalk-root                2 non-null      object
 # 12  stalk-surface-above-ring  2 non-null      object
 # 13  stalk-surface-below-ring  2 non-null      object
 # 14  stalk-color-above-ring    2 non-null      object
 # 15  stalk-color-below-ring    2 non-null      object
 # 16  veil-type                 2 non-null      object
 # 17  veil-color                2 non-null      object
 # 18  ring-number               2 non-null      object
 # 19  ring-type                 2 non-null      object
 # 20  spore-print-color         2 non-null      object
 # 21  population                2 non-null      object
 # 22  habitat                   2 non-null      object
# dtypes: object(23)
# memory usage: 496.0+ bytes

Build the grouper

from collections import defaultdict

# pick your pivot columns
idx = 'cap-shape'
grouper = ['cap-surface']

# populate the grouper
groups = defaultdict(list)
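# the row index keeps counting across chunks, so it identifies each row in the file
for chunk in pd.read_csv(file, usecols=grouper, chunksize=1000):
    # Series of row numbers indexed by the grouper value; axis=1 keeps a 1-row chunk a Series
    chunk = chunk.reset_index().set_index(grouper).squeeze(axis=1)
    for key, g in chunk.groupby(chunk.index):
        groups[key].extend(g.to_list())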
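
One possible use of the row numbers collected this way is to pass a callable to the skiprows parameter of pd.read_csv, so that only the rows belonging to one key are parsed. The following is a minimal sketch under stated assumptions: the tiny CSV, the temporary file and the names rows_by_key, wanted and sub are made up for illustration and stand in for the real file.

```python
import os
import tempfile
from collections import defaultdict

import pandas as pd

# small made-up file standing in for mushrooms.csv
csv_text = "cap-shape,cap-surface,odor\nx,s,p\nb,f,a\nx,f,n\nk,s,n\nf,f,l\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(csv_text)
    path = tmp.name

# map each key to the row numbers where it occurs, reading in chunks
rows_by_key = defaultdict(list)
for chunk in pd.read_csv(path, usecols=["cap-surface"], chunksize=2):
    for key, g in chunk.groupby("cap-surface"):
        rows_by_key[key].extend(g.index.to_list())

wanted = set(rows_by_key["f"])
# line 0 of the file is the header, so data row n sits on line n + 1
sub = pd.read_csv(path, skiprows=lambda i: i > 0 and (i - 1) not in wanted)
os.remove(path)
print(sub)
```

The file is still scanned from start to end, but rows for other keys are never materialised as a dataframe.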

Use it to filter the data as it is loaded in chunks

# load a single sub-dataframe    
def load_subdf(key, **kwargs):
    out = []
    for chunk in pd.read_csv(file, **kwargs):
        out.append(chunk[chunk[grouper[0]].eq(key)])
    return pd.concat(out).drop(columns=grouper)

df_f = load_subdf('f', index_col=idx, chunksize=1000)

Output

df_f.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2320 entries, x to k
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   class                     2320 non-null   object
 1   cap-color                 2320 non-null   object
 2   bruises                   2320 non-null   object
 3   odor                      2320 non-null   object
 4   gill-attachment           2320 non-null   object
 5   gill-spacing              2320 non-null   object
 6   gill-size                 2320 non-null   object
 7   gill-color                2320 non-null   object
 8   stalk-shape               2320 non-null   object
 9   stalk-root                2320 non-null   object
 10  stalk-surface-above-ring  2320 non-null   object
 11  stalk-surface-below-ring  2320 non-null   object
 12  stalk-color-above-ring    2320 non-null   object
 13  stalk-color-below-ring    2320 non-null   object
 14  veil-type                 2320 non-null   object
 15  veil-color                2320 non-null   object
 16  ring-number               2320 non-null   object
 17  ring-type                 2320 non-null   object
 18  spore-print-color         2320 non-null   object
 19  population                2320 non-null   object
 20  habitat                   2320 non-null   object
dtypes: object(21)
memory usage: 398.8+ KB

Note that the index is no longer the default range index and that the grouper column is not part of the result.
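
To get closer to what the question asks for, the same chunked reading can be used to build the whole dictionary in one pass. This is only a sketch: the function name dict_of_frames, its parameters and the four-row sample are assumptions made for illustration, and it assumes the collected pieces fit in memory once reading is done (only the reading is chunked).

```python
import io
from collections import defaultdict

import pandas as pd


def dict_of_frames(source, key_col, index_col, chunksize=1000, **read_kwargs):
    """Read `source` in chunks and return {key value: sub-dataframe}."""
    pieces = defaultdict(list)
    for chunk in pd.read_csv(source, chunksize=chunksize, **read_kwargs):
        # split each chunk by the key column and stash the pieces per key
        for key, part in chunk.groupby(key_col):
            pieces[key].append(part.drop(columns=key_col).set_index(index_col))
    # stitch together the pieces that were collected for each key
    return {key: pd.concat(parts) for key, parts in pieces.items()}


# tiny made-up sample standing in for the real file
csv_text = """permno,date,ret
11167,1989-01-31,-0.17
11168,1989-01-31,0.05
11167,1989-02-28,-0.09
11168,1989-02-28,0.02
"""
frames = dict_of_frames(io.StringIO(csv_text), "date", "permno", chunksize=3)
print(frames["1989-02-28"])
```

Here the chunk size of 3 is chosen so that one date is split across two chunks, which shows the pieces being concatenated back together. If the full result does not fit in memory, each piece could instead be appended to a per-key file on disk.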


First answer:

Your dataframe is small enough to be reshaped in memory... try the following:

df = df.set_index('permno') # discard current index
dict_dfs = {date: gdf for date, gdf in df.groupby('date')}
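
As a quick check of what this produces, here is a small self-contained sketch. The rows are made up (only permno 11167 and its two ret values come from the question), and dropping the date column from each piece is an optional extra step, not part of the snippet above.

```python
import pandas as pd

# made-up rows mimicking a few of the columns shown in the question
df = pd.DataFrame({
    "permno": [11167, 11168, 11167, 11168],
    "date": pd.to_datetime(
        ["1989-01-31", "1989-01-31", "1989-02-28", "1989-02-28"]
    ),
    "ret": [-0.170732, 0.05, -0.088235, 0.02],
})

df = df.set_index("permno")  # discard current index
# optional: drop the key column, since it is constant within each group
dict_dfs = {date: gdf.drop(columns="date") for date, gdf in df.groupby("date")}

# the keys are pandas Timestamps because the date column is datetime64
jan = dict_dfs[pd.Timestamp("1989-01-31")]
print(jan)
```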