Create a dictionary of dataframes using chunks
I have a dataframe df of the form
permno date time_avail_m ... OperProfRD_q _merge ret
100000 11167 1989-01-31 1989m1 ... NaN both -0.170732
100001 11167 1989-02-28 1989m2 ... NaN both -0.088235
100002 11167 1989-03-31 1989m3 ... NaN both -0.064516
100003 11167 1989-05-31 1989m5 ... NaN both 0.181818
100004 11167 1989-06-30 1989m6 ... NaN both 0.179487
The result of df.info() is
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 10000 to 19999
Columns: 320 entries, permno to ret
dtypes: datetime64[ns](1), float64(304), int64(13), object(2)
memory usage: 24.4+ MB
None
This is the output obtained by running df.head while looping through my dataframe df in chunks.
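I need to create a dictionary of dataframes, where the dictionary keys are the values in the date column and the values are dataframes indexed by permno, with the remaining columns of df as columns. Is there an efficient way to do this? I would like to do it in chunks, because df is a very large database.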
Below is an example of how to implement a groupby operation on out-of-memory data, reading the data in chunks.
Sample data
import pandas as pd
file = 'C:/users/ricar/downloads/mushrooms.csv' # downloaded from kaggle
# df = pd.read_csv(file, nrows=2)
# df.info()
# Data columns (total 23 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 class 2 non-null object
# 1 cap-shape 2 non-null object
# 2 cap-surface 2 non-null object
# 3 cap-color 2 non-null object
# 4 bruises 2 non-null object
# 5 odor 2 non-null object
# 6 gill-attachment 2 non-null object
# 7 gill-spacing 2 non-null object
# 8 gill-size 2 non-null object
# 9 gill-color 2 non-null object
# 10 stalk-shape 2 non-null object
# 11 stalk-root 2 non-null object
# 12 stalk-surface-above-ring 2 non-null object
# 13 stalk-surface-below-ring 2 non-null object
# 14 stalk-color-above-ring 2 non-null object
# 15 stalk-color-below-ring 2 non-null object
# 16 veil-type 2 non-null object
# 17 veil-color 2 non-null object
# 18 ring-number 2 non-null object
# 19 ring-type 2 non-null object
# 20 spore-print-color 2 non-null object
# 21 population 2 non-null object
# 22 habitat 2 non-null object
# dtypes: object(23)
# memory usage: 496.0+ bytes
Build the grouper
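from collections import defaultdict
# pick your pivot columns
idx = 'cap-shape'
grouper = ['cap-surface']
# populate the grouper: map each group key to the row numbers where it occurs
groups = defaultdict(list)
for chunk in pd.read_csv(file, usecols=grouper, chunksize=1000):
    # Series of row numbers indexed by the grouper value;
    # squeeze('columns') keeps a Series even when a chunk has a single row
    chunk = chunk.reset_index().set_index(grouper).squeeze('columns')
    for key, g in chunk.groupby(chunk.index):
        groups[key].extend(g.to_list())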
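One possible use of the row numbers collected by the grouper is to read back only the rows of a single group. The sketch below assumes that idea; it is not part of the original answer. The in-memory CSV text and the helper name load_rows are hypothetical and only there to make the example runnable on its own.

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical stand-in for the CSV file on disk.
csv_text = "cap-shape,cap-surface,odor\nx,s,p\nb,f,a\nx,s,n\nk,f,n\nb,y,a\n"

# Collect, per group key, the row numbers where the key occurs.
# The index of each chunk continues from the previous chunk.
groups = defaultdict(list)
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["cap-surface"], chunksize=2):
    for key, g in chunk.groupby("cap-surface"):
        groups[key].extend(g.index.to_list())

def load_rows(buffer, rows, **kwargs):
    """Read only the data rows whose 0-based positions are in `rows`."""
    keep = set(rows)
    # skiprows is called with the file line number: line 0 is the header,
    # so data row i sits on line i + 1.
    return pd.read_csv(
        buffer, skiprows=lambda i: i > 0 and (i - 1) not in keep, **kwargs
    )

sub_f = load_rows(io.StringIO(csv_text), groups["f"], index_col="cap-shape")
```

Here sub_f holds only the rows whose cap-surface is 'f', without ever materialising the other rows as a dataframe.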
Use it to filter the data loaded in chunks
# load a single sub-dataframe
def load_subdf(key, **kwargs):
    out = []
    for chunk in pd.read_csv(file, **kwargs):
        # keep only the rows of this chunk that belong to the requested group
        out.append(chunk[chunk[grouper[0]].eq(key)])
    return pd.concat(out).drop(columns=grouper)
df_f = load_subdf('f', index_col=idx, chunksize=1000)
Output
df_f.info()
<class 'pandas.core.frame.DataFrame'>
Index: 2320 entries, x to k
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 2320 non-null object
1 cap-color 2320 non-null object
2 bruises 2320 non-null object
3 odor 2320 non-null object
4 gill-attachment 2320 non-null object
5 gill-spacing 2320 non-null object
6 gill-size 2320 non-null object
7 gill-color 2320 non-null object
8 stalk-shape 2320 non-null object
9 stalk-root 2320 non-null object
10 stalk-surface-above-ring 2320 non-null object
11 stalk-surface-below-ring 2320 non-null object
12 stalk-color-above-ring 2320 non-null object
13 stalk-color-below-ring 2320 non-null object
14 veil-type 2320 non-null object
15 veil-color 2320 non-null object
16 ring-number 2320 non-null object
17 ring-type 2320 non-null object
18 spore-print-color 2320 non-null object
19 population 2320 non-null object
20 habitat 2320 non-null object
dtypes: object(21)
memory usage: 398.8+ KB
Note that the index is no longer the default RangeIndex, and the grouper column is not part of the result.
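The same chunked filtering can be extended to build the whole dictionary in a single pass over the file, which is what the question asks for. This is a minimal sketch, assuming the final dictionary fits in memory even if the raw file is only read in chunks. The function name dict_of_frames and the small in-memory CSV are hypothetical, and the dates are left as plain strings.

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical stand-in for the large CSV file.
csv_text = (
    "permno,date,ret\n"
    "11167,1989-01-31,-0.17\n"
    "11168,1989-01-31,0.05\n"
    "11167,1989-02-28,-0.09\n"
    "11168,1989-02-28,0.02\n"
    "11169,1989-01-31,0.11\n"
)

def dict_of_frames(buffer, key_col, index_col, chunksize):
    """Group a CSV by key_col chunk by chunk; return {key: sub-dataframe}."""
    pieces = defaultdict(list)
    for chunk in pd.read_csv(buffer, chunksize=chunksize):
        for key, g in chunk.groupby(key_col):
            # Drop the key column and index each piece by index_col.
            pieces[key].append(g.drop(columns=key_col).set_index(index_col))
    # Stitch the per-chunk pieces of every group back together.
    return {key: pd.concat(parts) for key, parts in pieces.items()}

dict_dfs = dict_of_frames(io.StringIO(csv_text), "date", "permno", chunksize=2)
```

If the grouped result is itself too large for memory, one option is to append each piece to a per-key file on disk instead of keeping it in the pieces dictionary.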
First answer:
Your dataframe is small enough to be reshaped in memory... try the following
df = df.set_index('permno') # discard current index
dict_dfs = {date: gdf for date, gdf in df.groupby('date')}
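To see what the in-memory approach produces, here is a small self-contained sketch on made-up data (the three rows are hypothetical, not taken from the question). It also drops the date column from each sub-dataframe, on the assumption that the question wants only the remaining columns.

```python
import pandas as pd

# Hypothetical toy data with the same key columns as the question.
df = pd.DataFrame({
    "permno": [11167, 11168, 11167],
    "date": pd.to_datetime(["1989-01-31", "1989-01-31", "1989-02-28"]),
    "ret": [-0.17, 0.05, -0.09],
})

# One sub-dataframe per date, indexed by permno, without the date column.
dict_dfs = {
    date: gdf.drop(columns="date")
    for date, gdf in df.set_index("permno").groupby("date")
}

jan = dict_dfs[pd.Timestamp("1989-01-31")]
```

Because the date column has a datetime dtype, the dictionary keys are pandas Timestamp objects rather than strings.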