Pandas: reshape data frame
I have the following data frame:
import pandas as pd
url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df'
zz = pd.read_csv(url)
zz.head(5)
date feccandid feccandcfscore.dyn pacid paccfscore cid catcode type_x di amtsum state log_diff_unemployment party type_y bills years_exp disposition billsum
0 2006 S8NV00073 0.496 C00000422 0.330 N00006619 H1100 24K D 5000 NV -0.024693 Republican rep s22-109 12 support 3
1 2006 S8NV00073 0.496 C00375360 0.176 N00006619 H1100 24K D 4500 NV -0.024693 Republican rep s22-109 12 support 3
2 2006 S8NV00073 0.496 C00113803 0.269 N00006619 H1130 24K D 2500 NV -0.024693 Republican rep s22-109 12 support 2
3 2006 S8NV00073 0.496 C00249342 0.421 N00006619 H1130 24K D 5000 NV -0.024693 Republican rep s22-109 12 support 2
4 2006 S8NV00073 0.496 C00255752 0.254 N00006619 H1130 24K D 4000 NV -0.024693 Republican rep s22-109 12 support 2
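I want to manipulate it so that the date column is an index, the feccandid values are the column headers (later I will make them a second index so that I can send the frame to a panel), and the other column headers become rows. The desired output looks like this: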
date feccandid S8NV00072 S8NV00074 S8NV00075 S8NV00076 S8NV00077
2006 feccandcfscore.dyn 0.496 0.496 0.496 0.496 0.496
2006 pacid C00000422 C00375360 C00113803 C00249342 C00255752
2006 paccfscore 0.33 0.176 0.269 0.421 0.254
2006 cid N00006619 N00006619 N00006619 N00006619 N00006619
2006 catcode H1100 H1100 H1130 H1130 H1130
2006 type_x 24K 24K 24K 24K 24K
2006 di D D D D D
2006 amtsum 5000 4500 2500 5000 4000
2006 state NV NV NV NV NV
2006 log_diff_unemployment -0.024693 -0.024693 -0.024693 -0.024693 -0.024693
2006 party Republican Republican Republican Republican Republican
2006 type_y rep rep rep rep rep
2006 bills s22-109 s22-109 s22-109 s22-109 s22-109
2006 years_exp 12 12 12 12 12
2006 disposition support support support support support
2006 billsum 3 3 2 2 2
Following jezrael's recommendation, I tried the following:
import numpy as np
zz = zz.pivot_table(index='date', columns='feccandid', aggfunc=np.mean)
zz.head()
feccandcfscore.dyn ... billsum
feccandid H0AL02087 H0AL07060 H0AR01083 H0AR02107 H0AR03055 H0AR04038 H0AZ01259 H0AZ03362 H0CA15148 H0CA19173 ... S8MI00158 S8MN00438 S8MS00055 S8MT00010 S8NC00239 S8NE00117 S8NM00010 S8NV00073 S8OR00207 S8WI00026
date
2005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2006 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2.125 NaN NaN
2007 NaN 0.016 NaN NaN NaN -0.151 NaN NaN -0.777 NaN ... 1.000000 NaN 1.666667 1.552632 NaN NaN 2.0 1.000 NaN 2.0
2008 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.285714 NaN NaN 5.431373 NaN NaN NaN NaN NaN NaN
2009 NaN NaN NaN NaN NaN -0.086 NaN NaN -0.790 NaN ... NaN NaN NaN 2.433333 NaN NaN NaN NaN 3.0 2.8
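This is close to what I want, except that I am trying to get feccandid as the only column headers, with the original column headers (which, in the last example, are the topmost column headers) converted to rows.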
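I think you can use pivot_table (the default aggregation function is np.mean):
df = zz.pivot_table(index='date', columns='feccandid', aggfunc=np.mean)
# Flatten the two-level column labels into single strings such as 'billsum_S8NV00073'
df.columns = ['_'.join(col) for col in df.columns.values]
print(df)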
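To make the column-flattening step concrete, here is a minimal sketch on a tiny made-up frame (the `sample` data and the name `pivoted` are illustrative assumptions, not the real dataset). It shows that each column label produced by pivot_table is a tuple, which is why `'_'.join(col)` works when both levels are strings.

```python
import pandas as pd

# Hypothetical miniature of the real data: two candidates, two years.
sample = pd.DataFrame({
    "date": [2001, 2001, 2002, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072", "S8NV00074"],
    "pacid": [0.3, 0.1, 0.7, 0.4],
    "billsum": [1, 2, 5, 6],
})

pivoted = sample.pivot_table(index="date", columns="feccandid", aggfunc="mean")

# Each column label is a tuple such as ("billsum", "S8NV00072");
# joining the tuple yields one flat string per column.
pivoted.columns = ["_".join(col) for col in pivoted.columns.values]
print(pivoted.columns.tolist())
```

The join assumes every level value is a string; numeric level values would need to be converted with `str` first.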
If you need to replace NaN with 0:
print(zz.pivot_table(index='date', columns='feccandid', fill_value=0, aggfunc=np.mean))
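As a small illustration of what fill_value changes, the sketch below uses a made-up frame in which one (date, feccandid) combination is missing. The data and the names `with_nan` and `with_zero` are assumptions for the example only.

```python
import pandas as pd

# Hypothetical data: candidate S8NV00074 has no row for 2002.
sample = pd.DataFrame({
    "date": [2001, 2001, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072"],
    "billsum": [1, 2, 5],
})

# Without fill_value the missing combination shows up as NaN.
with_nan = sample.pivot_table(index="date", columns="feccandid",
                              values="billsum", aggfunc="mean")

# With a numeric fill_value the same cell becomes 0.
with_zero = sample.pivot_table(index="date", columns="feccandid",
                               values="billsum", aggfunc="mean", fill_value=0)

print(with_nan)
print(with_zero)
```

Passing the number 0 rather than the string '0' keeps the filled cells numeric.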
EDIT:
I created a small sample DataFrame. As suggested, you can use T and to_panel to create a panel. Then maybe you need transpose:
import numpy as np
import pandas as pd
zz = pd.DataFrame({'date': {0: 2001, 1: 2001, 2: 2002, 3: 2002},
                   'feccandid': {0: 'S8NV00072', 1: 'S8NV00074',
                                 2: 'S8NV00072', 3: 'S8NV00074'},
                   'pacid': {0: 0.3, 1: 0.1, 2: 0.7, 3: 0.4},
                   'billsum': {0: 1, 1: 2, 2: 5, 3: 6}})
print(zz)
billsum date feccandid pacid
0 1 2001 S8NV00072 0.3
1 2 2001 S8NV00074 0.1
2 5 2002 S8NV00072 0.7
3 6 2002 S8NV00074 0.4
zz = zz.pivot_table(index='date',
                    columns='feccandid',
                    fill_value=0,
                    aggfunc=np.mean)
print(zz.T)
date 2001 2002
feccandid
billsum S8NV00072 1.0 5.0
S8NV00074 2.0 6.0
pacid S8NV00072 0.3 0.7
S8NV00074 0.1 0.4
wp = zz.T.to_panel()
print(wp)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2001 to 2002
Major_axis axis: billsum to pacid
Minor_axis axis: S8NV00072 to S8NV00074
print(wp.transpose(2, 0, 1))
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: S8NV00072 to S8NV00074
Major_axis axis: 2001 to 2002
Minor_axis axis: billsum to pacid
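Panel was removed in pandas 0.25, so to_panel is not available in current releases. As a minimal sketch of one alternative (not the method from the answer above), the same small sample can be melted to long form and pivoted so that (date, original column name) forms the row index and the feccandid values are the only column headers. The names `long_df`, `wide`, `variable` and `value` are illustrative assumptions.

```python
import pandas as pd

zz = pd.DataFrame({
    "date": [2001, 2001, 2002, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072", "S8NV00074"],
    "pacid": [0.3, 0.1, 0.7, 0.4],
    "billsum": [1, 2, 5, 6],
})

# Long form: one row per (date, feccandid, original column name).
long_df = zz.melt(id_vars=["date", "feccandid"],
                  var_name="variable", value_name="value")

# Wide form: the original column names become a row level next to date,
# and the feccandid values are the only column headers.
wide = long_df.pivot_table(index=["date", "variable"], columns="feccandid",
                           values="value", aggfunc="mean")
print(wide)
```

For frames that mix string and numeric columns, as the real data does, `aggfunc="first"` might be a better fit than a mean, assuming each (date, feccandid) pair should contribute a single value.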