Pandas: reshape data frame

Question

I have the following data frame:

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df'

zz=pd.read_csv(url)
zz.head(5)

    date    feccandid   feccandcfscore.dyn  pacid   paccfscore  cid     catcode     type_x  di  amtsum  state   log_diff_unemployment   party   type_y  bills   years_exp   disposition     billsum
0   2006    S8NV00073   0.496   C00000422   0.330   N00006619   H1100   24K     D   5000    NV  -0.024693   Republican  rep     s22-109     12  support     3
1   2006    S8NV00073   0.496   C00375360   0.176   N00006619   H1100   24K     D   4500    NV  -0.024693   Republican  rep     s22-109     12  support     3
2   2006    S8NV00073   0.496   C00113803   0.269   N00006619   H1130   24K     D   2500    NV  -0.024693   Republican  rep     s22-109     12  support     2
3   2006    S8NV00073   0.496   C00249342   0.421   N00006619   H1130   24K     D   5000    NV  -0.024693   Republican  rep     s22-109     12  support     2
4   2006    S8NV00073   0.496   C00255752   0.254   N00006619   H1130   24K     D   4000    NV  -0.024693   Republican  rep     s22-109     12  support     2

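I would like to reshape it so that the date column is the index, the feccandid values are the column headers (later I will set them as a second index so that I can send the frame to a panel), and the other column headers become rows. The desired output looks like this: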

date    feccandid               S8NV00072    S8NV00074    S8NV00075    S8NV00076    S8NV00077
2006    feccandcfscore.dyn      0.496        0.496        0.496        0.496        0.496
2006    pacid                   C00000422    C00375360    C00113803    C00249342    C00255752
2006    paccfscore              0.33         0.176        0.269        0.421        0.254
2006    cid                     N00006619    N00006619    N00006619    N00006619    N00006619
2006    catcode                 H1100        H1100        H1130        H1130        H1130
2006    type_x                  24K          24K          24K          24K          24K
2006    di                      D            D            D            D            D
2006    amtsum                  5000         4500         2500         5000         4000
2006    state                   NV           NV           NV           NV           NV
2006    log_diff_unemployment   -0.024693    -0.024693    -0.024693    -0.024693    -0.024693
2006    party                   Republican   Republican   Republican   Republican   Republican
2006    type_y                  rep          rep          rep          rep          rep
2006    bills                   s22-109      s22-109      s22-109      s22-109      s22-109
2006    years_exp               12           12           12           12           12
2006    disposition             support      support      support      support      support
2006    billsum                 3            3            2            2            2

Following jezrael's recommendation, I tried the following:

zz=zz.pivot_table(index='date', columns='feccandid', aggfunc=np.mean)

zz.head()

    feccandcfscore.dyn  ...     billsum
feccandid   H0AL02087   H0AL07060   H0AR01083   H0AR02107   H0AR03055   H0AR04038   H0AZ01259   H0AZ03362   H0CA15148   H0CA19173   ...     S8MI00158   S8MN00438   S8MS00055   S8MT00010   S8NC00239   S8NE00117   S8NM00010   S8NV00073   S8OR00207   S8WI00026
date                                                                                    
2005    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2006    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     2.125   NaN     NaN
2007    NaN     0.016   NaN     NaN     NaN     -0.151  NaN     NaN     -0.777  NaN     ...     1.000000    NaN     1.666667    1.552632    NaN     NaN     2.0     1.000   NaN     2.0
2008    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     1.285714    NaN     NaN     5.431373    NaN     NaN     NaN     NaN     NaN     NaN
2009    NaN     NaN     NaN     NaN     NaN     -0.086  NaN     NaN     -0.790  NaN     ...     NaN     NaN     NaN     2.433333    NaN     NaN     NaN     NaN     3.0     2.8

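This is close to what I want, except that I am trying to have feccandid as the only column headers, with the original column headers (which, in the last example, are the topmost column headers) converted into rows.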

Answer 1

I think you can use pivot_table (the default aggregation function is np.mean):

df = zz.pivot_table(index='date', columns='feccandid', fill_value=0, aggfunc=np.mean)
# flatten the two-level column index into single strings
df.columns = ['_'.join(col) for col in df.columns.values]
print(df)
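To make the effect of the column-flattening step concrete, here is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the real data). It assumes only numeric value columns, and it passes the aggregation as the string "mean" rather than np.mean.

```python
import pandas as pd

# Small illustrative frame; the values are made up.
zz = pd.DataFrame({
    "date": [2001, 2001, 2002, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072", "S8NV00074"],
    "pacid": [0.3, 0.1, 0.7, 0.4],
    "billsum": [1, 2, 5, 6],
})

# Pivot: rows are dates, columns are (original column, feccandid) pairs.
df = zz.pivot_table(index="date", columns="feccandid", aggfunc="mean")

# Join each (original column, feccandid) tuple into one flat column name.
df.columns = ["_".join(col) for col in df.columns]

print(df.columns.tolist())
```

Each resulting column name combines the original column name with a feccandid value, for example billsum_S8NV00072.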

If you need to replace NaN with 0:

print(zz.pivot_table(index='date', columns='feccandid', fill_value=0, aggfunc=np.mean))
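For the layout requested in the question, with feccandid as the only column header and the original column names moved into the rows, one possible route is to melt the frame first and then pivot. This is only a sketch on a small made-up frame; `long_form` and `reshaped` are illustrative names, and aggfunc="mean" suits numeric columns only (string columns such as party would presumably need something like aggfunc="first").

```python
import pandas as pd

# Small illustrative frame; the values are made up.
zz = pd.DataFrame({
    "date": [2001, 2001, 2002, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072", "S8NV00074"],
    "pacid": [0.3, 0.1, 0.7, 0.4],
    "billsum": [1, 2, 5, 6],
})

# Long form: one row per (date, feccandid, original column name).
long_form = zz.melt(id_vars=["date", "feccandid"],
                    var_name="variable", value_name="value")

# Pivot back: (date, variable) in the rows, feccandid in the columns.
reshaped = long_form.pivot_table(index=["date", "variable"],
                                 columns="feccandid",
                                 values="value",
                                 aggfunc="mean")

print(reshaped)
```

The result has a two-level row index (date, variable) and a single level of feccandid columns, which matches the shape of the desired output.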

EDIT:

I created a small sample DataFrame. As was suggested, you can use T and to_panel to create a Panel. You may then need transpose to reorder the axes:

import numpy as np
import pandas as pd

zz = pd.DataFrame({'date': {0: 2001, 1: 2001, 2: 2002, 3: 2002}, 
                   'feccandid': {0: 'S8NV00072', 1: 'S8NV00074', 
                                 2: 'S8NV00072', 3: 'S8NV00074'}, 
                   'pacid': {0: 0.3, 1: 0.1, 2: 0.7, 3: 0.4},
                   'billsum': {0: 1, 1: 2, 2: 5, 3: 6}})

print(zz)
   billsum  date  feccandid  pacid
0        1  2001  S8NV00072    0.3
1        2  2001  S8NV00074    0.1
2        5  2002  S8NV00072    0.7
3        6  2002  S8NV00074    0.4

zz = zz.pivot_table(index='date',
                    columns='feccandid',
                    fill_value=0,
                    aggfunc=np.mean)
# transpose so that (original column, feccandid) pairs become the rows
print(zz.T)
date               2001  2002
        feccandid            
billsum S8NV00072   1.0   5.0
        S8NV00074   2.0   6.0
pacid   S8NV00072   0.3   0.7
        S8NV00074   0.1   0.4

wp = zz.T.to_panel()
print(wp)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2001 to 2002
Major_axis axis: billsum to pacid
Minor_axis axis: S8NV00072 to S8NV00074

# reorder the axes: feccandid as items, date as major axis, original columns as minor axis
print(wp.transpose(2, 0, 1))

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: S8NV00072 to S8NV00074
Major_axis axis: 2001 to 2002
Minor_axis axis: billsum to pacid
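Note that Panel and to_panel were removed in pandas 0.25, so the snippet above only runs on older versions. On newer versions, a similar three-way structure can be kept as a DataFrame with a MultiIndex. The following is a minimal sketch under that assumption, again on made-up data; `cube` and `one_candidate` are illustrative names.

```python
import pandas as pd

# Small illustrative frame; the values are made up.
zz = pd.DataFrame({
    "date": [2001, 2001, 2002, 2002],
    "feccandid": ["S8NV00072", "S8NV00074", "S8NV00072", "S8NV00074"],
    "pacid": [0.3, 0.1, 0.7, 0.4],
    "billsum": [1, 2, 5, 6],
})

# feccandid plays the role of the items axis, date the major axis,
# and the remaining columns the minor axis.
cube = zz.set_index(["feccandid", "date"]).sort_index()

# Selecting one candidate gives a date-by-column slice,
# much like indexing a Panel by item.
one_candidate = cube.loc["S8NV00072"]

print(one_candidate)
```

Selecting on the outer index level in this way is roughly what indexing the transposed Panel by a feccandid item would return.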
