Reshape a GroupBy in pandas and pad with NaN where values are missing

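Given a DataFrame in which each group (as determined by a groupby on some variable) contains a different number of elements, I need to reshape it into a matrix with a predefined number of columns. For example: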

    summary_x  participant_id_x response_date cuts
0         3.0                11    2016-05-05    a
1         3.0                11    2016-05-06    a
2         4.0                11    2016-05-07    a
3         4.0                11    2016-05-08    a
4         3.0                11    2016-05-09    a
5         3.0                11    2016-05-10    a
6         3.0                11    2016-05-11    a
7         3.0                11    2016-05-12    a
8         3.0                11    2016-05-13    a
9         3.0                11    2016-05-14    a
13        4.0                11    2016-05-22    b
14        4.0                11    2016-05-23    b
15        3.0                11    2016-05-24    b
16        3.0                11    2016-05-25    b
17        3.0                11    2016-05-26    b
18        3.0                11    2016-05-27    b
19        3.0                11    2016-05-28    b
20        3.0                11    2016-06-02    c
21        3.0                11    2016-06-03    c
22        3.0                11    2016-06-04    c
23        3.0                11    2016-06-05    c
24        3.0                11    2016-06-06    c
25        3.0                11    2016-06-07    c
26        3.0                11    2016-06-08    c
27        3.0                11    2016-06-09    c
28        3.0                11    2016-06-10    c
29        5.0                11    2016-06-11    c

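Each group (by 'cuts') contains 10 elements, except group 'b', which contains only 7. I want to reshape the values in 'summary_x' into a (3, 10) matrix, with the missing values filled with NaN:
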
pd.DataFrame(df.summary_x.values.reshape((-1,10)))

      0    1    2    3    4    5    6    7    8    9
0   3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1   nan  nan  nan  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2   3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0

Is there a way to do this?

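You can use cumcount with ascending=False to number the rows of each group from the end, pivot on that counter, and then reverse the column order with [::-1]: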

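import numpy as np
import pandas as pd
g = df.groupby('cuts').cumcount(ascending=False)  # last row of each group gets 0
df = (df.assign(g=g)  # 'g' is a temporary helper column
        .pivot(index='cuts', columns='g', values='summary_x')
        .iloc[:, ::-1]
        .reset_index(drop=True))
df.columns = np.arange(len(df.columns))
print(df)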
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  NaN  NaN  NaN  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0
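
To see why the reversed counter puts the padding on the left, it can help to look at a smaller case. The following is only a sketch: the column names `cuts` and `summary_x` follow the question, while the frame `demo`, its values, and the helper column name `pos` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: group 'a' has 3 rows, group 'b' only 2.
demo = pd.DataFrame({
    "cuts": ["a", "a", "a", "b", "b"],
    "summary_x": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Count down within each group, so the last row of every group gets 0.
pos = demo.groupby("cuts").cumcount(ascending=False)

wide = (
    demo.assign(pos=pos)  # 'pos' is a hypothetical helper column
        .pivot(index="cuts", columns="pos", values="summary_x")
        .iloc[:, ::-1]    # reverse the columns so each group's first value comes first
        .reset_index(drop=True)
)
wide.columns = np.arange(wide.shape[1])
print(wide)
```

Because every group ends at counter 0, the short group simply has no entries for the highest counters, and after reversing the columns those gaps sit at the start of the row.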

Another solution:

# df is the original frame here, not the pivoted result from above
L = df[::-1].groupby('cuts')['summary_x'].apply(list).values.tolist()
df = pd.DataFrame(L).iloc[:, ::-1]
df.columns = np.arange(len(df.columns))
print(df)
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  NaN  NaN  NaN  4.0  4.0  3.0  3.0  3.0  3.0  3.0
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0
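
The list-based approach relies on the DataFrame constructor padding shorter rows with NaN on the right. A minimal sketch of just that step, using made-up lists rather than the question's data (the names `rows`, `padded` and `flipped` are illustrative):

```python
import pandas as pd

# Each inner list stands for one group's values, already in reversed order.
rows = [[3.0, 2.0, 1.0], [5.0, 4.0]]

padded = pd.DataFrame(rows)      # the short row is padded with NaN on the right
flipped = padded.iloc[:, ::-1]   # reversing the columns moves the padding to the left
flipped.columns = list(range(flipped.shape[1]))
print(flipped)
```

Reversing the rows first and the columns afterwards is what restores the original order of the values while leaving the NaNs at the front.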

But if the NaNs can go at the end of each row instead:

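# df is the original frame again
g = df.groupby('cuts').cumcount()  # 0, 1, 2, ... within each group
df = (df.assign(g=g)  # 'g' is a temporary helper column
        .pivot(index='cuts', columns='g', values='summary_x')
        .rename_axis(None, axis=1)
        .reset_index(drop=True))

print(df)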
     0    1    2    3    4    5    6    7    8    9
0  3.0  3.0  4.0  4.0  3.0  3.0  3.0  3.0  3.0  3.0
1  4.0  4.0  3.0  3.0  3.0  3.0  3.0  NaN  NaN  NaN
2  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  3.0  5.0
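
The question asks for a predefined number of columns, whereas the approaches above produce as many columns as the largest group has rows. One possible way to force a fixed width is to reindex the columns after pivoting. This is a sketch under assumptions: `demo`, `pos` and the target `width` are made-up names and values, not part of the original answer.

```python
import pandas as pd

# Made-up data: the largest group has 3 rows.
demo = pd.DataFrame({
    "cuts": ["a", "a", "a", "b", "b"],
    "summary_x": [1.0, 2.0, 3.0, 4.0, 5.0],
})
width = 4  # hypothetical target number of columns

pos = demo.groupby("cuts").cumcount()
fixed = (
    demo.assign(pos=pos)  # 'pos' is a hypothetical helper column
        .pivot(index="cuts", columns="pos", values="summary_x")
        .reindex(columns=range(width))  # missing column labels become all-NaN columns
        .rename_axis(None, axis=1)
        .reset_index(drop=True)
)
print(fixed)
```

Note that if `width` is smaller than the largest group, reindex drops the extra columns without warning, so it is worth checking the group sizes first.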