Create new rows in a dataframe from a column specifying number of rows needed (flatten?)

I have data in the following format:

    ID          qi        di            b       start_date   end_date    delta_t
0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      230
1  2445252  304.377500  1.000000  2.000000e+01 2020-07-09 2021-05-06      222
2  3323232   16.047443  0.017908  3.556858e-09 2020-07-10 2021-05-26      221
3  4444242  190.799229  0.162360  2.000000e+01 2020-07-11 2021-05-06      220
4  5555366  153.341044  0.000195  2.730887e-04 2020-07-01 2021-05-26      230
5  6343423  518.195900  0.000073  1.516531e+01 2020-07-12 2021-05-10      219

I would like a frame like the one below, where each ID is expanded into rows with delta_t running from 0 to its original delta_t.

     ID          qi        di            b       start_date   end_date    delta_t
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      0
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      1
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      2
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      3
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      4
...
 0  1232111  363.639856  0.902817  2.000000e+01 2020-07-01 2021-05-05      230

Any help is appreciated!

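import pandas as pd

# Build rows 0 .. delta_t - 1 for every original row. The original row
# already carries delta_t itself, so concatenating it completes 0 .. delta_t.
extra_delta_ts = []
for row in df.itertuples():
    for i in range(row.delta_t):
        row_data = [row.ID, row.qi, row.di, row.b, row.start_date, row.end_date, i]
        extra_delta_ts.append(row_data)

columns = ['ID', 'qi', 'di', 'b', 'start_date', 'end_date', 'delta_t']
extra_delta_ts_df = pd.DataFrame(extra_delta_ts, columns=columns)
# Combine the original rows with the generated ones and put them in order.
concat_df = pd.concat([df, extra_delta_ts_df])
concat_df.sort_values(by=['ID', 'delta_t'], inplace=True)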

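One way is to first build a repeated index for the frame, with the repeat counts taken from the delta_t column, and then select with loc. To reset delta_t to 0..N within each group, we can use cumcount: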

# +1 at the end is to include `N` in `0..N`
repeated_inds = df.index.repeat(repeats=df.delta_t + 1)

new_df = df.loc[repeated_inds]

# number the copies of each original row 0..N; level=0 groups by the repeated index
new_df["delta_t"] = new_df.groupby(level=0).cumcount()

which gives

>>> new_df

         ID          qi        di         b  start_date    end_date  delta_t
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        0
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        1
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        2
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        3
0   1232111  363.639856  0.902817  20.00000  2020-07-01  2021-05-05        4
..      ...         ...       ...       ...         ...         ...      ...
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      215
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      216
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      217
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      218
5   6343423  518.195900  0.000073  15.16531  2020-07-12  2021-05-10      219

[1348 rows x 7 columns]

A sanity check:

>>> df.delta_t.add(1).sum() == len(new_df)
True
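
To try the repeat-then-cumcount idea in isolation, here is a minimal sketch on a small made-up frame. The `toy` data and the helper name `expand_rows` are assumptions for illustration and do not come from the question. The sketch groups on the repeated index rather than on a data column, so rows that happen to share the same delta_t are still numbered separately.

```python
import pandas as pd

def expand_rows(df, count_col="delta_t"):
    """Repeat each row count_col + 1 times and renumber count_col as 0..N."""
    # Repeat every index label N + 1 times so that both 0 and N are included.
    repeated = df.index.repeat(df[count_col] + 1)
    out = df.loc[repeated].copy()
    # Number the copies of each original row; level=0 is the repeated index.
    out[count_col] = out.groupby(level=0).cumcount().to_numpy()
    return out

# Hypothetical toy data; two rows deliberately share the same delta_t.
toy = pd.DataFrame({"ID": [11, 22, 33], "qi": [1.5, 2.5, 3.5], "delta_t": [2, 1, 2]})
expanded = expand_rows(toy)
print(expanded)
```

This assumes the frame's index is unique before expansion. If it is not, calling `reset_index(drop=True)` on the input first should make the grouping reliable.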

I know this is one possible solution, but there may be a more Pythonic way.

# turn each delta_t into the list [0, 1, ..., delta_t]
df['delta_t'] = df['delta_t'].transform(lambda x: list(range(x + 1)))
# give every list element its own row
df = df.explode('delta_t')

Exploding the newly created list column then gives the desired dataframe:

    ID          qi        di          b       start_date   end_date delta_t
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       0
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       1
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       2
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       3
0   1232111  363.639856  0.902817  20.000000 2020-07-01 2021-05-05       4
..           ...         ...       ...        ...        ...        ...     ...
5   6343423  518.195900  0.000073  15.165307 2020-07-12 2021-05-10     215
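
As a self-contained sketch of the explode approach, the snippet below runs it on a small made-up frame (the `toy` data is an assumption for illustration, not the question's frame). It also adds two optional steps that the answer does not include: casting delta_t back to an integer dtype, since `explode` leaves an object column, and resetting the repeated index.

```python
import pandas as pd

# Hypothetical toy data, not the question's frame.
toy = pd.DataFrame({"ID": [11, 22], "qi": [1.5, 2.5], "delta_t": [2, 1]})

# Build the list [0, 1, ..., N] for each row, then give each element its own row.
exploded = (
    toy.assign(delta_t=toy["delta_t"].apply(lambda n: list(range(n + 1))))
       .explode("delta_t")
       .astype({"delta_t": "int64"})  # explode leaves an object column
       .reset_index(drop=True)        # optional: replace the repeated index
)
print(exploded)
```

Because it uses `assign`, this version leaves `toy` itself unchanged, whereas the two-line form overwrites `df` in place.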