Create new rows in a dataframe from a column specifying the number of rows needed (flatten?)
I have data in the following format:
ID qi di b start_date end_date delta_t
0 1232111 363.639856 0.902817 2.000000e+01 2020-07-01 2021-05-05 230
1 2445252 304.377500 1.000000 2.000000e+01 2020-07-09 2021-05-06 222
2 3323232 16.047443 0.017908 3.556858e-09 2020-07-10 2021-05-26 221
3 4444242 190.799229 0.162360 2.000000e+01 2020-07-11 2021-05-06 220
4 5555366 153.341044 0.000195 2.730887e-04 2020-07-01 2021-05-26 230
5 6343423 518.195900 0.000073 1.516531e+01 2020-07-12 2021-05-10 219
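For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the sample frame. The values are copied from the table above; storing the two date columns as datetimes is an assumption, since the printed table does not show the dtypes. `expected_rows` is a hypothetical helper name used only to show how many rows the expanded frame should have.

```python
import pandas as pd

# Sample data copied from the table above.
# Assumption: start_date and end_date are datetimes in the real data.
df = pd.DataFrame(
    {
        "ID": [1232111, 2445252, 3323232, 4444242, 5555366, 6343423],
        "qi": [363.639856, 304.377500, 16.047443, 190.799229, 153.341044, 518.195900],
        "di": [0.902817, 1.000000, 0.017908, 0.162360, 0.000195, 0.000073],
        "b": [2.000000e+01, 2.000000e+01, 3.556858e-09, 2.000000e+01, 2.730887e-04, 1.516531e+01],
        "start_date": pd.to_datetime(
            ["2020-07-01", "2020-07-09", "2020-07-10", "2020-07-11", "2020-07-01", "2020-07-12"]
        ),
        "end_date": pd.to_datetime(
            ["2021-05-05", "2021-05-06", "2021-05-26", "2021-05-06", "2021-05-26", "2021-05-10"]
        ),
        "delta_t": [230, 222, 221, 220, 230, 219],
    }
)

# Each row should expand into delta_t + 1 rows (the values 0..delta_t inclusive)
expected_rows = int((df["delta_t"] + 1).sum())
```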
I would like a frame like the one below, where each ID is expanded to contain one row for every value from 0 to delta_t.
ID qi di b start_date end_date delta_t
0 1232111 363.639856 0.902817 2.000000e+01 2020-07-01 2021-05-05 0
0 1232111 363.639856 0.902817 2.000000e+01 2020-07-01 2021-05-05 1
0 1232111 363.639856 0.902817 2.000000e+01 2020-07-01 2021-05-05 2
0 1232111 363.639856 0.902817 2.000000e+01 2020-07-01 2021-05-05 3
0 1232111 363.639856 0.902817 2.000000e+01 2020-07-01 2021-05-05 4
...
0 1232111 363.639856 0.902817 2.000000e+01 2020-07-01 2021-05-05 230
Any help is appreciated!
import pandas as pd

extra_delta_ts = []
for row in df.itertuples():
    # range(row.delta_t) yields 0..delta_t-1; the original row, concatenated below, supplies delta_t itself
    for i in range(row.delta_t):
        row_data = [row.ID, row.qi, row.di, row.b, row.start_date, row.end_date, i]
        extra_delta_ts.append(row_data)

columns = ['ID', 'qi', 'di', 'b', 'start_date', 'end_date', 'delta_t']
extra_delta_ts_df = pd.DataFrame(extra_delta_ts, columns=columns)
# Append the generated rows to the original frame, then order them within each ID
concat_df = pd.concat([df, extra_delta_ts_df])
concat_df.sort_values(by=['ID', 'delta_t'], inplace=True)
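The same loop can be written without hard-coding the column names and without the concat and sort steps. This is only a sketch on a tiny made-up frame (the `ID`, `qi` and `delta_t` values here are hypothetical), and it assumes every column name is a valid Python identifier, because `itertuples` renames fields that are not.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real data
df = pd.DataFrame({"ID": [11, 22], "qi": [1.5, 2.5], "delta_t": [2, 1]})

# itertuples(index=False) yields namedtuples; _replace swaps in the new delta_t.
# range(delta_t + 1) covers 0..delta_t inclusive, so the original row
# does not need to be concatenated back in afterwards.
rows = [
    row._replace(delta_t=i)
    for row in df.itertuples(index=False)
    for i in range(row.delta_t + 1)
]
expanded = pd.DataFrame(rows, columns=df.columns)
```

Rows come out already ordered by original row and then by delta_t, so no sort is needed. It is still a Python-level loop, so it will be slow on large frames.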
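One approach is to first build a repeated index for the frame, with the repeat counts taken from the delta_t column, and then select those rows with loc. To reset delta_t to 0..N within each group, we can use cumcount: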
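# +1 at the end is to include `N` in `0..N`
repeated_inds = df.index.repeat(repeats=df.delta_t + 1)
new_df = df.loc[repeated_inds]
# group by ID, not delta_t: rows 0 and 4 share delta_t == 230 and would otherwise be counted as one group
new_df.delta_t = new_df.groupby("ID").cumcount()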
which gives
>>> new_df
ID qi di b start_date end_date delta_t
0 1232111 363.639856 0.902817 20.00000 2020-07-01 2021-05-05 0
0 1232111 363.639856 0.902817 20.00000 2020-07-01 2021-05-05 1
0 1232111 363.639856 0.902817 20.00000 2020-07-01 2021-05-05 2
0 1232111 363.639856 0.902817 20.00000 2020-07-01 2021-05-05 3
0 1232111 363.639856 0.902817 20.00000 2020-07-01 2021-05-05 4
.. ... ... ... ... ... ... ...
5 6343423 518.195900 0.000073 15.16531 2020-07-12 2021-05-10 215
5 6343423 518.195900 0.000073 15.16531 2020-07-12 2021-05-10 216
5 6343423 518.195900 0.000073 15.16531 2020-07-12 2021-05-10 217
5 6343423 518.195900 0.000073 15.16531 2020-07-12 2021-05-10 218
5 6343423 518.195900 0.000073 15.16531 2020-07-12 2021-05-10 219
[1348 rows x 7 columns]
A sanity check:
>>> df.delta_t.add(1).sum() == len(new_df)
True
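One detail worth checking with this technique is what the counter is grouped on. In the sample data rows 0 and 4 both have delta_t equal to 230, so a counter grouped on the delta_t values themselves would run across both rows instead of restarting. The sketch below, on a small made-up frame, groups on the repeated index labels instead, which does not depend on any column being unique. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame in which two rows share the same delta_t
df = pd.DataFrame({"ID": [11, 22, 33], "qi": [1.5, 2.5, 3.5], "delta_t": [2, 1, 2]})

# Repeat each index label delta_t + 1 times, then select those rows
new_df = df.loc[df.index.repeat(df["delta_t"] + 1)].copy()

# Count within each original row (index level 0), so rows 0 and 2
# each get their own 0..2 counter even though their delta_t is equal
new_df["delta_t"] = new_df.groupby(level=0).cumcount()

# Optional: replace the repeated labels with a fresh 0..n-1 index
new_df = new_df.reset_index(drop=True)
```

The length check from above still applies: `len(new_df)` should equal `df.delta_t.add(1).sum()`.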
I know this is one possible solution, but there may be a more Pythonic way.
df['delta_t'] = df['delta_t'].transform(lambda x: list(range(0, x + 1)))
df = df.explode('delta_t')
Exploding this newly created list column gives the desired dataframe:
ID qi di b start_date end_date delta_t
0 1232111 363.639856 0.902817 20.000000 2020-07-01 2021-05-05 0
0 1232111 363.639856 0.902817 20.000000 2020-07-01 2021-05-05 1
0 1232111 363.639856 0.902817 20.000000 2020-07-01 2021-05-05 2
0 1232111 363.639856 0.902817 20.000000 2020-07-01 2021-05-05 3
0 1232111 363.639856 0.902817 20.000000 2020-07-01 2021-05-05 4
.. ... ... ... ... ... ... ...
5 6343423 518.195900 0.000073 15.165307 2020-07-12 2021-05-10 215
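Two side effects of the explode approach are easy to miss: the exploded column comes back with object dtype, and the original `df` is overwritten. A minimal sketch of a variant that leaves `df` untouched and restores an integer dtype is below; the frame and the name `out` are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real data
df = pd.DataFrame({"ID": [11, 22], "qi": [1.5, 2.5], "delta_t": [2, 1]})

out = (
    # Replace each delta_t with the list [0, 1, ..., delta_t] on a copy of df
    df.assign(delta_t=df["delta_t"].apply(lambda n: list(range(n + 1))))
    # One output row per list element; other columns are repeated
    .explode("delta_t")
    # explode leaves the column as object dtype, so cast back to integers
    .astype({"delta_t": "int64"})
    # Drop the repeated index labels in favour of 0..n-1
    .reset_index(drop=True)
)
```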