如何对一行的列进行分组?

how to group columns of a row?

我有以下 DataFrame

df = pd.DataFrame(
    {
        "date": ["2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 12:00:00"],
        "plant_type": [1, 1, 1, 2, 3],
        "cultivation_table": [1, 1, 1, 1, 2],
        "farmer": [1, 1, 1, 1, 2],
        "activity": ["water", "germinate", "compost", "water", "germinate"],
        "duration": ["20s", "45s", "18s", "10min", "13min"],
        "in_time": [0, 1, 1, 0, 1],
        "finished": [1, 1, 1, 1, 0],
    }
)


date                plant_type cultivation_table  farmer  activity  duration in_time  finished
2022-03-01 10:00:00      1               1           1     water       20s     0          1
2022-03-01 10:00:00      1               1           1     germinate   45s     1          1
2022-03-01 10:00:00      1               1           1     compost     18s     1          1
2022-03-01 10:00:00      2               1           1     water       10min   0          1
2022-03-01 12:00:00      3               2           2     germinate   13min   1          0

我需要按日期 plant_type、cultivation_table、农民进行分组,并在列中保留 activity、持续时间、in_time 和完成。 我需要得到一个像下面这样的 table:

date                plant_type cultivation_table  farmer   water    water_in_time   water_finished   germinate   germinate_in_time   germinate_finished   compost  germinate_in_time   germinate_finished
2022-03-01 10:00:00      1               1           1      20s            0               1            45s             1                     1             18s            1                   1
2022-03-01 10:00:00      2               1           1      10m            0               1            0s              0                     0             0s             0                   0
2022-03-01 12:00:00      3               2           2       0s            0               1            13min           1                     0             0s             0                   0

我正在测试 pivot 并设法得到以下结果:

    date                   plant_type    cultivation_table    farmer     activity     compost    germinate   water
    2022-03-01 10:00:00        1               1                1         water         18s         45s       20s 
    2022-03-01 10:00:00        2               1                1         water          0           0        10m     
    2022-03-01 12:00:00        3               2                2       germinate        0         13min       0

这是代码:

(df.groupby(["date", 'plant_type', 'cultivation_table', 'farmer'])['activity'].first().reset_index()
                     .merge(df.pivot(['date', 'plant_type', 'cultivation_table', 'farmer'], 'activity', 'duration')
                            .fillna(0).reset_index(), on=["date", 'plant_type', 'cultivation_table', 'farmer']))

IIUC,pivot 就够了。剩下的就是如何填充缺失值的问题了:

out = df.pivot(['date', 'plant_type', 'cultivation_table', 'farmer'], 
               'activity', 
               ['duration', 'in_time', 'finished'])
out['duration'] = out['duration'].fillna('0s')
out.loc[:, ['in_time','finished']] = out[['in_time','finished']].fillna(0)
out.columns = [y if x=='duration' else f'{y}_{x}' for x,y in out.columns]
out = out.sort_index(axis=1, ascending=False).reset_index()

输出:

                  date  plant_type  cultivation_table  farmer  water_in_time  water_finished  water  germinate_in_time  germinate_finished germinate  compost_in_time  compost_finished compost
0  2022-03-01 10:00:00           1                  1       1              0               1    20s                  1                   1       45s                1                 1     18s
1  2022-03-01 10:00:00           2                  1       1              0               1  10min                  0                   0        0s                0                 0      0s
2  2022-03-01 12:00:00           3                  2       2              0               0     0s                  1                   0     13min                0                 0      0s