How to group columns of a row?
I have the following DataFrame:
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 10:00:00", "2022-03-01 12:00:00"],
        "plant_type": [1, 1, 1, 2, 3],
        "cultivation_table": [1, 1, 1, 1, 2],
        "farmer": [1, 1, 1, 1, 2],
        "activity": ["water", "germinate", "compost", "water", "germinate"],
        "duration": ["20s", "45s", "18s", "10min", "13min"],
        "in_time": [0, 1, 1, 0, 1],
        "finished": [1, 1, 1, 1, 0],
    }
)
date                 plant_type  cultivation_table  farmer  activity   duration  in_time  finished
2022-03-01 10:00:00  1           1                  1       water      20s       0        1
2022-03-01 10:00:00  1           1                  1       germinate  45s       1        1
2022-03-01 10:00:00  1           1                  1       compost    18s       1        1
2022-03-01 10:00:00  2           1                  1       water      10min     0        1
2022-03-01 12:00:00  3           2                  2       germinate  13min     1        0
I need to group by date, plant_type, cultivation_table, and farmer, and keep activity, duration, in_time, and finished as columns.
I need to get a table like the following:
date                 plant_type  cultivation_table  farmer  water  water_in_time  water_finished  germinate  germinate_in_time  germinate_finished  compost  compost_in_time  compost_finished
2022-03-01 10:00:00  1           1                  1       20s    0              1               45s        1                  1                   18s      1                1
2022-03-01 10:00:00  2           1                  1       10min  0              1               0s         0                  0                   0s       0                0
2022-03-01 12:00:00  3           2                  2       0s     0              1               13min      1                  0                   0s       0                0
I have been experimenting with pivot and managed to get the following result:
date                 plant_type  cultivation_table  farmer  activity   compost  germinate  water
2022-03-01 10:00:00  1           1                  1       water      18s      45s        20s
2022-03-01 10:00:00  2           1                  1       water      0        0          10min
2022-03-01 12:00:00  3           2                  2       germinate  0        13min      0
Here is the code:
(df.groupby(["date", 'plant_type', 'cultivation_table', 'farmer'])['activity'].first().reset_index()
   .merge(df.pivot(index=['date', 'plant_type', 'cultivation_table', 'farmer'], columns='activity', values='duration')
            .fillna(0).reset_index(), on=["date", 'plant_type', 'cultivation_table', 'farmer']))
IIUC, pivot is enough. All that remains is deciding how to fill the missing values:
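# Pivot all three value columns at once (keyword arguments are required in pandas 2.0+)
out = df.pivot(index=['date', 'plant_type', 'cultivation_table', 'farmer'],
               columns='activity',
               values=['duration', 'in_time', 'finished'])
# Activities missing from a group get a zero duration and zero flags
out['duration'] = out['duration'].fillna('0s')
out.loc[:, ['in_time','finished']] = out[['in_time','finished']].fillna(0)
# Flatten the (value, activity) column pairs: 'water' for duration, 'water_in_time' etc. otherwise
out.columns = [y if x=='duration' else f'{y}_{x}' for x,y in out.columns]
out = out.sort_index(axis=1, ascending=False).reset_index()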
Output:
   date                 plant_type  cultivation_table  farmer  water_in_time  water_finished  water  germinate_in_time  germinate_finished  germinate  compost_in_time  compost_finished  compost
0  2022-03-01 10:00:00  1           1                  1       0              1               20s    1                  1                   45s        1                1                 18s
1  2022-03-01 10:00:00  2           1                  1       0              1               10min  0                  0                   0s         0                0                 0s
2  2022-03-01 12:00:00  3           2                  2       0              0               0s     1                  0                   13min      0                0                 0s
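If the exact column order from the question matters (duration first, then in_time, then finished, per activity), one option is to build the columns explicitly instead of sorting them. The following is only a sketch of that idea, not part of the original answer; the names keys, wide, parts, and cols are introduced here for illustration, and the activity order is assumed to be the order of first appearance in df.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["2022-03-01 10:00:00"] * 4 + ["2022-03-01 12:00:00"],
        "plant_type": [1, 1, 1, 2, 3],
        "cultivation_table": [1, 1, 1, 1, 2],
        "farmer": [1, 1, 1, 1, 2],
        "activity": ["water", "germinate", "compost", "water", "germinate"],
        "duration": ["20s", "45s", "18s", "10min", "13min"],
        "in_time": [0, 1, 1, 0, 1],
        "finished": [1, 1, 1, 1, 0],
    }
)

# Grouping keys (hypothetical helper name)
keys = ["date", "plant_type", "cultivation_table", "farmer"]

# One pivot for all three value columns; columns become (value, activity) pairs
wide = df.pivot(index=keys, columns="activity",
                values=["duration", "in_time", "finished"])

# Fill each value block with its own default and restore integer flags
parts = {
    "duration": wide["duration"].fillna("0s"),
    "in_time": wide["in_time"].fillna(0).astype(int),
    "finished": wide["finished"].fillna(0).astype(int),
}

# Assemble columns activity by activity, in order of first appearance
cols = {}
for act in df["activity"].unique():
    cols[act] = parts["duration"][act]
    cols[f"{act}_in_time"] = parts["in_time"][act]
    cols[f"{act}_finished"] = parts["finished"][act]

out = pd.DataFrame(cols).reset_index()
print(out)
```

Because the dictionary preserves insertion order, the resulting frame lists water, water_in_time, water_finished, then the same triple for germinate and compost, after the four key columns.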