Convert a wide dataframe to a long dataframe with specific conditions and add new columns
I have a sample dataframe, shown below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID':['A','A','A','A','A','A','A','A','A','C','C','C','C','C','C','C','C'],
        'Week': ['Week1','Week1','Week1','Week1','Week2','Week2','Week2','Week2','Week3',
                 'Week1','Week1','Week1','Week1','Week2','Week2','Week2','Week2'],
        'Risk':['High','','','','','','','','','High','','','','','','',''],
        'Testing':[NaN,'Pos',NaN,'Neg',NaN,NaN,NaN,NaN,'Pos', NaN,
                   NaN,NaN,'Negative',NaN,NaN,NaN,'Positive'],
        'Week1_adher':['Yes',NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,'No',NaN,NaN,NaN,NaN,NaN,NaN,NaN],
        'Week2_adher':['No',NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,'No',NaN,NaN,NaN,NaN,NaN,NaN,NaN],
        'Week3_adher':['No',NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,'No',NaN,NaN,NaN,NaN,NaN,NaN,NaN]}
df1 = pd.DataFrame(data)
df1
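The final dataframe must have one row per week for each participant. Once the week columns (Week1_adher, Week2_adher, Week3_adher) are converted to rows, each row should hold the corresponding value.
In addition, the number of non-NA values in the 'Testing' column for each participant and week should be stored as the "#of test" value.
The final dataframe should look like the one shown in the image below.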
Preprocess the dataframe by creating two new columns, then group by ID and Week, and finally aggregate the new columns:
# True on rows where any WeekN_adher column equals 'Yes'
df1['SurveyAdherence'] = df1.filter(regex=r'Week\d+_adher').eq('Yes').any(axis=1)
# True on rows that carry a test result
df1['#Tests'] = df1['Testing'].notna()
# Every (ID, Week) pair, so weeks with no rows for a participant still appear
mi = pd.MultiIndex.from_product([df1['ID'].unique(), df1['Week'].unique()],
                                names=['ID', 'Week'])
out = df1.groupby(['ID', 'Week']) \
         .agg({'SurveyAdherence': 'max', '#Tests': 'sum'})
out = out.reindex(mi) \
         .fillna({'SurveyAdherence': False, '#Tests': 0}) \
         .astype({'SurveyAdherence': bool, '#Tests': int}) \
         .reset_index()
Output:
>>> out
  ID   Week  SurveyAdherence  #Tests
0  A  Week1             True       2
1  A  Week2            False       0
2  A  Week3            False       1
3  C  Week1            False       1
4  C  Week2            False       1
5  C  Week3            False       0
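One caveat with the row-wise flag: `any(axis=1)` attaches the result to the Week of the row that holds the answers, not to the week named in the column. In the sample data the answers sit on each participant's first (Week1) row, so a 'Yes' in Week2_adher would still be reported under Week1. If the column name is meant to decide the week, a melt-based variant may be closer to the request. The following is only a sketch under that assumption; the names `adher_cols`, `adher`, `tests` and `out_alt` are made up for illustration, and the unused 'Risk' column is left out to keep the example short.

```python
import numpy as np
import pandas as pd

NaN = np.nan
# Same sample data as in the question, written compactly (without 'Risk')
df1 = pd.DataFrame({
    'ID': ['A'] * 9 + ['C'] * 8,
    'Week': ['Week1'] * 4 + ['Week2'] * 4 + ['Week3']
            + ['Week1'] * 4 + ['Week2'] * 4,
    'Testing': [NaN, 'Pos', NaN, 'Neg', NaN, NaN, NaN, NaN, 'Pos',
                NaN, NaN, NaN, 'Negative', NaN, NaN, NaN, 'Positive'],
    'Week1_adher': ['Yes'] + [NaN] * 8 + ['No'] + [NaN] * 7,
    'Week2_adher': ['No'] + [NaN] * 8 + ['No'] + [NaN] * 7,
    'Week3_adher': ['No'] + [NaN] * 8 + ['No'] + [NaN] * 7,
})

adher_cols = df1.filter(regex=r'^Week\d+_adher$').columns.tolist()

# Assumption: one adherence answer per participant and week column.
# first() picks the first non-missing value of each column within an ID.
adher = (
    df1.groupby('ID')[adher_cols].first()
       .reset_index()
       .melt(id_vars='ID', var_name='Week', value_name='SurveyAdherence')
)
# 'Week1_adher' -> 'Week1', so the column name decides the week
adher['Week'] = adher['Week'].str.replace('_adher', '', regex=False)
adher['SurveyAdherence'] = adher['SurveyAdherence'].eq('Yes')

# count() skips NaN, giving the number of recorded tests per ID and week
tests = (
    df1.groupby(['ID', 'Week'])['Testing'].count()
       .rename('#Tests')
       .reset_index()
)

# Left merge keeps weeks that have an adherence column but no rows (C, Week3)
out_alt = (
    adher.merge(tests, on=['ID', 'Week'], how='left')
         .fillna({'#Tests': 0})
         .astype({'#Tests': int})
         .sort_values(['ID', 'Week'])  # plain string sort; fine below Week10
         .reset_index(drop=True)
)
print(out_alt)
```

On the sample data this gives the same six rows as the groupby answer. The two approaches would differ only when a 'Yes' appears in a column for a week other than the one on the row that stores it.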