Groupby on 2 columns and split a column into 2 columns with the first 2 non-NaN values

I have a sample DataFrame, shown below.

import pandas as pd
import numpy as np

NaN = np.nan
data = {'ID':['A','A','A','A','A','A','A','A','A','C','C','C','C','C','C','C','C'],
    'Week': ['Week1','Week1','Week1','Week1','Week2','Week2','Week2','Week2','Week3',
             'Week1','Week1','Week1','Week1','Week2','Week2','Week2','Week2'],
    'Risk':['High','','','','','','','','','High','','','','','','',''],
    'Testing':[NaN,'Pos',NaN,'Neg',NaN,NaN,NaN,NaN,'Pos', NaN, 
              NaN,NaN,'Negative',NaN,NaN,NaN,'Positive'],
    'CloseContact': [NaN, 'True', NaN, NaN, 'False',NaN, NaN, 'False', 'True', 
                    NaN, NaN, 'False', NaN, 'True','True','False', NaN ]}
    
df1 = pd.DataFrame(data)
df1 

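Now I need to create two columns, CC1 and CC2. For each ID and for each week (important), CC1 should get the first non-null value of the 'CloseContact' column, and CC2 should get the second non-null value of the 'CloseContact' column.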

The final DataFrame should look like the image below.

Any help is greatly appreciated. Thanks.

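Try the following. Note that the sample data here differs slightly from the question's: a C/Week3 row is added, and the second 'CloseContact' value is NaN instead of 'True'.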

import pandas as pd
import numpy as np

NaN = np.nan
data = {'ID': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
        'Week': ['Week1', 'Week1', 'Week1', 'Week1', 'Week2', 'Week2', 'Week2', 'Week2', 'Week3',
                 'Week1', 'Week1', 'Week1', 'Week1', 'Week2', 'Week2', 'Week2', 'Week2', 'Week3'],
        'Risk': ['High', '', '', '', '', '', '', '', '', 'High', '', '', '', '', '', '', '', ''],
        'Testing': [NaN, 'Pos', NaN, 'Neg', NaN, NaN, NaN, NaN, 'Pos', NaN,
                    NaN, NaN, 'Negative', NaN, NaN, NaN, 'Positive', NaN],
        'CloseContact': [NaN, NaN, NaN, NaN, 'False', NaN, NaN, 'False', 'True',
                         NaN, NaN, 'False', NaN, 'True', 'True', 'False', NaN, NaN]}

df1 = pd.DataFrame(data)

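# For each (ID, Week) group, keep the first two non-null CloseContact values as an array
df = df1.groupby(['ID', 'Week'])['CloseContact'].apply(lambda x: x[x.notnull()].values[0:2]).reset_index()
# Expand the per-group arrays into two columns; shorter arrays leave the missing slots empty
df[['CC1', 'CC2']] = pd.DataFrame(df.CloseContact.tolist(), index=df.index)
# Drop the intermediate array column
df.drop(columns=['CloseContact'], inplace=True)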
print(df)

Original DataFrame:

   ID   Week  Risk   Testing CloseContact
0   A  Week1  High       NaN          NaN
1   A  Week1             Pos          NaN
2   A  Week1             NaN          NaN
3   A  Week1             Neg          NaN
4   A  Week2             NaN        False
5   A  Week2             NaN          NaN
6   A  Week2             NaN          NaN
7   A  Week2             NaN        False
8   A  Week3             Pos         True
9   C  Week1  High       NaN          NaN
10  C  Week1             NaN          NaN
11  C  Week1             NaN        False
12  C  Week1        Negative          NaN
13  C  Week2             NaN         True
14  C  Week2             NaN         True
15  C  Week2             NaN        False
16  C  Week2        Positive          NaN
17  C  Week3             NaN          NaN

Final output:

  ID   Week    CC1    CC2
0  A  Week1   None   None
1  A  Week2  False  False
2  A  Week3   True   None
3  C  Week1  False   None
4  C  Week2   True   True
5  C  Week3   None   None
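
One caveat with the snippet above: the assignment to `['CC1', 'CC2']` relies on at least one group having two non-null values, since otherwise the expanded frame has fewer than two columns and the assignment would likely raise an error. A minimal sketch of a variant that always pads each group to exactly two slots is shown below. The helper name `first_two` and the small `demo` frame are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd


def first_two(s: pd.Series) -> pd.Series:
    """Return the first two non-null values of s, padded with NaN to length 2."""
    vals = s.dropna().tolist()[:2]
    vals += [np.nan] * (2 - len(vals))
    return pd.Series(vals, index=['CC1', 'CC2'])


# Hypothetical demo data in which no (ID, Week) group has two non-null values
demo = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'C'],
    'Week': ['Week1', 'Week1', 'Week2', 'Week1'],
    'CloseContact': [np.nan, 'True', np.nan, 'False'],
})

# apply() yields a Series indexed by (ID, Week, CC1/CC2); unstack() moves
# the innermost level into columns, giving one row per (ID, Week) group.
result = (demo.groupby(['ID', 'Week'])['CloseContact']
              .apply(first_two)
              .unstack()
              .reset_index())
print(result)
```

Because every group returns a Series with the same two labels, both columns exist even when CC2 is empty for all groups.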

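Alternatively, starting from the question's original df1, build the full ID/Week index first and reindex the result so that groups without any value are kept: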

mi = pd.MultiIndex.from_product([df1['ID'].unique(), df1['Week'].unique()],
                                names=['ID', 'Week'])

out = df1.loc[df1['CloseContact'].notna()] \
         .groupby(['ID', 'Week'])['CloseContact'] \
         .apply(lambda x: x.head(2).tolist()) \
         .apply(pd.Series).rename(columns={0: 'CC1', 1: 'CC2'}) \
         .reindex(mi).reset_index()

Output:

>>> out
  ID   Week    CC1    CC2
0  A  Week1   True    NaN
1  A  Week2  False  False
2  A  Week3   True    NaN
3  C  Week1  False    NaN
4  C  Week2   True   True
5  C  Week3    NaN    NaN
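
A note on `pd.MultiIndex.from_product`: it builds every ID × Week combination, so a pair that never occurs in the data (C/Week3 does not appear in the question's 17-row sample) still shows up as an all-NaN row. If only the pairs that actually occur are wanted, one possible variation is to number the non-null values within each group with `cumcount` and left-merge onto the observed pairs. This is a sketch under that assumption; the function name `first_two_per_group` and the `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def first_two_per_group(df, keys=('ID', 'Week'), col='CloseContact'):
    """Return one row per observed key combination with the first two
    non-null values of `col` in columns CC1 and CC2."""
    keys = list(keys)
    # Keep rows that have a value and number them 0, 1, 2, ... within each group
    nonnull = df.dropna(subset=[col])
    pos = nonnull.groupby(keys).cumcount()
    keep = pos < 2
    top2 = nonnull.loc[keep, keys + [col]].assign(slot=pos[keep] + 1)
    # Move the slot number (1 or 2) into the columns
    wide = top2.set_index(keys + ['slot'])[col].unstack('slot')
    # Guarantee both slots exist, then give them their final names
    wide = wide.reindex(columns=[1, 2]).rename(columns={1: 'CC1', 2: 'CC2'})
    wide.columns.name = None
    # Left-merge onto the observed pairs so groups without values survive as NaN
    groups = df[keys].drop_duplicates()
    return groups.merge(wide.reset_index(), on=keys, how='left')


# Hypothetical demo data: A/Week1 has three values, C/Week1 has none
demo = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'A', 'A', 'C', 'C'],
    'Week': ['Week1', 'Week1', 'Week1', 'Week1', 'Week2', 'Week1', 'Week1'],
    'CloseContact': [np.nan, 'True', 'False', 'True', 'False', np.nan, np.nan],
})

result = first_two_per_group(demo)
print(result)
```

The trade-off is that this version never invents ID/Week pairs, whereas the `from_product` approach is preferable when a complete grid of weeks per ID is the desired output.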