Rank NaN values based on group

I have a DataFrame with a column d1, and I am trying to compute an 'out' column that ranks the rows where d1 holds 'nan'.

    import numpy as np
    import pandas as pd

    data_input = {'Name':['Renault', 'Renault', 'Renault', 'Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault'],
                  'type':['Duster', 'Duster', 'Duster','Duster','Duster','Duster','Duster','Triber','Triber','Triber','Triber','Triber','Triber','Triber'],
                  'd1':['nan','10','10','10','nan','nan','20','20','nan','nan','30','30','30','nan']}

    df_input = pd.DataFrame(data_input)

    # Expected result: 'out' is set only on the rows where d1 is 'nan'
    data_out = {'Name':['Renault', 'Renault', 'Renault', 'Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault'],
                'type':['Duster', 'Duster', 'Duster','Duster','Duster','Duster','Duster','Triber','Triber','Triber','Triber','Triber','Triber','Triber'],
                'd1':['nan','10','10','10','nan','nan','20','20','nan','nan','30','30','30','nan'],
                'out':[1,np.nan,np.nan,np.nan,2,2,np.nan,np.nan,1,1,np.nan,np.nan,np.nan,2]}

    df_out = pd.DataFrame(data_out)

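Within a given group, if nan appears both before and after some values, the ranks should be ascending. For example, index 0 gets rank 1, and indexes 4 and 5 both get rank 2 (since there is no later value in that group).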

    df_out["out"] = df_out.groupby(["Name","type"])['d1'].rank(method="first")

Use GroupBy.cumsum to number the runs of consecutive missing values within each group:

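    # Convert d1 to numeric so the 'nan' strings become real missing values
    df_out['d1'] = pd.to_numeric(df_out['d1'], errors='coerce')

    # Mask of the rows where d1 is missing
    m = df_out['d1'].isna()

    # 'a' is True on the first row of each run of consecutive missing values;
    # the per-group cumulative sum numbers the runs, and where(m) keeps the
    # result only on the missing rows
    df_out["out1"] = (df_out.assign(a = (m & ~m.shift(fill_value=False)))
                            .groupby(["Name","type"])['a']
                            .cumsum()
                            .where(m))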
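To see why this works, it can help to look at the intermediate steps separately. The following is a minimal, self-contained sketch on the question's data; the names `is_nan`, `run_start` and `run_id` are made up here for illustration and do not appear in the answer above, and the explicit `astype(int)` is an extra precaution rather than something the answer requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Renault"] * 14,
    "type": ["Duster"] * 7 + ["Triber"] * 7,
    "d1": ["nan", "10", "10", "10", "nan", "nan", "20",
           "20", "nan", "nan", "30", "30", "30", "nan"],
})

# Turn the 'nan' strings into real missing values
df["d1"] = pd.to_numeric(df["d1"], errors="coerce")

# Step 1: flag the missing rows
is_nan = df["d1"].isna()

# Step 2: flag only the first row of each run of consecutive missing rows
run_start = is_nan & ~is_nan.shift(fill_value=False)

# Step 3: count the run starts seen so far within each (Name, type) group
run_id = run_start.astype(int).groupby([df["Name"], df["type"]]).cumsum()

# Step 4: keep the run number only on the missing rows
df["out1"] = run_id.where(is_nan)

print(df)
```

Because `where` introduces NaN on the non-missing rows, `out1` comes back as a float column.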

An alternative solution using boolean indexing:

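    # Filter to the missing rows first; the assignment aligns on the index,
    # so the remaining rows of out1 are left as NaN
    df_out["out1"] = (df_out.assign(a = (m & ~m.shift(fill_value=False)))[m]
                            .groupby(["Name","type"])['a']
                            .cumsum())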
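One caveat that may be worth checking on real data: `m.shift()` is computed over the whole column, not per group. If one group ends with a missing value and the next group begins with one, the first run of the second group is not flagged as a new start and would get 0 instead of 1. A possible variant, sketched here on a small made-up frame, shifts within each group instead. The sample data and the names `prev_nan` and `start` are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Made-up data: group A ends with NaN and group B starts with NaN
df = pd.DataFrame({
    "type": ["A", "A", "B", "B", "B"],
    "d1": [10, np.nan, np.nan, 5, np.nan],
})

m = df["d1"].isna()

# Previous row's missing flag, looked up within each group only;
# astype(bool) guards against a non-boolean dtype after the shift
prev_nan = m.groupby(df["type"]).shift(fill_value=False).astype(bool)

# First row of each run of missing values, per group
start = m & ~prev_nan

# Number the runs within each group and keep them on the missing rows only
df["out1"] = start.astype(int).groupby(df["type"]).cumsum().where(m)

print(df)
```

For the question's data, the same idea would group by `[df["Name"], df["type"]]` rather than by `type` alone. It assumes, as the original solution does, that the rows are already in the order in which the runs should be counted.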