Rank NaN values based on group
I have a DataFrame with a d1 column, and I am trying to compute an 'out' column that ranks the rows where d1 is 'nan'.
import numpy as np
import pandas as pd
data_input = {'Name':['Renault', 'Renault', 'Renault', 'Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault'],
'type':['Duster', 'Duster', 'Duster','Duster','Duster','Duster','Duster','Triber','Triber','Triber','Triber','Triber','Triber','Triber'],
'd1':['nan','10','10','10','nan','nan','20','20','nan','nan','30','30','30','nan']}
df_input = pd.DataFrame(data_input)
data_out = {'Name':['Renault', 'Renault', 'Renault', 'Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault','Renault'],
'type':['Duster', 'Duster', 'Duster','Duster','Duster','Duster','Duster','Triber','Triber','Triber','Triber','Triber','Triber','Triber'],
'd1':['nan','10','10','10','nan','nan','20','20','nan','nan','30','30','30','nan'],
'out':[1,np.nan,np.nan,np.nan,2,2,np.nan,np.nan,1,1,np.nan,np.nan,np.nan,2]}
df_out = pd.DataFrame(data_out)
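Within a given group, if nan appears both before and after some values, the ranks should be assigned in ascending order.
For example: index 0 gets rank 1, and indexes 4 and 5 get rank 2 (because there are no later values in that group).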
df_out["out"] = df_out.groupby(["Name","type"])['d1'].rank(method="first")
Use GroupBy.cumsum to number the runs of consecutive missing values within each group:
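# Convert d1 to numeric so the 'nan' strings become real missing values
df_out['d1'] = pd.to_numeric(df_out['d1'], errors='coerce')
# True where d1 is missing
m = df_out['d1'].isna()
# 'a' flags the first row of each run of consecutive missing values;
# its per-group cumulative sum numbers the runs, and where(m) keeps only the missing rows
df_out["out1"] = (df_out.assign(a = (m & ~m.shift(fill_value=False)))
                        .groupby(["Name","type"])['a']
                        .cumsum()
                        .where(m))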
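To make the individual steps easier to follow, here is a self-contained sketch of the same idea on the sample data from the question. The names df, starts and run_id are only illustrative, and the cast to int before the cumulative sum is an extra precaution rather than part of the original answer.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["Renault"] * 14,
    "type": ["Duster"] * 7 + ["Triber"] * 7,
    "d1": ["nan", "10", "10", "10", "nan", "nan", "20",
           "20", "nan", "nan", "30", "30", "30", "nan"],
})

# Step 1: turn the strings into numbers; 'nan' becomes a real NaN
d1 = pd.to_numeric(df["d1"], errors="coerce")

# Step 2: mask of missing rows
m = d1.isna()

# Step 3: a run starts where the row is missing and the previous row is not
starts = m & ~m.shift(fill_value=False)

# Step 4: counting the starts inside each group numbers the runs
run_id = starts.astype(int).groupby([df["Name"], df["type"]]).cumsum()

# Step 5: keep the run number only on the missing rows
df["out1"] = run_id.where(m)

print(df)
```

On this data, out1 matches the 'out' column from the question: 1 at index 0, 2 at indexes 4 and 5, 1 at indexes 8 and 9, and 2 at index 13.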
Alternative solution with boolean indexing:
# Filter to the missing rows first, so no where() is needed afterwards
df_out["out1"] = (df_out.assign(a = (m & ~m.shift(fill_value=False)))[m]
                        .groupby(["Name","type"])['a']
                        .cumsum())
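One caveat that may be worth checking: m.shift looks at the previous row of the whole frame, not of the group. If one group ends with a missing value and the next group begins with one, the first row of the next group is not flagged as the start of a run, so its numbering can begin at 0. The sample data does not hit this case. A possible variant that shifts within each group is sketched below; the small frame and the names keys and prev_missing are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the Duster group ends with NaN and the Triber group starts with NaN
df = pd.DataFrame({
    "Name": ["Renault"] * 6,
    "type": ["Duster"] * 3 + ["Triber"] * 3,
    "d1": [None, 10, None, None, 20, None],
})

keys = [df["Name"], df["type"]]
m = df["d1"].isna()

# Previous-row flag computed inside each group, so a run never leaks across groups.
# Working on ints avoids any dtype surprises when shifting boolean values.
prev_missing = m.astype(int).groupby(keys).shift(fill_value=0).eq(1)

# A run starts where the row is missing and the previous row of the same group is not
starts = m & ~prev_missing

# Number the runs per group and keep the number only on missing rows
df["out1"] = starts.astype(int).groupby(keys).cumsum().where(m)

print(df)
```

With this frame, both groups are numbered 1 and 2 for their first and second runs of missing values.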