How to group a df by two column values and handle missing values in pandas?
I want to split a dataset of ~1 million rows by ID (each ID occurs multiple times) and by mode type (active, inactive). While splitting, missing values in the numeric columns should be interpolated and categorical values should be forward-filled (ffill). Finally, the remaining NaN values are dropped. For this I wrote the following function:
objectList = list(df_sorted.select_dtypes(include=["O", "datetime64[ns]"]).columns)
floatList = list(df_sorted.select_dtypes(include=["float64"]).columns)

def fill_missing_values(df_group):
    df_group[objectList] = df_group[objectList].ffill()
    df_group[floatList] = df_group[floatList].interpolate(
        method="linear", limit_direction="forward"
    )
    df_group.dropna()
    return df_group
The function is then applied as follows:
df_nn = df_sorted.groupby(["ID", "Mode"]).apply(
    lambda df_sorted: fill_missing_values(df_sorted)
)
The cell executes without errors, but producing the output takes far too long. So my questions are: is this approach correct overall, or am I missing something? And how can this code be made faster?
Input data:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["0A", "0A", "0A", "0A", "0A", "1C", "1C", "1C", "1C"],
        "MODE": [
            "active",
            "active",
            "active",
            "inactive",
            "inactive",
            "active",
            "active",
            "active",
            "inactive",
        ],
        "Signal1": [13, np.nan, 4, 11, np.nan, 22, 25, np.nan, 19],
        "Signal2": [np.nan, 0.1, 0.3, "NaN", 4.5, "NaN", 2.0, 3.0, np.nan],
        "Signal3": ["on", np.nan, np.nan, "off", np.nan, "on", np.nan, "on", np.nan],
    }
)
df
ID MODE Signal1 Signal2 Signal3
0 0A active 13 NaN on
1 0A active NaN 0.1 NaN
2 0A active 4 0.3 NaN
3 0A inactive 11 NaN off
4 0A inactive NaN 4.5 NaN
5 1C active 22 NaN on
6 1C active 25 2.0 NaN
7 1C active NaN 3.0 on
8 1C inactive 19 NaN NaN
Desired output after filling and interpolating for ID "0A":
ID MODE Signal1 Signal2 Signal3
0 0A active 13.0 NaN on
1 0A active 8.5 0.1 on
2 0A active 4.0 0.3 on
3 0A inactive 11.0 NaN off
4 0A inactive 11.0 4.5 off
Desired output after dropna for ID "0A" (one frame per mode):
ID MODE Signal1 Signal2 Signal3
0 0A active 8.5 0.1 on
1 0A active 4.0 0.3 on
ID MODE Signal1 Signal2 Signal3
0 0A inactive 11 4.5 off
IIUC, you want to:

- groupby the ID and MODE columns and interpolate all numeric columns
- groupby the ID and MODE columns and ffill all non-numeric columns
import numpy as np
import pandas as pd

# replace the string "NaN" with a real numpy.nan
df = df.replace("NaN", np.nan)

numeric = df.filter(like="Signal").select_dtypes(np.number).columns
others = df.filter(like="Signal").select_dtypes(None, np.number).columns  # i.e. exclude=np.number

df[numeric] = df.groupby(["ID", "MODE"])[numeric].transform(
    pd.Series.interpolate, limit_direction="forward"
)
df[others] = df.groupby(["ID", "MODE"])[others].transform("ffill")
>>> df
ID MODE Signal1 Signal2 Signal3
0 0A active 13.0 NaN on
1 0A active 8.5 0.1 on
2 0A active 4.0 0.3 on
3 0A inactive 11.0 NaN off
4 0A inactive 11.0 4.5 off
5 1C active 22.0 NaN on
6 1C active 25.0 2.0 on
7 1C active 25.0 3.0 on
8 1C inactive 19.0 NaN NaN
>>> df.dropna()
ID MODE Signal1 Signal2 Signal3
1 0A active 8.5 0.1 on
2 0A active 4.0 0.3 on
4 0A inactive 11.0 4.5 off
6 1C active 25.0 2.0 on
7 1C active 25.0 3.0 on
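Since the question also asks to split the dataset, the vectorized result can be partitioned into one cleaned frame per (ID, MODE) pair afterwards. A minimal sketch of the full pipeline on the sample data (the `groups` dict and variable names are my own, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["0A", "0A", "0A", "0A", "0A", "1C", "1C", "1C", "1C"],
    "MODE": ["active", "active", "active", "inactive", "inactive",
             "active", "active", "active", "inactive"],
    "Signal1": [13, np.nan, 4, 11, np.nan, 22, 25, np.nan, 19],
    "Signal2": [np.nan, 0.1, 0.3, np.nan, 4.5, np.nan, 2.0, 3.0, np.nan],
    "Signal3": ["on", np.nan, np.nan, "off", np.nan, "on", np.nan, "on", np.nan],
})

key = ["ID", "MODE"]
numeric = df.select_dtypes(np.number).columns
others = [c for c in df.columns if c not in numeric and c not in key]

# Vectorized cleaning per group, as in the answer above
df[numeric] = df.groupby(key)[numeric].transform(
    pd.Series.interpolate, limit_direction="forward"
)
df[others] = df.groupby(key)[others].ffill()

# One cleaned frame per (ID, MODE) pair, dropping rows that still contain NaN
groups = {k: g.dropna() for k, g in df.groupby(key)}
```

This keeps all the NaN handling in vectorized groupby calls and only iterates groups once at the very end, which should scale far better to ~1M rows than the per-group apply in the question.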
First fill Signal1 with the mean of each (ID, MODE) group (transform keeps the original row index, so the result assigns back cleanly):

df['Signal1'] = df.groupby(['ID', 'MODE'])['Signal1'].transform(lambda x: x.fillna(x.mean()))
Next, groupby to extract one Signal3 value per group and merge it back:
signal3 = df[['ID','MODE','Signal3']].dropna().drop_duplicates()
signal3 = signal3.rename(columns={'Signal3':'Signal3_new'})
df2 = pd.merge(df,signal3, how='left', on=['ID','MODE'])
Then fill the missing values of Signal3 from Signal3_new, or simply use Signal3_new instead.
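That final fill step can be sketched as follows (a minimal reconstruction on a toy frame; Signal3_new comes from the rename above, and this assumes Signal3 takes a single value per (ID, MODE) group, otherwise the merge duplicates rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["0A", "0A", "0A"],
    "MODE": ["active", "active", "inactive"],
    "Signal3": ["on", np.nan, "off"],
})

# Steps from the answer: one (ID, MODE, Signal3) row per group, merged back
signal3 = df[["ID", "MODE", "Signal3"]].dropna().drop_duplicates()
signal3 = signal3.rename(columns={"Signal3": "Signal3_new"})
df2 = pd.merge(df, signal3, how="left", on=["ID", "MODE"])

# Fill the gaps in Signal3 from Signal3_new, then drop the helper column
df2["Signal3"] = df2["Signal3"].fillna(df2["Signal3_new"])
df2 = df2.drop(columns="Signal3_new")
```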