How to group a DataFrame by two columns and handle missing values in pandas?

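I want to split a dataset of about 1 million rows by ID (each ID occurs multiple times) and by mode type (active, inactive). Within each group, missing values in numeric columns should be interpolated and missing categorical values should be filled with ffill. Finally, any rows that still contain null values should be dropped. To do this I wrote the following function: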

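# Columns to forward-fill (object/datetime) and columns to interpolate (float).
objectList = list(df_sorted.select_dtypes(include=["O", "datetime64[ns]"]).columns)
floatList = list(df_sorted.select_dtypes(include=["float64"]).columns)

def fill_missing_values(df_group):
    df_group[objectList] = df_group[objectList].ffill()
    df_group[floatList] = df_group[floatList].interpolate(
        method="linear", limit_direction="forward"
    )
    # dropna() returns a new DataFrame, so its result must be returned;
    # calling it without using the result has no effect.
    return df_group.dropna()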

The function is then applied as follows:

df_nn = df_sorted.groupby(["ID", "Mode"]).apply(fill_missing_values)

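The cell runs without errors, but it takes far too long to produce output. So my question is: is this approach correct in general, or am I missing something? And how can I make this code faster?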

Input data

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["0A", "0A", "0A", "0A", "0A", "1C", "1C", "1C", "1C"],
        "MODE": [
            "active",
            "active",
            "active",
            "inactive",
            "inactive",
            "active",
            "active",
            "active",
            "inactive",
        ],
        "Signal1": [13, np.nan, 4, 11, np.nan, 22, 25, np.nan, 19],
        "Signal2": [np.nan, 0.1, 0.3, "NaN", 4.5, "NaN", 2.0, 3.0, np.nan],
        "Signal3": ["on", np.nan, np.nan, "off", np.nan, "on", np.nan, "on", np.nan],
    }
)

df

    ID  MODE     Signal1  Signal2  Signal3
0   0A  active   13       NaN      on
1   0A  active   NaN      0.1      NaN
2   0A  active   4        0.3      NaN
3   0A  inactive 11       NaN      off
4   0A  inactive NaN      4.5      NaN
5   1C  active   22       NaN      on
6   1C  active   25       2.0      NaN
7   1C  active   NaN      3.0      on
8   1C  inactive 19       NaN      NaN

Desired output for ID "0A" after filling and interpolating:

    ID  MODE      Signal1     Signal2   Signal3
0   0A  active    13.0        NaN       on
1   0A  active    8.5         0.1       on
2   0A  active    4.0         0.3       on
3   0A  inactive  11.0        NaN       off
4   0A  inactive  11.0        4.5       off

Desired output for ID "0A" after dropna:

    ID  MODE    Signal1  Signal2    Signal3
0   0A  active  8.5      0.1        on
1   0A  active  4.0      0.3        on

    ID  MODE      Signal1    Signal2    Signal3
0   0A  inactive  11         4.5        off

IIUC, you want to:

  1. group by the ID and MODE columns and interpolate all numeric columns
  2. group by the ID and MODE columns and forward-fill all non-numeric columns
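
import numpy as np
import pandas as pd

# Replace the string "NaN" with numpy.nan, then let pandas re-infer dtypes
# so that Signal2 is treated as a float column.
df = df.replace("NaN", np.nan).infer_objects()

# Numeric Signal columns are interpolated; all other Signal columns are forward-filled.
numeric = df.filter(like="Signal").select_dtypes(include=np.number).columns
others = df.filter(like="Signal").select_dtypes(exclude=np.number).columns

df[numeric] = df.groupby(["ID", "MODE"])[numeric].transform(pd.Series.interpolate, limit_direction="forward")
df[others] = df.groupby(["ID", "MODE"])[others].transform("ffill")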

>>> df
   ID      MODE  Signal1  Signal2 Signal3
0  0A    active     13.0      NaN      on
1  0A    active      8.5      0.1      on
2  0A    active      4.0      0.3      on
3  0A  inactive     11.0      NaN     off
4  0A  inactive     11.0      4.5     off
5  1C    active     22.0      NaN      on
6  1C    active     25.0      2.0      on
7  1C    active     25.0      3.0      on
8  1C  inactive     19.0      NaN     NaN

>>> df.dropna()
   ID      MODE  Signal1  Signal2 Signal3
1  0A    active      8.5      0.1      on
2  0A    active      4.0      0.3      on
4  0A  inactive     11.0      4.5     off
6  1C    active     25.0      2.0      on
7  1C    active     25.0      3.0      on
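
For reuse, the steps above can be wrapped in a single function. This is only a minimal sketch on a reduced version of the sample data: `fill_by_group` and its `keys` parameter are hypothetical names chosen for illustration (they are not part of pandas), and it assumes every column other than the keys should be filled. It relies on `transform` rather than `groupby.apply` with a function over whole sub-frames, which is usually the slow part on large inputs.

```python
import numpy as np
import pandas as pd


def fill_by_group(df, keys):
    """Interpolate numeric columns and ffill the rest within each group.

    `fill_by_group` and `keys` are illustrative names (assumptions).
    """
    out = df.copy()
    value_cols = out.columns.difference(keys)
    numeric = out[value_cols].select_dtypes(include=np.number).columns
    others = out[value_cols].select_dtypes(exclude=np.number).columns

    if len(numeric):
        # Linear interpolation per group; leading NaNs stay NaN.
        out[numeric] = df.groupby(keys)[numeric].transform(
            lambda s: s.interpolate(limit_direction="forward")
        )
    if len(others):
        # Forward-fill per group, never across group boundaries.
        out[others] = df.groupby(keys)[others].transform("ffill")

    # Drop rows that still contain missing values.
    return out.dropna()


sample = pd.DataFrame(
    {
        "ID": ["0A", "0A", "0A", "0A", "0A"],
        "MODE": ["active", "active", "active", "inactive", "inactive"],
        "Signal1": [13, np.nan, 4, 11, np.nan],
        "Signal2": [np.nan, 0.1, 0.3, np.nan, 4.5],
        "Signal3": ["on", np.nan, np.nan, "off", np.nan],
    }
)

result = fill_by_group(sample, ["ID", "MODE"])
print(result)
```

If one frame per (ID, MODE) pair is needed afterwards, iterating over `result.groupby(["ID", "MODE"])` yields each key together with its sub-frame.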

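First, fill Signal1 with the mean of its group:

# transform keeps the original index, so the result aligns with df on assignment
df['Signal1'] = df.groupby(['ID', 'MODE'])['Signal1'].transform(lambda x: x.fillna(x.mean()))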
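
A variant of the group-mean fill that may be faster on large frames avoids the Python lambda: compute the per-group mean with the built-in "mean" transform and pass the result to fillna. This is a sketch on a reduced version of the sample data, not a benchmarked result; `group_mean` is just an illustrative variable name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["0A", "0A", "0A", "0A", "0A"],
        "MODE": ["active", "active", "active", "inactive", "inactive"],
        "Signal1": [13, np.nan, 4, 11, np.nan],
    }
)

# Group mean broadcast back to every row (NaNs are skipped by mean).
group_mean = df.groupby(["ID", "MODE"])["Signal1"].transform("mean")

# Fill only the missing entries; existing values are left untouched.
df["Signal1"] = df["Signal1"].fillna(group_mean)
print(df)
```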

Next, extract the known Signal3 value for each group and merge it back:

signal3 = df[['ID','MODE','Signal3']].dropna().drop_duplicates()
signal3 = signal3.rename(columns={'Signal3':'Signal3_new'})
df2 = pd.merge(df,signal3, how='left', on=['ID','MODE'])

Then fill the missing values in Signal3 from Signal3_new, or simply use Signal3_new in its place.
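
That last step could look like the following sketch. It assumes each (ID, MODE) group has at most one distinct non-missing Signal3 value; if a group had several, the merge would duplicate rows. The data is a reduced version of the sample.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["0A", "0A", "0A", "0A", "0A"],
        "MODE": ["active", "active", "active", "inactive", "inactive"],
        "Signal3": ["on", np.nan, np.nan, "off", np.nan],
    }
)

# One row per (ID, MODE) holding the known Signal3 value.
signal3 = (
    df[["ID", "MODE", "Signal3"]]
    .dropna()
    .drop_duplicates()
    .rename(columns={"Signal3": "Signal3_new"})
)
df2 = pd.merge(df, signal3, how="left", on=["ID", "MODE"])

# Fill gaps in Signal3 from the merged column, then drop the helper column.
df2["Signal3"] = df2["Signal3"].fillna(df2["Signal3_new"])
df2 = df2.drop(columns="Signal3_new")
print(df2)
```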