Fill missing values using the MissForest algorithm on each group in a column in Python

Question

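I have time-series data with missing values for about 4000 patients, and I want to impute the NaN values in Python with the MissForest algorithm, separately for each patient.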

The data looks like this:

HR	Resp	P_ID
72.0	18.0	1
NaN	15.0	1
80.0	NaN	1
NaN	16.0	1
79.5	NaN	1
NaN	19.0	2
79.5	22.5	2
NaN	NaN	2
NaN	16.0	2
85.0	NaN	3
NaN	14.5	3
76.4	NaN	3
NaN	NaN	4
80.5	19.5	4
75.3	18.0	4
NaN	21.5	4

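Now I want to impute the NaN values in each column using only that patient's data, grouped by P_ID: first P_ID = 1, then P_ID = 2, and so on, rather than imputing over the whole column. The code I am using imputes the NaN values from the entire column across all patients, instead of one patient's rows at a time.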

imputer = MissForest(max_iter=12, n_jobs=-1)
X_imputed = imputer.fit_transform(df)
df1 = pd.DataFrame(X_imputed)
df1.head()

I did within-patient mean imputation with the following code, but I can't figure out how to do the same with MissForest.

for i in ['HR','Resp']:
    df[i] = df[i].fillna(df.groupby('P_ID')[i].transform('mean'))

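One solution would be to create 4000 dataframes, one per patient, impute each with MissForest, and then combine them. That would be tedious, so I would like a solution that iterates over the whole dataframe. Please help. Thanks.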

Answer 1

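You can loop over the patient IDs and apply MissForest only to the rows of each patient (the column is called ID here; use P_ID for the data in the question):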

for idx in df["ID"].unique():
    mask = df["ID"] == idx  # rows belonging to the current patient
    # If "Resp" is entirely NaN for this patient, fill it with 0
    # so the imputer has observed values to work with
    if df.loc[mask, "Resp"].isna().all():
        df.loc[mask, "Resp"] = 0
    # Fit a fresh imputer on this patient's rows only
    imputer = MissForest(max_iter=12, n_jobs=-1)
    x_imp = imputer.fit_transform(df[mask])
    df.loc[mask, :] = x_imp

This gives you:

|    |      HR |    Resp |   ID |
|---:|--------:|--------:|-----:|
|  0 | 72      | 18      |    1 |
|  1 | 79.5942 | 15      |    1 |
|  2 | 80      | 15.4617 |    1 |
|  3 | 79.5942 | 16      |    1 |
|  4 | 79.5    | 15.4617 |    1 |
|  5 | 79.5    | 19      |    2 |
|  6 | 79.5    | 22.5    |    2 |
|  7 | 79.5    | 18.9217 |    2 |
|  8 | 79.5    | 16      |    2 |
|  9 | 85      | 14.5    |    3 |
| 10 | 80.786  | 14.5    |    3 |
| 11 | 76.4    | 14.5    |    3 |
| 12 | 79.148  | 20.885  |    4 |
| 13 | 80.5    | 19.5    |    4 |
| 14 | 75.3    | 18      |    4 |
| 15 | 79.148  | 21.5    |    4 |
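
The per-patient loop in this answer can also be wrapped in a reusable helper. The following is only a minimal sketch and not part of the original answer: `impute_by_group` and `ColumnMeanImputer` are hypothetical names, and the mean-based imputer is a stand-in so the example runs without MissForest installed. It assumes the dataframe has a unique index and that the imputer exposes `fit_transform`, as in the question. Columns that are entirely NaN within a group are left untouched here rather than filled with 0.

```python
import numpy as np
import pandas as pd


class ColumnMeanImputer:
    """Hypothetical stand-in imputer: fills NaN with the column mean."""

    def fit_transform(self, X):
        X = np.asarray(X, dtype=float)
        col_means = np.nanmean(X, axis=0)
        return np.where(np.isnan(X), col_means, X)


def impute_by_group(df, group_col, value_cols, make_imputer):
    """Impute value_cols separately within each group of group_col.

    make_imputer is a zero-argument callable returning a fresh imputer,
    so no fitted state is shared between groups.
    """
    out = df.copy()
    for _, group in df.groupby(group_col):
        block = group[value_cols]
        # Columns with no observed value in this group cannot be imputed
        usable = [c for c in value_cols if block[c].notna().any()]
        # Nothing to do if there are no usable columns or no gaps
        if not usable or not block[usable].isna().any().any():
            continue
        imputed = make_imputer().fit_transform(block[usable])
        out.loc[group.index, usable] = imputed
    return out


# Small sample in the shape of the question's data (patients 1 and 2)
df = pd.DataFrame({
    "HR":   [72.0, np.nan, 80.0, np.nan, 79.5, np.nan, 79.5, np.nan, np.nan],
    "Resp": [18.0, 15.0, np.nan, 16.0, np.nan, 19.0, 22.5, np.nan, 16.0],
    "P_ID": [1, 1, 1, 1, 1, 2, 2, 2, 2],
})

result = impute_by_group(df, "P_ID", ["HR", "Resp"], ColumnMeanImputer)
print(result)
```

To use MissForest instead of the stand-in, passing something like `lambda: MissForest(max_iter=12, n_jobs=-1)` as `make_imputer` should work, assuming the class behaves as it does in the question.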

python

time-series

dataframe

pandas

fillna