Pandas: replace NaN values with a random sample of values conditional on another column

Question

Say I have a dataframe like so:

import pandas as pd
import numpy as np

np.random.seed(0)

df = {}
df['x'] = np.concatenate([np.random.uniform(0, 5, 4), np.random.uniform(5, 10, 4)])
df['y'] = np.concatenate([[0] * 4, [1] * 4])
df = pd.DataFrame(df)

df.loc[len(df) + 1] = [np.nan, 0]
df.loc[len(df) + 1] = [np.nan, 1]
df
Out[232]: 
           x    y
0   2.744068  0.0
1   3.575947  0.0
2   3.013817  0.0
3   2.724416  0.0
4   7.118274  1.0
5   8.229471  1.0
6   7.187936  1.0
7   9.458865  1.0
9        NaN  0.0
10       NaN  1.0

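What I want to do is fill in the NaN values with a random sample of x values, conditional on the y value.

For example, in row 9, where y is 0, I want to replace the NaN with a number sampled at random only from the x values where y is 0. Effectively, I would be sampling from this list: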

df[df['y'] == 0]['x'].dropna().values.tolist()
Out[233]: [2.7440675196366238, 3.5759468318620975, 3.0138168803582195, 2.724415914984484]

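Similarly for row 10, I would sample only from the 'x' values where y is 1, rather than 0. I can't figure out a way to do it programmatically (at least, not in a way that avoids bad practice such as iterating through the dataframe rows).

I've consulted a related question, which shows me how I would randomly sample from all values in a column, but I need the random sample to be conditional on another column's distinct values. I've also seen answers for replacing NaNs with a conditional mean, but I'm looking to sample randomly rather than use the mean.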

Answer 1

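`transform` with `choice`

I gave up efficiency for readability. Note that I generate a random choice for every row but keep only the ones needed to fill the nulls. In theory, I could draw random numbers only for the missing values.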

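def f(s):
    # True where this group's x is missing
    mask = s.isna()
    # Draw one candidate per row from the non-null values; keep it only where x is missing
    return np.where(mask, np.random.choice(s[~mask], len(s)), s)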

df.assign(x=df.groupby('y')['x'].transform(f))

           x    y
0   2.744068  0.0  # <━┓
1   3.575947  0.0  #   ┃
2   3.013817  0.0  #   ┃
3   2.724416  0.0  #   ┃
4   7.118274  1.0  #   ┃
5   8.229471  1.0  # <━╋━┓
6   7.187936  1.0  #   ┃ ┃
7   9.458865  1.0  #   ┃ ┃
9   2.744068  0.0  # <━┛ ┃
10  8.229471  1.0  # <━━━┛

Slightly more obtuse, but it draws only as many values as we need.

def f(s):
    out = s.to_numpy().copy()
    mask = s.isna().to_numpy()
    # Draw exactly as many values as there are gaps, from the non-null values only
    out[mask] = np.random.choice(out[~mask], mask.sum())
    return out

df.assign(x=df.groupby('y')['x'].transform(f))

           x    y
0   2.744068  0.0  # <━┓
1   3.575947  0.0  #   ┃
2   3.013817  0.0  #   ┃
3   2.724416  0.0  #   ┃
4   7.118274  1.0  # <━╋━┓
5   8.229471  1.0  #   ┃ ┃
6   7.187936  1.0  #   ┃ ┃
7   9.458865  1.0  #   ┃ ┃
9   2.744068  0.0  # <━┛ ┃
10  7.118274  1.0  # <━━━┛
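
Both versions above use the global `np.random` state and assume every group has at least one non-null value to sample from. As a minimal sketch of a variant, not part of the original answer, the same idea can be written with a seeded `np.random.default_rng` generator and a guard for groups that are entirely NaN. The helper name `fill_from_group`, the seed, and the small sample frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def fill_from_group(s, rng):
    """Fill NaNs in one group's Series with random draws from its non-null values."""
    out = s.to_numpy(dtype=float, copy=True)
    mask = np.isnan(out)
    pool = out[~mask]
    # Only sample when there is something to fill and something to draw from;
    # a group with no non-null values is returned unchanged (still NaN).
    if mask.any() and pool.size > 0:
        out[mask] = rng.choice(pool, size=mask.sum())
    return pd.Series(out, index=s.index)


rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Hypothetical sample data: group 2 has no non-null x values at all.
df = pd.DataFrame({
    'x': [2.0, 3.0, np.nan, 7.0, 9.0, np.nan, np.nan],
    'y': [0, 0, 0, 1, 1, 1, 2],
})

filled = df.assign(
    x=df.groupby('y')['x'].transform(lambda s: fill_from_group(s, rng))
)
print(filled)
```

With this sketch, the NaN in group 0 is replaced by either 2.0 or 3.0, the NaN in group 1 by either 7.0 or 9.0, and the lone row in group 2 stays NaN because there is nothing to sample from. Whether leaving such rows untouched is the right behaviour depends on the data; falling back to the whole column would be another option.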
