Fill missing data with random values from categorical column - Python

Question

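I am working on a hotel booking dataset. The DataFrame has a discrete numeric column called 'agent' with 13.7% missing values. My first instinct was to drop the rows with missing values, but since that is a sizeable share of the data, I now want to use random sample imputation to replace them proportionally with the existing categorical values.

My code is: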

new_agent = hotel['agent'].dropna()

agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))

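Result

The first 3 rows were NaN, but they have now been replaced with something unexpected. Is there something wrong with my code, maybe in the lambda syntax?

Update: thanks to ti7 for helping me solve the problem: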

new_agent = hotel['agent'].dropna() #get a series of just the available values

n_null = hotel['agent'].isnull().sum() #length of the missing entries

new_agent.sample(n_null,replace=True).values #sample it with repetition and get values

hotel.loc[hotel['agent'].isnull(),'agent']=new_agent.sample(n_null,replace=True).values #fill and replace

Answer 1

.fillna() naively assigns your function to the missing values. It can do this because functions are really objects!
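To make the "functions are objects" point concrete, here is a minimal sketch in plain Python, without pandas. The names `available`, `column`, and `pick` are made up for illustration. It only shows the difference between storing a function and storing the result of calling it, which is the same distinction behind the behaviour described above.

```python
import random

available = [1.0, 2.0, 10.0]           # hypothetical observed values
column = [1.0, 2.0, None, None, 10.0]  # None marks a missing entry

def pick():
    """Return one randomly chosen observed value."""
    return random.choice(available)

# Passing the function itself: every gap now holds the function object.
stored = [pick if v is None else v for v in column]

# Calling the function: every gap holds a freshly drawn value.
drawn = [pick() if v is None else v for v in column]

print(callable(stored[2]))    # True
print(drawn[2] in available)  # True
```

Nothing ever calls the function in the first case, so no random value is drawn.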

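You probably want to build a new Series of random values drawn from the current Series (you know how many you need by subtracting the lengths) and use it for the missing values.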

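get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) into a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign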
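The count of missing entries used in the steps above can be obtained in two ways that should agree: by subtracting lengths, or by summing the boolean mask. A small sketch, assuming a toy Series with the same values as the demo further down:

```python
import pandas as pd

# Toy data: two of the five entries are missing.
agent = pd.Series([1, 2, None, None, 10], name="agent")

# Subtract the length after dropping NaN from the full length ...
n_by_lengths = len(agent) - len(agent.dropna())

# ... or count the True entries in the missing-value mask.
n_by_mask = int(agent.isna().sum())

print(n_by_lengths, n_by_mask)  # 2 2
```

Either number can be passed to .sample() as the number of values to draw.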

Quick code

df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),  # get the same number of values as are missing
    replace=True               # repeat values
).values                       # throw out the index
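If the fill has to be repeatable, the same one-liner can be wrapped in a small helper that forwards a seed to the `random_state` parameter of .sample(). This is only a sketch: the function name `fill_na_by_sampling` and its `seed` argument are made up here and are not part of pandas.

```python
import pandas as pd

def fill_na_by_sampling(df, column, seed=None):
    """Return a copy of df with NaN in `column` replaced by random observed values."""
    missing = df[column].isna()
    n_missing = int(missing.sum())
    out = df.copy()  # leave the caller's DataFrame untouched
    if n_missing == 0:
        return out   # nothing to fill
    draws = (
        out[column]
        .dropna()                                            # observed values only
        .sample(n_missing, replace=True, random_state=seed)  # draw with repetition
        .values                                              # throw out the index
    )
    out.loc[missing, column] = draws
    return out

df = pd.DataFrame({"agent": [1, 2, None, None, 10], "b": [3, 4, 5, 6, 7]})
filled = fill_na_by_sampling(df, "agent", seed=0)
print(filled)
```

Calling the helper twice with the same seed should give the same filled column, while `seed=None` draws fresh values each time.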

Demo

>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
   agent  b
0    1.0  3
1    2.0  4
2    NaN  5
3    NaN  6
4   10.0  7

>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])

>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
...     df["agent"].isna().sum(),
...     replace=True
... ).values
>>> df
   agent  b
0    1.0  3
1    2.0  4
2   10.0  5
3    2.0  6
4   10.0  7
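Since the question asks for the missing values to be replaced proportionally, one way to check the outcome is to compare the value shares before and after filling. The sketch below uses synthetic data, not the hotel dataset: the values, their probabilities, and the seeds are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic "agent" column with roughly 13.7% of entries set to NaN.
values = rng.choice([1.0, 2.0, 10.0], size=n, p=[0.6, 0.3, 0.1])
agent = pd.Series(values).mask(rng.random(n) < 0.137)

# Shares among observed values (value_counts skips NaN by default).
before = agent.value_counts(normalize=True).sort_index()

# Same fill as in the answer, with a fixed seed for repeatability.
missing = agent.isna()
filled = agent.copy()
filled.loc[missing] = agent.dropna().sample(
    int(missing.sum()), replace=True, random_state=0
).values

after = filled.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"before": before, "after": after}))
```

With this many rows the two columns should be close, because each draw follows the observed distribution. On a small column the shares can drift more.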
