A faster way to fill NaN values with the column mean

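I closed my previous question, so I am reposting it with more context. I am running this on a fairly large (59 GB) dataset with shape (800,000, 10,500). Running df.fillna(df.mean()) on my AWS EC2 instance takes a very long time; after 4 hours I cancelled the running cell. Is there a faster way to compute the mean of each column and fill every NaN value in that column with it?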

Here is a sample of the data:

import numpy as np
import pandas as pd

d = {'B19325_038E': {409606: 9.0, 403811: 53.0, 400166: 17.0, 402573: 105.0, 400130: 43.0, 404907: 21.0, 406751: 15.0, 403850: 39.0, 404089: 81.0, 409843: np.nan}, 'B08302_014E': {409606: 2.0, 403811: 156.0, 400166: 64.0, 402573: 211.0, 400130: 140.0, 404907: 90.0, 406751: 148.0, 403850: 71.0, 404089: 341.0, 409843: 91.0}, 'B17010I_026E': {409606: np.nan, 403811: 9.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: 21.0}, 'B17015_009E': {409606: 30.0, 403811: 18.0, 400166: 12.0, 402573: 5.0, 400130: 6.0, 404907: 11.0, 406751: 23.0, 403850: 49.0, 404089: 37.0, 409843: 60.0}, 'B06003_004E': {409606: 1552.0, 403811: 3562.0, 400166: 2536.0, 402573: 4911.0, 400130: 1913.0, 404907: 1888.0, 406751: 4264.0, 403850: 2087.0, 404089: 1443.0, 409843: 867.0}, 'B15001_038E': {409606: 46.0, 403811: 104.0, 400166: 89.0, 402573: 120.0, 400130: 61.0, 404907: 14.0, 406751: 60.0, 403850: 198.0, 404089: 97.0, 409843: 25.0}, 'B08130_006E': {409606: 280.0, 403811: 2325.0, 400166: 1381.0, 402573: 2907.0, 400130: 1300.0, 404907: 1528.0, 406751: 2502.0, 403850: 1278.0, 404089: 1986.0, 409843: 308.0}, 'B19201_002E': {409606: 80.0, 403811: 75.0, 400166: 24.0, 402573: 54.0, 400130: np.nan, 404907: np.nan, 406751: 43.0, 403850: 62.0, 404089: 32.0, 409843: 33.0}, 'B19325_087E': {409606: 35.0, 403811: 29.0, 400166: 33.0, 402573: 72.0, 400130: 20.0, 404907: np.nan, 406751: 39.0, 403850: 40.0, 404089: 40.0, 409843: 5.0}, 'B06003_008E': {409606: 106.0, 403811: 458.0, 400166: 296.0, 402573: 505.0, 400130: 277.0, 404907: 804.0, 406751: 1037.0, 403850: 726.0, 404089: 1854.0, 409843: 80.0}, 'B16006_003E': {409606: 30.0, 403811: 525.0, 400166: 160.0, 402573: 33.0, 400130: 386.0, 404907: 2.0, 406751: 55.0, 403850: 121.0, 404089: 686.0, 409843: 228.0}, 'C14007A_004E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 
'C14007A_005E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C14007A_003E': {409606: np.nan, 403811: np.nan, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: np.nan, 403850: np.nan, 404089: np.nan, 409843: np.nan}, 'C21001I_003E': {409606: 31.0, 403811: 287.0, 400166: 86.0, 402573: 25.0, 400130: 235.0, 404907: 35.0, 406751: 32.0, 403850: 73.0, 404089: 384.0, 409843: 84.0}, 'C21001I_006E': {409606: np.nan, 403811: 35.0, 400166: np.nan, 402573: np.nan, 400130: np.nan, 404907: np.nan, 406751: 13.0, 403850: 17.0, 404089: 19.0, 409843: 6.0}}

df = pd.DataFrame(data=d)

Here is an htop screenshot of my machine, showing its state while running df.fillna(df.mean()):

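As you can see, it appears to be working, but I don't see memory usage fluctuating at all, so it may be frozen. It's hard to tell, and letting it run for more than 4 hours is a waste of money.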

Is there a way to parallelize df.fillna(df.mean()) so that it runs faster?

For more context, here is what I am currently trying, since so far no one seems to know.

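from joblib import Parallel, delayed

def fill_nan(df, col):
    # Fill the NaN values of one column with that column's mean
    df[col].fillna(df[col].mean(), inplace=True)
    return df

col_list = all_data.columns.tolist()
# One task per column; every task receives the full DataFrame
l = Parallel(n_jobs=-1)(delayed(fill_nan)(df=all_data, col=cols) for cols in col_list)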

The problem is that I get this error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11)}

Setting the error aside, would this approach actually make the computation faster?

Can you try this NumPy version?

x = df.values
avg = np.nanmean(x, axis=0)
idx = np.nonzero(np.isnan(x))
x[idx] = np.take(avg, idx[1])

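The NaN values in your DataFrame are updated automatically, because x = df.values is not a copy of your data. Note that this only holds when every column has the same float dtype, so that pandas can hand back a view; with mixed dtypes df.values returns a copy and the DataFrame is left unchanged.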
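If you would rather not rely on df.values being a view, the same NumPy idea can be wrapped so that it returns a new frame. This is only a sketch under stated assumptions: the function name fill_nan_with_column_means and the small sample frame are made up for illustration, and the explicit copy doubles peak memory, which matters at 59 GB.

```python
import warnings

import numpy as np
import pandas as pd


def fill_nan_with_column_means(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with NaNs replaced by column means."""
    # Explicit float64 copy, so the result does not depend on whether
    # the array shares memory with the original frame.
    x = df.to_numpy(dtype="float64", copy=True)

    with warnings.catch_warnings():
        # Columns that are entirely NaN have no mean; np.nanmean warns
        # about an empty slice and returns NaN for them.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        avg = np.nanmean(x, axis=0)

    # Row and column positions of every NaN, then fill each one with
    # the mean of its own column.
    rows, cols = np.nonzero(np.isnan(x))
    x[rows, cols] = avg[cols]
    return pd.DataFrame(x, index=df.index, columns=df.columns)


# Illustrative data only (not the asker's dataset).
sample = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, np.nan],
        "c": [4.0, 5.0, 6.0],
    }
)
filled = fill_nan_with_column_means(sample)
```

Columns that contain only NaN (like C14007A_004E in the sample data) stay NaN, since there is no mean to fill them with.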

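As a rule of thumb, calling .fillna() on every column is more expensive than applying it selectively to the columns that actually contain NaN. To see this, compare the results of the following two functions: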

def fill_nan1(df):
    col_list = df.columns.tolist()
    for col in col_list:
        df[col].fillna(df[col].mean(),inplace=True)
    return df

def fill_nan2(df):
    for col in df.columns[df.isnull().any(axis=0)]:
        df[col].fillna(df[col].mean(),inplace=True)
    return df

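In fill_nan1(), .fillna() is applied to every column of df (which is what happens in your case), whereas in fill_nan2() it is applied only to the columns that contain NaN. The timeit results for both are as follows: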

>>%timeit fill_nan1(df)
2.35 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>%timeit fill_nan2(df)
938 µs ± 8.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
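The selective idea can also be written without the chained df[col].fillna(..., inplace=True) pattern, which recent pandas versions warn about. The following is a minimal sketch, not a benchmarked replacement: the name fill_nan_selected and the sample frame are assumptions for illustration, and I have not timed it against the two functions above.

```python
import numpy as np
import pandas as pd


def fill_nan_selected(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical variant of fill_nan2 that assigns the result back explicitly."""
    # Only the columns that contain at least one NaN.
    nan_cols = df.columns[df.isna().any(axis=0)]
    if len(nan_cols) > 0:
        # Series of per-column means, indexed by column name; fillna
        # aligns it with the columns of the selected sub-frame.
        means = df[nan_cols].mean()
        df[nan_cols] = df[nan_cols].fillna(means)
    return df


# Illustrative data only.
sample = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [1.0, 2.0, 3.0]})
result = fill_nan_selected(sample)
```

Like the functions above, this modifies the frame that is passed in and returns it.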

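Also, if this is for ML purposes, split the data into training and test sets before filling the NaN values, since this not only speeds up the computation but also avoids incorrect imputation.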
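One way to read that advice is sketched below: compute the column means on the training rows only, then reuse them for both sets. The frame, the column names x and y, and the positional split are all made up for illustration; a real split would normally be randomized.

```python
import numpy as np
import pandas as pd

# Illustrative data only.
data = pd.DataFrame(
    {
        "x": [1.0, np.nan, 3.0, 5.0, np.nan, 100.0],
        "y": [2.0, 4.0, np.nan, 6.0, 8.0, np.nan],
    }
)

# Simple positional split: first four rows for training, the rest for testing.
train = data.iloc[:4]
test = data.iloc[4:]

# Means come from the training rows only, so the test rows do not
# influence the values used for imputation.
train_means = train.mean()

train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
```

Here the missing x value in the test set is filled with 3.0, the training mean, rather than a value pulled upward by the 100.0 in the test rows.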