Pandas quantile function not returning the correct number of given quantiles

Question

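I have a dataframe with more than 2,000 records that contains several columns holding different balances. I want to assign each record to a bucket based on its balance.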

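I am trying to split each balance column into quantiles at the following cut points: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9. Specifically, I want to convert the balances into these buckets: top 10%, top 20%, top 30%, and so on.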

If I understand correctly, as long as there are more than 10 records, it should place each record in a percentile using linear interpolation.

So I run the following:

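import pandas as pd

# df is the dataframe of balances described above
score_quantiles = df.quantile(q=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
# to_dict() returns a new dict of the form {column: {quantile: value}}, so keep the result
score_quantiles = score_quantiles.to_dict()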



# Arguments: x = value, p = field (i.e. bal1, bal2, bal3), d = score_quantiles

def dlpScore(x, p, d):
    if pd.isnull(x):
        return 0
    elif int(x) == 0:
        return 0
    elif x <= d[p][0.1]:
        return 1
    elif x <= d[p][0.2]:
        return 2
    elif x <= d[p][0.3]: 
        return 3
    elif x <= d[p][0.4]: 
        return 4
    elif x <= d[p][0.5]: 
        return 5
    elif x <= d[p][0.6]: 
        return 6
    elif x <= d[p][0.7]: 
        return 7
    elif x <= d[p][0.8]: 
        return 8
    elif x <= d[p][0.9]: 
        return 9
    else:
        return 10



df['SCORE_BAL1'] = df['bal1'].apply(dlpScore, args=('bal1',score_quantiles,))

The problem is that on some columns it gives me all of the buckets, while on others it gives me only a few.

Is there a way to make sure it creates all of the buckets? I might be missing something here.
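One possible cause of the missing buckets, sketched here under the assumption that a column contains many repeated balances: when ties are common, several of the requested quantiles evaluate to the same number, so a chain of `x <= cut` comparisons can never return the buckets in between. The series `bal` below is made up for illustration, and `np.searchsorted` stands in for the if/elif chain (the null and zero branches are left out).

```python
import numpy as np
import pandas as pd

# Hypothetical column: 70 of the 100 balances share the same value.
bal = pd.Series([100] * 70 + list(range(101, 131)))

qs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cuts = bal.quantile(q=qs)  # linear interpolation is the default
print(cuts)

# Nine cut points were requested, but only a few are distinct.
n_distinct_cuts = cuts.nunique()
print("distinct cut points:", n_distinct_cuts)

# Equivalent of the if/elif chain: first cut point that is >= x, plus one.
buckets = np.searchsorted(cuts.to_numpy(), bal.to_numpy(), side="left") + 1
used_buckets = sorted(set(buckets.tolist()))
print("buckets actually assigned:", used_buckets)
```

If a real column shows the same pattern, printing `score_quantiles` for that column should reveal repeated values among the cut points.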

Answer 1

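If you want to make sure you get a similar distribution across the buckets, you might want to try the pandas qcut function. See the pandas qcut documentation for the full details.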

To use it in your code and get deciles, for example, you can do this:

n_buckets=10
df['quantile'] = pd.qcut(df['target_column'], q=n_buckets)

If you want to apply specific labels, you can do this:

n_buckets=10
df['quantile'] = pd.qcut(df['target_column'], q=n_buckets, labels=range(1,n_buckets+1))
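To check that the buckets really come out similar in size, one option is to count the rows per label. This is only a sketch on synthetic data: the random `target_column` below is an assumption standing in for a real balance column with no repeated values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a balance column (continuous, so no ties).
rng = np.random.default_rng(0)
df = pd.DataFrame({"target_column": rng.normal(size=1000)})

n_buckets = 10
df["quantile"] = pd.qcut(
    df["target_column"],
    q=n_buckets,
    labels=list(range(1, n_buckets + 1)),
)

# Rows per bucket, ordered by label.
counts = df["quantile"].value_counts().sort_index()
print(counts)
```

With 1,000 distinct values and 10 buckets, each label should cover 100 rows; columns with heavy ties behave differently.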

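PS: Note that in the latter case, if qcut cannot generate the desired number of quantiles (for example, because there are not enough distinct values to create them), you will get an exception when more labels are passed than there are quantiles.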
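The failure mode described in the PS, and two possible ways around it, can be sketched as follows. The tied series `bal` is hypothetical. Both workarounds are suggestions rather than part of the original answer: `duplicates="drop"` merges identical bin edges and therefore yields fewer buckets, while ranking with `method="first"` breaks ties by order of appearance, so equal balances can land in different buckets.

```python
import pandas as pd

# Hypothetical column with heavy ties: 70 of 100 balances are identical.
bal = pd.Series([100] * 70 + list(range(101, 131)))
labels = list(range(1, 11))

# Ten labels requested, but several decile edges coincide.
error = None
try:
    pd.qcut(bal, q=10, labels=labels)
except ValueError as exc:
    error = exc
    print("qcut failed:", exc)

# Option 1: merge duplicate edges and accept fewer buckets.
# labels=False returns integer bucket codes instead of custom labels.
codes = pd.qcut(bal, q=10, labels=False, duplicates="drop")
print("buckets after dropping duplicate edges:", codes.nunique())

# Option 2: rank first so that every value is distinct, then cut the ranks.
ranked = bal.rank(method="first")
deciles = pd.qcut(ranked, q=10, labels=labels)
decile_counts = deciles.value_counts().sort_index()
print(decile_counts)
```

Which option fits depends on whether identical balances must share a bucket (option 1) or every bucket must be populated (option 2).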

Tags: python, numpy, linear-interpolation, pandas