Need to add a new column to a pandas data frame based on a rule applied to a particular column
I have a data frame in pandas (using Python 3.7) as shown below:
print("DATA FRAME DATA= \n",bin_data_df_sorted.head(5))
# OUTPUT:
# DATA FRAME DATA=
# actuals probability
# 0 0.0 0.116375
# 1 0.0 0.239069
# 2 1.0 0.591988
# 3 0.0 0.273709
# 4 1.0 0.929855
I need to add an extra column named 'bucket' such that:
If the probability value is in (0, 0.1), then bucket=1
If the probability value is in (0.1, 0.2), then bucket=2
If the probability value is in (0.2, 0.3), then bucket=3
If the probability value is in (0.3, 0.4), then bucket=4
If the probability value is in (0.4, 0.5), then bucket=5
If the probability value is in (0.5, 0.6), then bucket=6
If the probability value is in (0.6, 0.7), then bucket=7
If the probability value is in (0.7, 0.8), then bucket=8
If the probability value is in (0.8, 0.9), then bucket=9
If the probability value is in (0.9, 1), then bucket=10
So the output should look like this:
# actuals probability bucket
# 0 0.0 0.116375 2
# 1 0.0 0.239069 3
# 2 1.0 0.591988 6
# 3 0.0 0.273709 3
# 4 1.0 0.929855 10
How can we do this?
Note: I tried the following code, but it does not work properly.
> for val in bin_data_df_sorted['probability']:
>     if val >= 0.0 and val <=0.1:
>         bin_data_df_sorted['bucket']=1
>     elif val > 0.1 and val <=0.2:
>         bin_data_df_sorted['bucket']=2
>     elif val > 0.2 and val <=0.3:
>         bin_data_df_sorted['bucket']=3
and so on..
You can use pd.cut:
import numpy as np
import pandas as pd

# Bin edges 0.0, 0.1, ..., 1.0; labels are the upper edges times 10
bins = np.arange(0, 1.1, 0.1)
df['bucket'] = pd.cut(df.probability, bins, labels=(bins*10)[1:])
actuals probability bucket
0 0.0 0.116375 2.0
1 0.0 0.239069 3.0
2 1.0 0.591988 6.0
3 0.0 0.273709 3.0
4 1.0 0.929855 10.0
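The labels above come out as floats (2.0, 3.0, ...), while the question asks for integers. Here is a minimal sketch of one way to get integer buckets, assuming every probability lies in (0, 1]. The frame is rebuilt from the five rows in the question, and np.linspace plus the integer label array are substitutions of mine, not part of the answer above:

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "actuals": [0.0, 0.0, 1.0, 0.0, 1.0],
    "probability": [0.116375, 0.239069, 0.591988, 0.273709, 0.929855],
})

bins = np.linspace(0, 1, 11)   # 11 edges: 0.0, 0.1, ..., 1.0
labels = np.arange(1, 11)      # integer labels 1..10, one per interval

# pd.cut returns a categorical; astype(int) converts it to plain integers
# (this would fail on NaN, i.e. on values outside the bins)
df["bucket"] = pd.cut(df["probability"], bins=bins, labels=labels).astype(int)
print(df)
```

np.linspace takes the number of edges directly, which sidesteps the floating-point stop value that np.arange(0, 1.1, 0.1) relies on.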
Details
pd.cut bins the values of a series into discrete intervals, so you need to specify the bin edges. You can do so with:
bins = np.arange(0,1.1, 0.1)
# array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])
You also need labels for the returned bins, which in this case can be generated from the same bins:
(bins*10)[1:]
# array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
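One detail worth checking is how values that sit exactly on an edge are handled. By default pd.cut uses right-closed intervals, so 0.1 goes to bucket 1, and a probability of exactly 0.0 falls outside the first interval and becomes NaN unless include_lowest=True is passed. A small sketch with made-up edge values (the series edge_cases is hypothetical, not taken from the question):

```python
import numpy as np
import pandas as pd

bins = np.linspace(0, 1, 11)   # edges 0.0, 0.1, ..., 1.0
labels = np.arange(1, 11)      # integer labels 1..10

# Hypothetical values chosen to sit exactly on bin edges
edge_cases = pd.Series([0.0, 0.1, 0.5, 1.0])

# Default: intervals are (left, right], so 0.0 is not covered
default = pd.cut(edge_cases, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
with_lowest = pd.cut(edge_cases, bins=bins, labels=labels, include_lowest=True)

print(default.tolist())
print(with_lowest.tolist())
```

This matches the `val >= 0.0 and val <= 0.1` condition in the question's loop, which treats 0.0 as part of bucket 1.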