Creating bins based on condition
My original dataset looks like the sample below:
| id | old_a | new_a | old_b | new_b | ratio_a | ratio_b |
|----|-------|-------|-------|-------|----------|---------|
| 1 | 350 | 6 | 35 | 0 | 58.33333 | Inf |
| 2 | 164 | 79 | 6 | 2 | 2.075949 | 3 |
| 3 | 10 | 0 | 1 | 1 | Inf | 1 |
| 4 | 120 | 1 | 10 | 0 | 120 | Inf |
Here is the DataFrame:
import pandas as pd

data = [[1, 350, 6, 35, 0], [2, 164, 79, 6, 2], [3, 10, 0, 1, 1], [4, 120, 1, 10, 0]]
df = pd.DataFrame(data, columns=['id', 'old_a', 'new_a', 'old_b', 'new_b'])
I computed the 'ratio_a' and 'ratio_b' columns (as shown in the table) with the following code:
df['ratio_a'] = df['old_a'] / df['new_a']
df['ratio_b'] = df['old_b'] / df['new_b']
Next, I want to create two more columns that hold the numeric range each ratio_a and ratio_b value falls into. To do that, I wrote the following code:
bins = [0,10,20,30,40,50,60,70,80,90,100]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
df['a_range'] = pd.cut(df['ratio_a'], bins=bins, labels=labels, include_lowest=True)
df['b_range'] = pd.cut(df['ratio_b'], bins=bins, labels=labels, include_lowest=True)
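For context, here is a minimal sketch of what these bins do with a value above 100. The `sample` Series is made up for illustration (its values are taken from the ratio_a column); the point is only that `pd.cut` returns NaN for anything beyond the outermost bin edge.

```python
import pandas as pd

bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]

# Hypothetical sample: three finite ratios, one of them above the last edge.
sample = pd.Series([2.075949, 58.33333, 120.0])
ranges = pd.cut(sample, bins=bins, labels=labels, include_lowest=True)

# 120.0 lies beyond the last edge (100), so it gets no label (NaN).
print(ranges.tolist())
```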
The problem is that any value in ratio_a or ratio_b greater than 100 should fall into a ">100" bucket. How can I do that?
My final result should look like this:
| id | old_a | new_a | old_b | new_b | ratio_a | ratio_b | a_range | b_range |
|----|-------|-------|-------|-------|----------|---------|---------|---------|
| 1 | 350 | 6 | 35 | 0 | 58.33333 | Inf | 50-60 | NaN |
| 2 | 164 | 79 | 6 | 2 | 2.075949 | 3 | 0-10 | 0-10 |
| 3 | 10 | 0 | 1 | 1 | Inf | 1 | NaN | 0-10 |
| 4 | 120 | 1 | 10 | 0 | 120 | Inf | >100 | NaN |
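One possible solution is to append np.inf as a final bin edge and rename the last label:
import numpy as np

# Append np.inf as the last edge so values above 100 get their own bin.
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, np.inf]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
# Rename the last label from '100-inf' to '>100'.
labels[-1] = '>100'
df['a_range'] = pd.cut(df['ratio_a'], bins=bins, labels=labels, include_lowest=True)
df['b_range'] = pd.cut(df['ratio_b'], bins=bins, labels=labels, include_lowest=True)
Result: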
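| id | old_a | new_a | old_b | new_b | ratio_a | ratio_b | a_range | b_range |
|----|-------|-------|-------|-------|---------|---------|---------|---------|
| 1 | 350 | 6 | 35 | 0 | 58.333333 | inf | 50-60 | NaN |
| 2 | 164 | 79 | 6 | 2 | 2.075949 | 3.0 | 0-10 | 0-10 |
| 3 | 10 | 0 | 1 | 1 | inf | 1.0 | NaN | 0-10 |
| 4 | 120 | 1 | 10 | 0 | 120.000000 | inf | >100 | NaN |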
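If rows whose ratio is infinite (division by zero) must stay NaN, as in the desired output, one option is to convert inf to NaN explicitly before binning rather than relying on how `pd.cut` treats a value equal to the np.inf edge. The sketch below is self-contained and rebuilds the sample data. The loop over `'a'` and `'b'` and the name `finite_ratio` are my own illustration, and treating infinite ratios as missing is an assumption about the intended result.

```python
import numpy as np
import pandas as pd

rows = [[1, 350, 6, 35, 0], [2, 164, 79, 6, 2], [3, 10, 0, 1, 1], [4, 120, 1, 10, 0]]
df = pd.DataFrame(rows, columns=['id', 'old_a', 'new_a', 'old_b', 'new_b'])

# Same edges as above, with np.inf appended to catch everything over 100.
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, np.inf]
labels = ['{}-{}'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
labels[-1] = '>100'

for col in ['a', 'b']:
    df['ratio_' + col] = df['old_' + col] / df['new_' + col]
    # Assumption: an infinite ratio (new_* == 0) should be left unbinned,
    # so it is turned into NaN for binning only; ratio_* keeps the inf.
    finite_ratio = df['ratio_' + col].replace([np.inf, -np.inf], np.nan)
    df[col + '_range'] = pd.cut(finite_ratio, bins=bins, labels=labels,
                                include_lowest=True)

print(df)
```

The ratio columns still show inf, while the range columns are NaN for those rows and ">100" only for finite values above 100.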