Create column based on multiple column conditions from another dataframe
Suppose I have two dataframes: conditions and data.
import pandas as pd

conditions = pd.DataFrame({'class': [1,2,3,4,4,5,5,4,4,5,5,5],
                           'primary_lower': [0,0,0,160,160,160,160,160,160,160,160,800],
                           'primary_upper': [9999,9999,9999,480,480,480,480,480,480,480,480,4000],
                           'secondary_lower': [0,0,0,3500,6100,3500,6100,0,4800,0,4800,10],
                           'secondary_upper': [9999,9999,9999,4700,9999,4700,9999,4699,6000,4699,6000,3000],
                           'group': ['A','A','A','B','B','B','B','C','C','C','C','C']})

data = pd.DataFrame({'class': [1,1,4,4,5,5,2],
                     'primary': [2000,9100,1100,170,300,210,1000],
                     'secondary': [1232,3400,2400,380,3600,4800,8600]})
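I want to generate a new column (group) in the "data" table that assigns a group to each row, based on the conditions provided in the "conditions" table.
The conditions table is structured so that, within each group, rows are joined by "OR" and columns are joined by "AND". For example, to assign group "B":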
(class = 4 AND 160<=primary<=480 AND 3500<=secondary<=4700)
OR
(class = 4 AND 160<=primary<=480 AND 6100<=secondary<=9999)
OR
(class = 5 AND 160<=primary<=480 AND 3500<=secondary<=4700)
OR
(class = 5 AND 160<=primary<=480 AND 6100<=secondary<=9999)
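To make the AND/OR reading concrete, here is a minimal sketch that writes the group "B" rule out by hand as a boolean mask over the sample data. The helper name clause is hypothetical and only used for illustration; it is not part of pandas or of the original post.

```python
import pandas as pd

data = pd.DataFrame({'class': [1, 1, 4, 4, 5, 5, 2],
                     'primary': [2000, 9100, 1100, 170, 300, 210, 1000],
                     'secondary': [1232, 3400, 2400, 380, 3600, 4800, 8600]})

def clause(cls, p_lo, p_hi, s_lo, s_hi):
    """One condition row: the three column tests are ANDed (bounds inclusive)."""
    return (data['class'].eq(cls)
            & data['primary'].between(p_lo, p_hi)
            & data['secondary'].between(s_lo, s_hi))

# The four condition rows of group "B" are ORed together
mask_b = (clause(4, 160, 480, 3500, 4700)
          | clause(4, 160, 480, 6100, 9999)
          | clause(5, 160, 480, 3500, 4700)
          | clause(5, 160, 480, 6100, 9999))

print(data[mask_b])
```

With this sample data, only the row with class 5, primary 300, secondary 3600 satisfies the group "B" rule.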
Any row that does not match any of the conditions is assigned to group "Other". The final dataframe would therefore look like this:
+-------+---------+-----------+-------+
| class | primary | secondary | group |
+-------+---------+-----------+-------+
| 1 | 2000 | 1232 | A |
| 1 | 9100 | 3400 | A |
| 4 | 1100 | 2400 | Other |
| 4 | 170 | 380 | C |
| 5 | 300 | 3600 | B |
| 5 | 210 | 4800 | C |
| 2 | 1000 | 8600 | A |
+-------+---------+-----------+-------+
You can iterate over a GroupBy object and take the union of the masks within each group:
cols = ['class', 'primary_lower', 'primary_upper',
        'secondary_lower', 'secondary_upper']

for key, grp in conditions.groupby('group'):
    # One boolean mask per condition row; the column tests are ANDed
    masks = (data['class'].eq(cls) &
             data['primary'].between(prim_lower, prim_upper) &
             data['secondary'].between(sec_lower, sec_upper)
             for cls, prim_lower, prim_upper, sec_lower, sec_upper in
             grp[cols].itertuples(index=False))
    # OR the masks of the group together; a later group overwrites an earlier one
    data.loc[pd.concat(masks, axis=1).any(axis=1), 'group'] = key

# Rows that matched no condition are still NaN at this point
data['group'] = data['group'].fillna('Other')
Result:
print(data)

   class  primary  secondary  group
0      1     2000       1232      A
1      1     9100       3400      A
2      4     1100       2400  Other
3      4      170        380      C
4      5      300       3600      C
5      5      210       4800      C
6      2     1000       8600      A
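Note that index=4 differs from your desired output because that row satisfies more than one condition: it matches both group "B" (3500<=secondary<=4700) and group "C" (0<=secondary<=4699), and the group processed last, "C", wins.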
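If the intent is instead that the first matching row of the conditions table should win, one possible alternative is a merge on class followed by a range filter. This is only a sketch under that assumption (the question does not state a tie-breaking rule), and the intermediate names merged, hit and first_hit are made up for illustration.

```python
import pandas as pd

conditions = pd.DataFrame({
    'class': [1, 2, 3, 4, 4, 5, 5, 4, 4, 5, 5, 5],
    'primary_lower': [0, 0, 0, 160, 160, 160, 160, 160, 160, 160, 160, 800],
    'primary_upper': [9999, 9999, 9999, 480, 480, 480, 480, 480, 480, 480, 480, 4000],
    'secondary_lower': [0, 0, 0, 3500, 6100, 3500, 6100, 0, 4800, 0, 4800, 10],
    'secondary_upper': [9999, 9999, 9999, 4700, 9999, 4700, 9999, 4699, 6000, 4699, 6000, 3000],
    'group': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C']})

data = pd.DataFrame({'class': [1, 1, 4, 4, 5, 5, 2],
                     'primary': [2000, 9100, 1100, 170, 300, 210, 1000],
                     'secondary': [1232, 3400, 2400, 380, 3600, 4800, 8600]})

# Pair each data row with every condition row of the same class.
# reset_index() keeps the original row label in a column named 'index'.
merged = data.reset_index().merge(conditions, on='class', how='left')

# Keep only the pairs whose inclusive range tests both pass
hit = (merged['primary'].ge(merged['primary_lower'])
       & merged['primary'].le(merged['primary_upper'])
       & merged['secondary'].ge(merged['secondary_lower'])
       & merged['secondary'].le(merged['secondary_upper']))

# Assumption: the first matching condition row (in table order) decides the group
first_hit = merged[hit].drop_duplicates('index').set_index('index')['group']

data['group'] = first_hit.reindex(data.index).fillna('Other')
print(data)
```

On the sample data this assigns "B" to index=4, because the "B" rows come before the "C" rows in the conditions table. Note that the merge builds one row per (data row, condition row) pair within a class, so it can use noticeably more memory than the loop when the conditions table is large.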