Feature engineering salary data using a categorical column as a condition
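Based on a categorical column, I need to convert the salary amount to an annualized salary:
- 'M' - monthly
- 'Y' - yearly
- 'W' - weekly
- 'B' - biweekly (every two weeks)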
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                   'sal_amt': [4500, 50000, 2000, 3000, 5000],
                   'sal_md': ['M', 'Y', 'W', 'B', 'M']})
df.head()
# defined a function for my problem...
def func(row):
    if row['sal_md'] == 'M':
        return row['sal_amt'] * 12
    elif row['sal_md'] == 'Y':
        return row['sal_amt']
    elif row['sal_md'] == 'H':
        return row['sal_amt'] * 8760
    elif row['sal_md'] == 'W':
        return row['sal_amt'] * 52
    elif row['sal_md'] == 'B':
        return row['sal_amt'] * 26
    elif row['sal_md'] == 'S':
        return row['sal_amt']
    elif row['sal_md'] == 'A':
        return row['sal_amt']

df['sal_annual'] = df.apply(func, axis=1)
https://i.stack.imgur.com/INXva.png
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D', 'E'],
                           'sal_amt': [4500, 50000, 2000, 3000, 5000],
                           'sal_md': ['M', 'Y', 'W', 'B', 'M']})
In [3]: multiplier_dict = {'M': 12, 'Y': 1, 'W': 52, 'B': 26}
In [4]: df['sal_multiplier'] = df.sal_md.map(multiplier_dict)
In [5]: df['sal_annual'] = df.sal_amt * df.sal_multiplier
In [6]: df.head()
Out[6]:
  Name  sal_amt sal_md  sal_multiplier  sal_annual
0    A     4500      M              12       54000
1    B    50000      Y               1       50000
2    C     2000      W              52      104000
3    D     3000      B              26       78000
4    E     5000      M              12       60000
Not exactly what you asked, but this solves your problem in a simple, Pythonic way.
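One thing to watch with the `map` approach: any code missing from the dictionary becomes NaN instead of raising an error. Below is a minimal sketch of how that could be checked. It assumes the extra codes 'H', 'S' and 'A' should keep the multipliers from the question's function; the row with code 'Q' and the names `periods_per_year` and `unknown_codes` are made up for illustration.

```python
import pandas as pd

# Sample data from the question, plus one extra row whose code ('Q')
# is hypothetical and deliberately absent from the mapping.
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'sal_amt': [4500, 50000, 2000, 3000, 5000, 1000],
    'sal_md': ['M', 'Y', 'W', 'B', 'M', 'Q'],
})

# Multipliers copied from the question's function, including 'H', 'S', 'A'.
periods_per_year = {'M': 12, 'Y': 1, 'H': 8760, 'W': 52, 'B': 26, 'S': 1, 'A': 1}

# map() leaves NaN wherever the code has no entry in the dictionary.
multiplier = df['sal_md'].map(periods_per_year)

# Collect the codes that were not mapped so they can be reviewed.
unknown_codes = sorted(df.loc[multiplier.isna(), 'sal_md'].unique())

# Rows with an unknown code get NaN here, so the column becomes float.
df['sal_annual'] = df['sal_amt'] * multiplier

print(unknown_codes)
print(df)
```

Whether unknown codes should stay as NaN, be dropped, or raise an error depends on the data; checking `unknown_codes` before using `sal_annual` at least makes the gap visible.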