How to Put Categorical Data in Bins
I have the following categorical data:
['Self employed', 'Government Dependent',
'Formally employed Private', 'Informally employed',
'Formally employed Government', 'Farming and Fishing',
'Remittance Dependent', 'Other Income',
'Dont Know/Refuse to answer', 'No Income']
How can I put them into bins so that:
['Government Dependent','Formally employed Government','Formally employed Private'] = 0
['Remittance Dependent', 'Informally employed','Self employed','Other Income'] = 1
['Dont Know/Refuse to answer', 'No Income','Farming and Fishing'] = 2
I already know how to put numerical data into categorical bins. Can it be done the other way around?
import pandas as pd

TRAIN = pd.read_csv("Train_v2.csv")
TRAIN['job_type'].unique()
Output:
array(['Self employed', 'Government Dependent',
'Formally employed Private', 'Informally employed',
'Formally employed Government', 'Farming and Fishing',
'Remittance Dependent', 'Other Income',
'Dont Know/Refuse to answer', 'No Income'], dtype=object)
First create a dictionary, swap its keys and values, then use Series.map:
a = ['Self employed', 'Government Dependent',
'Formally employed Private', 'Informally employed',
'Formally employed Government', 'Farming and Fishing',
'Remittance Dependent', 'Other Income',
'Dont Know/Refuse to answer', 'No Income']
TRAIN = pd.DataFrame({'job_type':a})
# add the remaining groups to the dict in the same way
# ('Dont' has no apostrophe, matching the spelling in the data)
d = {0: ['Government Dependent','Formally employed Government','Formally employed Private'],
     1: ['Remittance Dependent', 'Informally employed'],
     2: ['Dont Know/Refuse to answer', 'No Income']}
# swap keys and values: each category becomes a key that points to its bin
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
TRAIN['new'] = TRAIN['job_type'].map(d1)
print(TRAIN)
                       job_type  new
0                 Self employed  NaN
1          Government Dependent  0.0
2     Formally employed Private  0.0
3           Informally employed  1.0
4  Formally employed Government  0.0
5           Farming and Fishing  NaN
6          Remittance Dependent  1.0
7                  Other Income  NaN
8    Dont Know/Refuse to answer  2.0
9                     No Income  2.0
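To make the dictionary approach concrete, here is a minimal sketch that fills in all three bins as requested in the question and then checks that nothing was left unmapped. The names `bins`, `category_to_bin`, `unmapped` and `job_bin` are illustrative only, and the sketch assumes the data uses the spelling 'Dont' shown in the `unique()` output.

```python
import pandas as pd

# Sample frame with the ten categories from the question
# (spelled 'Dont', as in the unique() output).
jobs = ['Self employed', 'Government Dependent',
        'Formally employed Private', 'Informally employed',
        'Formally employed Government', 'Farming and Fishing',
        'Remittance Dependent', 'Other Income',
        'Dont Know/Refuse to answer', 'No Income']
TRAIN = pd.DataFrame({'job_type': jobs})

# Bin -> categories, following the grouping requested in the question.
bins = {
    0: ['Government Dependent', 'Formally employed Government',
        'Formally employed Private'],
    1: ['Remittance Dependent', 'Informally employed',
        'Self employed', 'Other Income'],
    2: ['Dont Know/Refuse to answer', 'No Income', 'Farming and Fishing'],
}

# Invert to category -> bin so Series.map can look each value up.
category_to_bin = {cat: b for b, cats in bins.items() for cat in cats}

mapped = TRAIN['job_type'].map(category_to_bin)

# Categories missing from the dict come back as NaN; list them before casting.
unmapped = TRAIN.loc[mapped.isna(), 'job_type'].unique().tolist()
print(unmapped)  # [] when every category has a bin

# With no NaN left, the column can safely be stored as integers.
TRAIN['job_bin'] = mapped.astype(int)
print(TRAIN['job_bin'].tolist())  # [1, 0, 0, 1, 0, 2, 1, 1, 2, 2]
```

Checking `unmapped` first is a cheap way to catch spelling mismatches such as "Don't" versus "Dont" before they silently turn into NaN.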
If the output has only a few values (here 0, 1, 2 and NaN), numpy.select also works, but with many groups it becomes complicated and slow:
import numpy as np

m1 = TRAIN['job_type'].isin(['Government Dependent','Formally employed Government','Formally employed Private'])
m2 = TRAIN['job_type'].isin(['Remittance Dependent', 'Informally employed'])
m3 = TRAIN['job_type'].isin(['Dont Know/Refuse to answer', 'No Income'])
# rows that match none of the masks get the default value, NaN
TRAIN['new'] = np.select([m1, m2, m3], [0, 1, 2], np.nan)
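If numpy.select is still preferred, the masks do not have to be written by hand. The sketch below builds them from a bin-to-categories dict; the names `groups`, `conditions` and `labels` and the four-row sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; 'Self employed' is deliberately left out of every group.
TRAIN = pd.DataFrame({'job_type': ['Government Dependent', 'Informally employed',
                                   'No Income', 'Self employed']})

groups = {
    0: ['Government Dependent', 'Formally employed Government',
        'Formally employed Private'],
    1: ['Remittance Dependent', 'Informally employed'],
    2: ['Dont Know/Refuse to answer', 'No Income'],
}

# One boolean mask per group, in the same order as the labels.
conditions = [TRAIN['job_type'].isin(cats) for cats in groups.values()]
labels = list(groups.keys())

# Rows that match no mask fall through to the default, NaN.
TRAIN['new'] = np.select(conditions, labels, default=np.nan)
print(TRAIN['new'].tolist())
```

This keeps the group definitions in one place, although each group still costs one `isin` pass over the column, so the dictionary plus `Series.map` route is likely to stay simpler when there are many groups.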
You can also use np.where and set np.nan for values that do not belong to category 0, 1 or 2:
list_0 = ['Government Dependent','Formally employed Government','Formally employed Private']
list_1 = ['Remittance Dependent', 'Informally employed']
list_2 = ['Dont Know/Refuse to answer', 'No Income']
# start with bin 0 and NaN elsewhere, then fill bins 1 and 2 while keeping the earlier results
TRAIN['job_type_bin'] = np.where(TRAIN['job_type'].isin(list_0), 0, np.nan)
TRAIN['job_type_bin'] = np.where(TRAIN['job_type'].isin(list_1), 1, TRAIN['job_type_bin'])
TRAIN['job_type_bin'] = np.where(TRAIN['job_type'].isin(list_2), 2, TRAIN['job_type_bin'])
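Each np.where call returns a complete column, so separate calls only work if every later call carries the earlier result forward. Nesting the calls is one way to make that explicit. The sketch below assumes the same three lists and uses a made-up four-row frame; the variable `job` is just a shorthand.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; 'Farming and Fishing' is in none of the lists.
TRAIN = pd.DataFrame({'job_type': ['Formally employed Private', 'Remittance Dependent',
                                   'No Income', 'Farming and Fishing']})

list_0 = ['Government Dependent', 'Formally employed Government',
          'Formally employed Private']
list_1 = ['Remittance Dependent', 'Informally employed']
list_2 = ['Dont Know/Refuse to answer', 'No Income']

job = TRAIN['job_type']

# Read as: bin 0 if in list_0, else bin 1 if in list_1,
# else bin 2 if in list_2, else NaN.
TRAIN['job_type_bin'] = np.where(
    job.isin(list_0), 0,
    np.where(job.isin(list_1), 1,
             np.where(job.isin(list_2), 2, np.nan)))
print(TRAIN['job_type_bin'].tolist())
```

The nesting gets hard to read beyond three or four groups, which is where a dictionary lookup becomes the more practical choice.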