Rename the less frequent categories to "OTHER" (Python)
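In my dataframe I have some categorical columns containing more than 100 different categories. I want to rank the categories by frequency, keep the 9 most frequent ones, and automatically rename the less frequent ones to OTHER.
Example:
Here is my df: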
print(df)
Employee_number Jobrol
0 1 Sales Executive
1 2 Research Scientist
2 3 Laboratory Technician
3 4 Sales Executive
4 5 Research Scientist
5 6 Laboratory Technician
6 7 Sales Executive
7 8 Research Scientist
8 9 Laboratory Technician
9 10 Sales Executive
10 11 Research Scientist
11 12 Laboratory Technician
12 13 Sales Executive
13 14 Research Scientist
14 15 Laboratory Technician
15 16 Sales Executive
16 17 Research Scientist
17 18 Research Scientist
18 19 Manager
19 20 Human Resources
20 21 Sales Executive
valCount = df['Jobrol'].value_counts()
valCount
Sales Executive 7
Research Scientist 7
Laboratory Technician 5
Manager 1
Human Resources 1
Here I want to keep the top 3 categories and rename the rest to "OTHER". How should I proceed?
Thanks.
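Convert your series to categorical, extract the categories whose counts are not in the top 3, add a new category, e.g. 'Other', and then replace the previously extracted categories: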
df['Jobrol'] = df['Jobrol'].astype('category')
others = df['Jobrol'].value_counts().index[3:]
label = 'Other'
df['Jobrol'] = df['Jobrol'].cat.add_categories([label])
df['Jobrol'] = df['Jobrol'].replace(others, label)
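Note: it is tempting to combine categories by renaming them via df['Jobrol'].cat.rename_categories(dict.fromkeys(others, label)), but this won't work, because it would imply multiple identically labeled categories, which is not possible.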
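To make the note above concrete, here is a minimal sketch on a series with the same counts as the sample data. It first shows the rename attempt being rejected, then one way to do the replacement on a categorical series. The names jobs, lumped and rename_failed are made up for illustration, and Series.where is used in place of replace as an assumption about what behaves consistently across pandas versions.

```python
import pandas as pd

# Same category counts as the sample data: 7, 7, 5, 1, 1
jobs = (
    ["Sales Executive"] * 7
    + ["Research Scientist"] * 7
    + ["Laboratory Technician"] * 5
    + ["Manager", "Human Resources"]
)
s = pd.Series(jobs, name="Jobrol", dtype="category")

# Categories outside the top 3 by count
others = list(s.value_counts().index[3:])

# Mapping several categories onto one label is rejected,
# because category labels must stay unique.
try:
    s.cat.rename_categories(dict.fromkeys(others, "Other"))
    rename_failed = False
except ValueError:
    rename_failed = True

# Working route: register the new label first, then overwrite the rare values.
lumped = s.cat.add_categories(["Other"])
lumped = lumped.where(~lumped.isin(others), "Other")
# Drop the now-empty categories ('Manager', 'Human Resources').
lumped = lumped.cat.remove_unused_categories()

print(rename_failed)
print(lumped.value_counts())
```

The label has to be added with add_categories before it is assigned, because a categorical series refuses values that are not among its categories.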
The above solution can be adapted to filter by count. For example, to include only categories with a count of 1, you can define others like this:
counts = df['Jobrol'].value_counts()
others = counts[counts == 1].index
Use value_counts with numpy.where:
import numpy as np
# Labels of the 3 most frequent categories; every other value becomes 'OTHER'
need = df['Jobrol'].value_counts().index[:3]
df['Jobrol'] = np.where(df['Jobrol'].isin(need), df['Jobrol'], 'OTHER')
valCount = df['Jobrol'].value_counts()
print (valCount)
Research Scientist 7
Sales Executive 7
Laboratory Technician 5
OTHER 2
Name: Jobrol, dtype: int64
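Since the question mentions several columns and a top-9 cutoff, the same idea could be wrapped in a small helper. This is only a sketch: keep_top_n and its parameters are hypothetical names, it uses Series.where rather than np.where, and it assumes plain string (non-categorical) columns. When categories are tied at the cutoff, which ones survive depends on the order value_counts returns.

```python
import pandas as pd

def keep_top_n(df, columns, n=9, label="OTHER"):
    """Return a copy of df in which, for each listed column, every value
    outside the n most frequent ones is replaced by label."""
    out = df.copy()
    for col in columns:
        # value_counts sorts by frequency, so the first n labels are the top n
        top = out[col].value_counts().index[:n]
        # Keep values that are in the top n, replace the rest
        out[col] = out[col].where(out[col].isin(top), label)
    return out

# Rebuild the sample frame from the question
cycle = ["Sales Executive", "Research Scientist", "Laboratory Technician"]
jobs = cycle * 5 + [
    "Sales Executive", "Research Scientist", "Research Scientist",
    "Manager", "Human Resources", "Sales Executive",
]
df = pd.DataFrame({"Employee_number": range(1, 22), "Jobrol": jobs})

result = keep_top_n(df, ["Jobrol"], n=3)
print(result["Jobrol"].value_counts())
```

Working on a copy leaves the original frame untouched, which makes it easy to compare the counts before and after.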
Another solution, if you only need the aggregated counts (the column itself is left unchanged):
N = 3
s = df['Jobrol'].value_counts()
# Series.append was removed in pandas 2.0, so use pd.concat instead
valCount = pd.concat([s.iloc[:N], pd.Series(s.iloc[N:].sum(), index=['OTHER'])])
print (valCount)
Research Scientist 7
Sales Executive 7
Laboratory Technician 5
OTHER 2
dtype: int64
One-line solution, keeping the categories whose count exceeds limit:
limit = 500
df['Jobrol'] = df['Jobrol'].map({x[0]: x[0] if x[1] > limit else 'other' for x in dict(df['Jobrol'].value_counts()).items()})
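On the 21-row sample, limit = 500 would turn every value into 'other', so the sketch below uses limit = 1 to show the effect. It is an equivalent formulation rather than the answer's exact code: each row is mapped to the count of its own category and then filtered with Series.where. The names per_row_count and lumped are made up for illustration.

```python
import pandas as pd

# Same category counts as the sample data: 7, 7, 5, 1, 1
jobs = (
    ["Sales Executive"] * 7
    + ["Research Scientist"] * 7
    + ["Laboratory Technician"] * 5
    + ["Manager", "Human Resources"]
)
s = pd.Series(jobs, name="Jobrol")

limit = 1  # assumed threshold for this small sample

# For every row, look up how often that row's category occurs overall
per_row_count = s.map(s.value_counts())

# Keep the value when its category occurs more than `limit` times
lumped = s.where(per_row_count > limit, "other")

print(lumped.value_counts())
```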