How to replace a range of numbers in a column with a categorical value, given that the numbers are of type float
df['ratio_usage'] = np.where(df['ratio_usage'].between(0.9,0.1), 'Excellent', df['ratio_usage'])
df['ratio_usage'] = np.where(df['ratio_usage'].between(0.8,0.89), 'Very Good', df['ratio_usage'])
df['ratio_usage'] = np.where(df['ratio_usage'].between(0.7,0.79), 'Good', df['ratio_usage'])
df['ratio_usage'] = np.where(df['ratio_usage'].between(0.6,0.69), 'Fair', df['ratio_usage'])
df['ratio_usage'] = np.where(df['ratio_usage'].between(0.5,0.59), 'Satisfactory', df['ratio_usage'])
df['ratio_usage'] = np.where(df['ratio_usage'].between(0.4,0.49), 'Poor', df['ratio_usage'])
df['ratio_usage'] = np.where(df['ratio_usage'].between(0.3,0.0), 'Very Poor', df['ratio_usage'])
df['ratio_usage'] = np.where(df['ratio_usage'].between(1.01,2), 'Fatal', df['ratio_usage'])
df['ratio_usage'] = np.where(df['ratio_usage'].between(2.1,1000), 'Outliers', df['ratio_usage'])
The first line of code executes and does its replacement, but the second line then raises the following error:
TypeError Traceback (most recent call last)
<ipython-input-269-7ad3204ddca1> in <module>()
1 df['ratio_usage'] = np.where(df['ratio_usage'].between(0.9,0.1), 'Excellent', df['ratio_usage'])
----> 2 df['ratio_usage'] = np.where(df['ratio_usage'].between(0.8,0.89), 'Very Good', df['ratio_usage'])
3 df['ratio_usage'] = np.where(df['ratio_usage'].between(0.7,0.79), 'Good', df['ratio_usage'])
4 df['ratio_usage'] = np.where(df['ratio_usage'].between(0.6,0.69), 'Fair', df['ratio_usage'])
5 df['ratio_usage'] = np.where(df['ratio_usage'].between(0.5,0.59), 'Satisfactory', df['ratio_usage'])
~\Anaconda\lib\site-packages\pandas\core\series.py in between(self, left, right, inclusive)
3654 """
3655 if inclusive:
-> 3656 lmask = self >= left
3657 rmask = self <= right
3658 else:
~\Anaconda\lib\site-packages\pandas\core\ops.py in wrapper(self, other, axis)
1251
1252 with np.errstate(all='ignore'):
-> 1253 res = na_op(values, other)
1254 if is_scalar(res):
1255 raise TypeError('Could not compare {typ} type with Series'
~\Anaconda\lib\site-packages\pandas\core\ops.py in na_op(x, y)
1138
1139 elif is_object_dtype(x.dtype):
-> 1140 result = _comp_method_OBJECT_ARRAY(op, x, y)
1141
1142 elif is_datetimelike_v_numeric(x, y):
~\Anaconda\lib\site-packages\pandas\core\ops.py in _comp_method_OBJECT_ARRAY(op, x, y)
1117 result = libops.vec_compare(x, y, op)
1118 else:
-> 1119 result = libops.scalar_compare(x, y, op)
1120 return result
1121
pandas\_libs\ops.pyx in pandas._libs.ops.scalar_compare()
TypeError: '>=' not supported between instances of 'str' and 'float'
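The last line of the traceback hints at the cause. np.where has to fit the string label and the remaining floats into a single array, so after the first assignment ratio_usage is no longer a float column, and the next .between() call ends up comparing strings with floats. Note also that between(0.9, 0.1) has its bounds reversed: it tests 0.9 <= x <= 0.1, which matches nothing. Below is a minimal sketch of the dtype change, using made-up values and assuming the first range was meant to be 0.9 to 1.0:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype

# Hypothetical sample values; the real data is not shown in the question.
df = pd.DataFrame({'ratio_usage': [0.95, 0.8, 0.64]})
was_float = is_float_dtype(df['ratio_usage'])  # the column starts out as float64

# Mixing a string label with a float column forces a common non-float dtype
# (typically a NumPy string dtype, which also turns the untouched floats into text).
mixed = np.where(df['ratio_usage'].between(0.9, 1.0), 'Excellent', df['ratio_usage'])
df['ratio_usage'] = mixed

still_float = is_float_dtype(df['ratio_usage'])  # no longer a float column
first_value = df['ratio_usage'].iloc[0]          # the replaced label
print(was_float, still_float, first_value)
```

Any later numeric comparison on that column, such as .between(0.8, 0.89), then has to compare str with float, which is what the TypeError reports.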
Here is a solution using pd.cut. It is simplified, because I can't see your data and because your bins overlap, which you will need to reconcile.
Setup
import pandas as pd
df = pd.DataFrame({'ratio_usage': [0.05, 0.8, 0.64, 0.59, 0.31]})
ratio_usage
0 0.05
1 0.80
2 0.64
3 0.59
4 0.31
pd.cut with bins and labels
bins = [0.0, 0.2, 0.5, 0.7, 0.9, 1.0]
labels = ["bad", "kinda bad", "average", "kinda good", "good"]
pd.cut(df.ratio_usage, bins=bins, labels=labels)
0 bad
1 kinda good
2 average
3 average
4 kinda bad
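Applied to the labels from the question, the same idea might look like the sketch below. The bin edges are an assumption, since the original ranges leave gaps (for example between 0.3 and 0.4) that have to be assigned to one side or the other. right=False makes each bin closed on the left and open on the right, so 0.8 lands in 'Very Good' and 1.0 stays in 'Excellent'. The rating column name and the extra sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values covering every range mentioned in the question.
df = pd.DataFrame({'ratio_usage': [0.05, 0.8, 0.64, 0.59, 0.31, 1.0, 1.5, 3.0]})

# Assumed edges: each label covers [edge, next_edge), with no gaps or overlaps.
bins = [0.0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.01, 2.1, np.inf]
labels = ['Very Poor', 'Poor', 'Satisfactory', 'Fair', 'Good',
          'Very Good', 'Excellent', 'Fatal', 'Outliers']

# Write the categories to a new column so the float values stay available.
df['rating'] = pd.cut(df['ratio_usage'], bins=bins, labels=labels, right=False)
print(df)
```

Because all ranges are evaluated in one pass against the original float column, the column never turns into strings halfway through, which is what tripped up the chained np.where calls.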