TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']
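I have already consulted the related posts, so please do not mark this as a duplicate.
I am working on a binary classification problem, and my dataset contains both categorical and numerical columns.
However, some of the categorical columns hold a mix of numeric and string values. Even so, the values only represent category names.
For example, I have a column named biz_category whose values look like A, B, C, 4, 5, and so on.
I suspect the error below is caused by values such as 4 and 5.
So I tried the code below to convert the columns to the category dtype, but it still does not work.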
cols=X_train.select_dtypes(exclude='int').columns.to_list()
X_train[cols]=X_train[cols].astype('category')
My data info is shown below:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 683 entries, 21 to 965
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Feature_A 683 non-null category
1 Product Classification 683 non-null category
2 Industry 683 non-null category
3 DIVISION 683 non-null category
4 biz_category 683 non-null category
5 Country 683 non-null category
6 Product segment 683 non-null category
7 SUBREGION 683 non-null category
8 Quantity 1st year 683 non-null int64
dtypes: category(8), int64(1)
So, after the dtype conversion, when I run SMOTENC as shown below, I get an error:
print("Before OverSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before OverSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
cat_index = [0,1,2,3,4,5,6,7]
# import SMOTE module from imblearn library
# pip install imblearn (if you don't have imblearn in your system)
from imblearn.over_sampling import SMOTE, SMOTENC
sm = SMOTENC(categorical_features=cat_index,random_state = 2,sampling_strategy = 'minority')
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
This results in the following error:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\_encode.py in _unique_python(values, return_inverse)
    134
--> 135         uniques = sorted(uniques_set)
    136         uniques.extend(missing_values.to_list())

TypeError: '<' not supported between instances of 'str' and 'int'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
C:\Users\SATHAP~1\AppData\Local\Temp/ipykernel_31168/1931674352.py in <module>
      6 from imblearn.over_sampling import SMOTE, SMOTENC
      7 sm = SMOTENC(categorical_features=cat_index,random_state = 2,sampling_strategy = 'minority')
----> 8 X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
      9
     10 print('After OverSampling, the shape of train_X: {}'.format(X_train_res.shape))

~\AppData\Roaming\Python\Python39\site-packages\imblearn\base.py in fit_resample(self, X, y)
     81         )
     82
---> 83         output = self._fit_resample(X, y)
     84
     85         y = (

~\AppData\Roaming\Python\Python39\site-packages\imblearn\over_sampling\_smote\base.py in _fit_resample(self, X, y)
    511
    512         # the input of the OneHotEncoder needs to be dense
--> 513         X_ohe = self.ohe_.fit_transform(
    514             X_categorical.toarray() if sparse.issparse(X_categorical) else X_categorical
    515         )

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing\_encoders.py in fit_transform(self, X, y)
    486         """
    487         self._validate_keywords()
--> 488         return super().fit_transform(X, y)
    489
    490     def transform(self, X):

~\AppData\Roaming\Python\Python39\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
    850         if y is None:
    851             # fit method of arity 1 (unsupervised transformation)
--> 852             return self.fit(X, **fit_params).transform(X)
    853         else:
    854             # fit method of arity 2 (supervised transformation)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing\_encoders.py in fit(self, X, y)
    459         """
    460         self._validate_keywords()
--> 461         self._fit(X, handle_unknown=self.handle_unknown, force_all_finite="allow-nan")
    462         self.drop_idx_ = self._compute_drop_idx()
    463         return self

~\AppData\Roaming\Python\Python39\site-packages\sklearn\preprocessing\_encoders.py in _fit(self, X, handle_unknown, force_all_finite)
     92             Xi = X_list[i]
     93             if self.categories == "auto":
---> 94                 cats = _unique(Xi)
     95             else:
     96                 cats = np.array(self.categories[i], dtype=Xi.dtype)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\_encode.py in _unique(values, return_inverse)
     29     """
     30     if values.dtype == object:
---> 31         return _unique_python(values, return_inverse=return_inverse)
     32     # numerical
     33     out = np.unique(values, return_inverse=return_inverse)

~\AppData\Roaming\Python\Python39\site-packages\sklearn\utils\_encode.py in _unique_python(values, return_inverse)
    138     except TypeError:
    139         types = sorted(t.__qualname__ for t in set(type(v) for v in values))
--> 140         raise TypeError(
    141             "Encoders require their input to be uniformly "
    142             f"strings or numbers. Got {types}"

TypeError: Encoders require their input to be uniformly strings or numbers. Got ['int', 'str']
Should I also convert y_train to categorical? At the moment it is int64.
Please help.
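Cause of the problem
SMOTENC requires the values within each categorical or numerical column to share a single data type. In this case, that means no column, including your biz_category column, can contain mixed data types. Also, merely converting a column to the category dtype does not guarantee that the values inside it have a uniform data type.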
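To illustrate the point about the category dtype, here is a minimal sketch using a small made-up Series (the sample values are an assumption that mirrors the biz_category column from the question, not the asker's real data). It suggests that astype('category') only wraps the values, so the str and int objects are still there underneath, and sorting them fails in the same way as the first TypeError in the traceback.

```python
import pandas as pd

# Hypothetical sample mirroring the biz_category column from the question
s = pd.Series(["A", "B", "C", 4, 5], name="biz_category")

# Converting to the category dtype succeeds without complaint
as_cat = s.astype("category")

# The categories still hold both str and int objects
category_types = sorted({type(v).__name__ for v in as_cat.cat.categories})
print(category_types)

# Sorting the mixed values is the step that fails inside the encoder
try:
    sorted(set(s))
    sort_failed = False
except TypeError:
    sort_failed = True
print(sort_failed)
```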
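Possible solution
One possible solution is to re-encode the values in the columns that have mixed data types, for example with LabelEncoder. Note that LabelEncoder also expects uniformly typed input, so the values would need to be cast to a single type first. In your case, I think simply changing the dtype of those columns to string should be enough.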
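A minimal sketch of the string-cast approach, assuming a small stand-in DataFrame (the frame below and the names cat_cols and X_fixed are hypothetical, chosen only for illustration). It casts every categorical column to str so that each column holds a single Python type, then derives the positional indices that SMOTENC's categorical_features parameter expects.

```python
import pandas as pd

# Hypothetical stand-in for X_train; biz_category mixes str and int values
X_train = pd.DataFrame({
    "biz_category": ["A", "B", "C", 4, 5],
    "Country": ["US", "IN", "US", "DE", "IN"],
    "Quantity 1st year": [10, 20, 30, 40, 50],
})

# Assumed list of the categorical columns
cat_cols = ["biz_category", "Country"]

# Cast the categorical columns to str so every value has the same type
X_fixed = X_train.copy()
X_fixed[cat_cols] = X_fixed[cat_cols].astype(str)

# Check which Python types each categorical column now contains
value_types = {
    col: sorted({type(v).__name__ for v in X_fixed[col]}) for col in cat_cols
}
print(value_types)

# Positional indices of the categorical columns, for categorical_features
cat_index = [X_fixed.columns.get_loc(col) for col in cat_cols]
print(cat_index)
```

With the columns uniformly typed, X_fixed and cat_index can be passed to SMOTENC in the same way as in the question, for example SMOTENC(categorical_features=cat_index, random_state=2).fit_resample(X_fixed, y_train). The cast turns 4 and 5 into the labels '4' and '5', which is harmless here because the values only serve as category names.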