Python: converting string values to numerical with fit_transform(data)

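Hey, I've been stuck on this problem for 2 hours. Can someone explain why I'm getting this error? I'm supposed to convert string values (10,000 columns showing 3 states and a gender) to numerical values, but I don't know what the problem is. I saw someone do this on Udemy and it worked fine.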

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

X_train = sc.fit_transform(X_train)

Error:

Input In [15], in <cell line: 1>()
----> 1 X_train = sc.fit_transform(X_train)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\base.py:867, in TransformerMixin.fit_transform(self, X, y, **fit_params)
    863 # non-optimized default implementation; override when a better
    864 # method is possible for a given clustering algorithm
    865 if y is None:
    866     # fit method of arity 1 (unsupervised transformation)
--> 867     return self.fit(X, **fit_params).transform(X)
    868 else:
    869     # fit method of arity 2 (supervised transformation)
    870     return self.fit(X, y, **fit_params).transform(X)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\preprocessing\_data.py:809, in StandardScaler.fit(self, X, y, sample_weight)
    807 # Reset internal state before fitting
    808 self._reset()
--> 809 return self.partial_fit(X, y, sample_weight)

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\preprocessing\_data.py:844, in StandardScaler.partial_fit(self, X, y, sample_weight)
    812 """Online computation of mean and std on X for later scaling.
    813 
    814 All of X is processed as a single batch. This is intended for cases
   (...)
    841     Fitted scaler.
    842 """
    843 first_call = not hasattr(self, "n_samples_seen_")
--> 844 X = self._validate_data(
    845     X,
    846     accept_sparse=("csr", "csc"),
    847     dtype=FLOAT_DTYPES,
    848     force_all_finite="allow-nan",
    849     reset=first_call,
    850 )
    851 n_features = X.shape[1]
    853 if sample_weight is not None:

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\base.py:577, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    575     raise ValueError("Validation should be done on X, y or both.")
    576 elif not no_val_X and no_val_y:
--> 577     X = check_array(X, input_name="X", **check_params)
    578     out = X
    579 elif no_val_X and not no_val_y:

File ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\sklearn\utils\validation.py:856, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    854         array = array.astype(dtype, casting="unsafe", copy=False)
    855     else:
--> 856         array = np.asarray(array, order=order, dtype=dtype)
    857 except ComplexWarning as complex_warning:
    858     raise ValueError(
    859         "Complex data not supported\n{}\n".format(array)
    860     ) from complex_warning

ValueError: could not convert string to float: 'Spain'

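You need to encode the string columns (categorical features) first. Use OrdinalEncoder(), LabelEncoder(), or OneHotEncoder() to convert the categorical columns to numeric columns. You can only scale numeric variables, which is why StandardScaler fails when it tries to convert 'Spain' to a float.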
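
To illustrate the idea of encoding first and scaling only the numeric columns, here is a minimal sketch. The toy data, the column positions, and the names `categorical_cols`, `numeric_cols`, and `preprocess` are made up for the example and are not taken from the question's dataset, so adjust them to your own columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: columns are [credit_score, country, gender, age].
X_train = np.array(
    [
        [619, "France", "Female", 42],
        [608, "Spain", "Female", 41],
        [502, "Germany", "Male", 39],
        [699, "France", "Male", 35],
    ],
    dtype=object,
)

categorical_cols = [1, 2]  # assumed positions of the string columns
numeric_cols = [0, 3]      # assumed positions of the numeric columns

# One-hot encode the string columns and scale only the numeric ones.
preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("scale", StandardScaler(), numeric_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_prepared = preprocess.fit_transform(X_train)
print(X_train_prepared.shape)  # (4, 7): 3 country + 2 gender + 2 scaled columns
```

Because the scaler never sees the string columns, the "could not convert string to float" error should not occur with this layout.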

OK, I got it.

# Perform label encoding for the gender variable
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

# Perform one-hot encoding for the geography variable
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('ohe', OneHotEncoder(), [1])], remainder='passthrough')
# np.str was removed in NumPy 1.24; the builtin str is the same type
X = np.array(ct.fit_transform(X), dtype=str)
# Drop the first one-hot column (it is redundant given the other dummy columns)
X = X[:, 1:]
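
As a possible alternative to slicing off the first column by hand with `X = X[:, 1:]`, OneHotEncoder accepts `drop="first"`, which drops one dummy column per encoded feature. The sketch below uses made-up rows and the hypothetical name `X_encoded`; it also assumes every remaining column is numeric, so the result can be cast to float.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: [credit_score, geography, gender_code]
X = np.array(
    [[619, "France", 0], [608, "Spain", 0], [502, "Germany", 1]],
    dtype=object,
)

ct = ColumnTransformer(
    [("ohe", OneHotEncoder(drop="first"), [1])],  # drops the first category
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

# Cast to float so the result can be passed to StandardScaler afterwards.
X_encoded = np.array(ct.fit_transform(X), dtype=float)
print(X_encoded.shape)  # (3, 4): 2 geography dummies + 2 passthrough columns
```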