Error when using one-hot encoded labels in sklearn GaussianNB

Question

I have a dataset:

[['s002'   ... 0.3509 0.2171 0.0742]
 ['s002'   ... 0.2756 0.1917 0.0747]
 ['s002'   ... 0.2847 0.1762 0.0945]
 ...
 ['s057'   ... 0.2017 0.0983 0.0905]
 ['s057'   ... 0.1917 0.0938 0.0931]
 ['s057'   ... 0.1993 0.1186 0.1018]]

's002' to 's057' are the labels (Y)

I'm using pandas to read the dataset:

data = pd.read_csv('data.csv').values

Then I prepare the inputs and outputs:

# preparing inputs
X = []
for i in range(0, len(data)):
    X.append(data[i][3:])

# preparing outputs
y = []
for i in range(0, len(data)):
    y.append([data[i][0]])

I'm also using OneHotEncoder:

# one hot encoding
enc = OneHotEncoder()
enc.fit(y)
y = enc.transform(y).toarray()

After all of this, I split and convert the data:

# splitting data -> train 70%, test 15%, validation 15% (total 20400)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y, test_size=0.15,
                                                    random_state=1)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=0.17645,
                                                  random_state=1)

# converting list to ndarray and converting datatypes
# (np.float was removed in NumPy 1.24; np.float64 is the equivalent dtype)
X_train = np.asarray(X_train, dtype=np.float64)
X_test = np.asarray(X_test, dtype=np.float64)
X_val = np.asarray(X_val, dtype=np.float64)
y_train = np.asarray(y_train, dtype=np.uint8)
y_test = np.asarray(y_test, dtype=np.uint8)
y_val = np.asarray(y_val, dtype=np.uint8)

I can use one-hot encoded labels with neural networks and KNN without any problems.

Here is my KNN classification code:

# create model
model = KNeighborsClassifier(metric="manhattan", n_neighbors=1)

# training
model.fit(X_train, y_train)

# testing
y_pred = model.predict(X_test)

print(">>> Accuracy Score (%)")
print(accuracy_score(y_test, y_pred, normalize=False) / len(y_test) * 100, '\n')

print(">>> Classification Report")
print(classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1)))

But when I use one-hot encoded labels with GaussianNB, I get ValueError: bad input shape (14280, 51).

Here is the code:

# create model
model = GaussianNB()

# training
model.fit(X_train, y_train)

Output:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-39-e0823d0910ae> in <module>()
      2 
      3 # training
----> 4 model.fit(X_train, y_train)
      5 
      6 # testing

1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    795         return np.ravel(y)
    796 
--> 797     raise ValueError("bad input shape {0}".format(shape))
    798 
    799 

ValueError: bad input shape (14280, 51)

I can't figure out why this error occurs.

I can use GaussianNB if I reverse the one-hot encoding of the labels before creating the model:

# inverse one hot encoding
y_train = enc.inverse_transform(y_train)
y_test = enc.inverse_transform(y_test)

However, I then get a DataConversionWarning and 67% accuracy, whereas the other models reach 80%:

>>> Accuracy Score (%)
67.25490196078432 

>>> Classification Report
              precision    recall  f1-score   support

        s002       0.22      0.34      0.27        71
        s003       0.75      0.74      0.74        57
        s004       0.61      0.74      0.67        54
        s005       0.60      0.74      0.66        62
        s007       0.53      0.79      0.64        63
        s008       0.37      0.74      0.50        66
        s010       0.87      0.93      0.90        56
        s011       0.64      0.82      0.72        60
        s012       0.62      0.76      0.68        62
        s013       0.63      0.80      0.70        59
        s015       0.67      0.62      0.65        56
        s016       0.56      0.68      0.62        53
        s017       0.83      0.80      0.81        54
        s018       0.75      0.53      0.62        62
        s019       0.90      0.83      0.87        66
        s020       0.60      0.25      0.35        61
        s021       0.58      0.50      0.54        50
        s022       0.90      0.99      0.94        76
        s024       0.86      0.75      0.80        51
        s025       0.82      0.90      0.86        50
        s026       0.93      0.76      0.84        68
        s027       0.83      0.72      0.77        75
        s028       0.84      0.88      0.86        49
        s029       0.78      0.77      0.77        69
        s030       0.79      0.77      0.78        62
        s031       0.31      0.23      0.26        66
        s032       0.26      0.08      0.12        63
        s033       0.71      0.96      0.82        55
        s034       0.72      0.34      0.46        67
        s035       0.85      0.42      0.56        67
        s036       1.00      0.98      0.99        61
        s037       0.59      0.42      0.49        64
        s038       0.64      0.45      0.53        64
        s039       0.93      0.49      0.64        55
        s040       0.80      0.71      0.75        62
        s041       0.70      0.62      0.66        50
        s042       0.97      0.91      0.94        64
        s043       1.00      0.90      0.94        67
        s044       0.71      0.80      0.75        50
        s046       0.40      0.33      0.36        55
        s047       0.40      0.56      0.47        54
        s048       0.45      0.72      0.56        54
        s049       0.65      0.46      0.53        68
        s050       0.57      0.55      0.56        53
        s051       0.52      0.76      0.62        54
        s052       0.98      0.93      0.95        57
        s053       0.98      0.89      0.93        55
        s054       0.50      0.71      0.58        70
        s055       0.98      0.85      0.91        62
        s056       0.52      0.65      0.58        49
        s057       0.74      0.60      0.66        62

    accuracy                           0.67      3060
   macro avg       0.69      0.68      0.67      3060
weighted avg       0.69      0.67      0.67      3060

/usr/local/lib/python3.6/dist-packages/sklearn/naive_bayes.py:206: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Can I use one-hot encoded labels with sklearn GaussianNB? Where am I making a mistake, and what is the solution?

Thanks for your help!

Answer 1

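Because fit expects a 1-D array of class labels with shape (n_samples,), not a one-hot encoded matrix with shape (n_samples, n_classes).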
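
To make the shape mismatch concrete, here is a minimal sketch that reproduces the error on a small synthetic dataset. The feature values and the three-class label array are made up for illustration, and the exact wording of the error message depends on the scikit-learn version.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real data: 6 samples, 2 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
labels = np.array(["s002", "s002", "s003", "s003", "s004", "s004"])

# One-hot encoding turns the labels into a 2-D (n_samples, n_classes) matrix.
y_onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()
print(y_onehot.shape)  # (6, 3)

# GaussianNB validates y as a 1-D array, so the 2-D matrix is rejected.
raised = False
try:
    GaussianNB().fit(X, y_onehot)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```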

Just remove this part:

# one hot encoding
enc = OneHotEncoder()
enc.fit(y)
y = enc.transform(y).toarray()
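
With the encoding step removed, it also helps to build y as a flat array from the start. The question's loop appends `[data[i][0]]`, which yields a column vector of shape (n_samples, 1) and is what triggers the DataConversionWarning. A minimal sketch, assuming the same column layout as in the question (label in column 0, features from column 3 onward); the rows below are made-up placeholders, not the real data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in for `pd.read_csv('data.csv').values`.
# Columns 1 and 2 are placeholder values that the model does not use.
data = np.array([
    ["s002", 1, 1, 0.35, 0.22, 0.07],
    ["s002", 1, 2, 0.28, 0.19, 0.08],
    ["s002", 1, 3, 0.29, 0.18, 0.09],
    ["s057", 1, 1, 0.20, 0.10, 0.09],
    ["s057", 1, 2, 0.19, 0.09, 0.10],
    ["s057", 1, 3, 0.21, 0.12, 0.11],
], dtype=object)

X = data[:, 3:].astype(np.float64)  # shape (n_samples, n_features)
y = data[:, 0].astype(str)          # shape (n_samples,), not (n_samples, 1)

# String class labels are accepted directly; no encoding is needed.
model = GaussianNB().fit(X, y)
y_pred = model.predict(X)
print(y.shape, y_pred.shape)
```

Slicing replaces both loops from the question and keeps the string labels, so classification_report can print the class names without any decoding step.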

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

fit(self, X, y, sample_weight=None)

    Fit Gaussian Naive Bayes according to X, y

    Parameters

        X : array-like, shape (n_samples, n_features)

            Training vectors, where n_samples is the number of samples and n_features is the number of features.
        y : array-like, shape (n_samples,)

            Target values.
        sample_weight : array-like, shape (n_samples,), optional (default=None)

            Weights applied to individual samples (1. for unweighted).

            New in version 0.17: Gaussian Naive Bayes supports fitting with sample_weight.

    Returns

        self : object
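
If the one-hot arrays have to stay around for the other models, they can be converted back to the (n_samples,) shape just for GaussianNB. The sketch below assumes a fitted OneHotEncoder like the `enc` in the question and uses synthetic data; `inverse_transform` returns a column of shape (n_samples, 1), so `ravel()` is applied to flatten it, which is also what the DataConversionWarning suggests.

```python
import warnings

import numpy as np
from sklearn.exceptions import DataConversionWarning
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the training split: 8 samples, 3 features, 4 classes.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(8, 3))
labels = np.array([["s002"], ["s002"], ["s003"], ["s003"],
                   ["s004"], ["s004"], ["s005"], ["s005"]])

enc = OneHotEncoder()
y_train_onehot = enc.fit_transform(labels).toarray().astype(np.uint8)

# inverse_transform gives shape (n_samples, 1); ravel() flattens it to (n_samples,).
y_train_1d = enc.inverse_transform(y_train_onehot).ravel()
print(y_train_onehot.shape, y_train_1d.shape)  # (8, 4) (8,)

# Turn the warning into an error to confirm that the flat array does not trigger it.
with warnings.catch_warnings():
    warnings.simplefilter("error", DataConversionWarning)
    model = GaussianNB().fit(X_train, y_train_1d)

print(model.classes_)
```

The same flattening applies to y_test and y_val before calling predict-time metrics, so that the true and predicted labels have matching shapes.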

Tags: python, machine-learning, scikit-learn, multilabel-classification, naivebayes