Using K-means predict after one-hot encoding throws an error. Is the number of columns from before one-hot encoding affecting it?
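I am using K-means clustering on a dataset with some categorical features. I have some old code that operates on non-categorical data, where calling fit and then predict works as expected.
So now I am modifying that working code to handle a dataset with some categorical features, which requires one-hot encoding. This is where it all goes a bit pear-shaped.
It seems the predict method call expects the old number of columns from before one-hot encoding. The dataset has 17 columns after dropping the target column, and 29 columns after one-hot encoding.
Here is my code: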
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from google.colab import drive
drive.mount('/gdrive')
#Change current working directory to gdrive
%cd /gdrive
#Read files
inputFileA = r'/gdrive/My Drive/FilenameA.csv'
trainDataA = pd.read_csv(inputFileA) #creates a dataframe
print(trainDataA.shape)
#Extract training and test data
print("------------------\nShapes before dropping target column")
print(trainDataA.shape)
#print(trainDataB.shape)  # trainDataB is never defined in this snippet
y_trainA = trainDataA["Revenue"]
X_trainA = trainDataA.drop(["Revenue"], axis=1) #extracting training data without target column
print("------------------\nShapes after dropping target column")
print(X_trainA.shape)
#categorical features of dataset A
categoricalFeaturesA = ["Month", "VisitorType","Weekend"]
data_processed_A = pd.get_dummies(X_trainA,prefix_sep="__",columns=categoricalFeaturesA)
print("---------------\nDataset A\n",data_processed_A.head())
data_processed_A.to_csv(r'/gdrive/My Drive/data_processed_A.csv')
#K-Means Clustering ========================================================================
#Default Mode - K=8
kmeans = KMeans()
data_processed_A_fit = data_processed_A
print("===================")
print("Shape of processed data: \n", data_processed_A_fit.shape)
data_processed_A_fit.to_csv(r'/gdrive/My Drive/data_processed_A_after_fit.csv')
kmeans.fit(data_processed_A_fit)
print("Online shoppers dataset");
print("\n============\nDataset A labels")
print(kmeans.labels_)
print("==============\n\nDataset A Clusters")
print(kmeans.cluster_centers_)
#Print Silhouette measure
print("\nDataset A silhouette_score:",silhouette_score(data_processed_A, kmeans.labels_))
df_kmeansA = data_processed_A
print(df_kmeansA.head())
print(df_kmeansA.dtypes)
kmeans_predict_trainA = kmeans.predict(df_kmeansA)
It throws an error on the last line:
ValueError: Incorrect number of features. Got 30 features, expected 29
So it seems to be expecting the dataset from before one-hot encoding, but I don't understand why.
EDIT: As requested, here is the output.
(12330, 18)
------------------
Shapes before dropping target column
(12330, 18)
------------------
Shapes after dropping target column
(12330, 17)
---------------
Dataset A
Administrative Administrative_Duration ... Weekend__False Weekend__True
0 0 0.0 ... 1 0
1 0 0.0 ... 1 0
2 0 0.0 ... 1 0
3 0 0.0 ... 1 0
4 0 0.0 ... 0 1
[5 rows x 29 columns]
===================
Shape of processed data:
(12330, 29)
Online shoppers dataset
============
Dataset A labels
[1 1 1 ... 1 1 1]
==============
Dataset A Clusters
[[ 3.81805930e+00 1.38862225e+02 9.64959569e-01 6.74071040e+01
5.82958221e+01 2.41869720e+03 7.61833487e-03 2.22516393e-02
8.26725184e+00 5.21563342e-02 2.12398922e+00 2.27021563e+00
3.19204852e+00 3.92318059e+00 3.70619946e-02 1.35444744e-01
4.71698113e-03 3.77358491e-02 1.95417790e-02 1.04447439e-01
2.58086253e-01 3.29514825e-01 4.38005391e-02 2.96495957e-02
6.13207547e-02 1.34770889e-03 9.37331536e-01 7.85040431e-01
2.14959569e-01]
[ 1.30855956e+00 4.07496939e+01 2.02343866e-01 1.04729641e+01
1.08400831e+01 2.55187795e+02 3.36811071e-02 5.91174011e-02
3.49331537e+00 6.83281412e-02 2.12030856e+00 2.36315087e+00
3.17015280e+00 4.23497997e+00 3.32294912e-02 1.38851802e-01
2.24002374e-02 3.85699451e-02 2.50704643e-02 1.79943629e-01
2.89126242e-01 1.90921228e-01 4.56905504e-02 3.61964100e-02
1.73713099e-01 1.00875241e-02 8.16199377e-01 7.76442664e-01
2.23557336e-01]
[ 6.91666667e+00 2.25307183e+02 2.61111111e+00 1.95093981e+02
2.78583333e+02 1.23142325e+04 5.09377058e-03 1.83440117e-02
4.99428623e+00 2.50000000e-02 2.06944444e+00 2.37500000e+00
2.48611111e+00 3.40277778e+00 4.16666667e-02 5.55555556e-02
-1.73472348e-17 2.77777778e-02 5.55555556e-02 2.77777778e-02
1.11111111e-01 6.38888889e-01 2.77777778e-02 1.38888889e-02
1.38888889e-02 1.30104261e-17 9.86111111e-01 7.36111111e-01
2.63888889e-01]
[ 1.10000000e+01 3.01400198e+03 1.50000000e+01 2.29990417e+03
5.77000000e+02 5.35723778e+04 2.80784550e-03 2.15663890e-02
3.81914478e-01 0.00000000e+00 2.00000000e+00 2.00000000e+00
1.00000000e+00 8.00000000e+00 0.00000000e+00 5.00000000e-01
0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
5.00000000e-01 0.00000000e+00 0.00000000e+00 0.00000000e+00
0.00000000e+00 0.00000000e+00 1.00000000e+00 5.00000000e-01
5.00000000e-01]
[ 4.86342229e+00 1.77626788e+02 1.34065934e+00 1.05527518e+02
9.92605965e+01 4.34306889e+03 6.80965548e-03 2.14584490e-02
8.25797938e+00 5.40031397e-02 2.15855573e+00 2.39089482e+00
3.01726845e+00 3.65934066e+00 3.61067504e-02 1.30298273e-01
3.13971743e-03 2.98273155e-02 1.25588697e-02 6.90737834e-02
1.97802198e-01 4.41130298e-01 4.55259027e-02 3.45368917e-02
2.66875981e-02 3.13971743e-03 9.70172684e-01 7.66091052e-01
2.33908948e-01]
[ 6.85123967e+00 2.23415936e+02 2.29338843e+00 1.93528478e+02
1.64049587e+02 7.41594639e+03 6.53738660e-03 2.02325121e-02
5.16682694e+00 3.63636364e-02 2.16115702e+00 2.28925620e+00
2.80991736e+00 3.44628099e+00 2.89256198e-02 9.09090909e-02
4.13223140e-03 4.54545455e-02 4.54545455e-02 5.37190083e-02
1.15702479e-01 5.28925620e-01 3.71900826e-02 4.95867769e-02
4.13223140e-03 4.13223140e-03 9.91735537e-01 7.68595041e-01
2.31404959e-01]
[ 2.74824952e+00 1.00268631e+02 5.53150859e-01 3.65903439e+01
3.26989179e+01 1.15670886e+03 9.20109170e-03 2.52780038e-02
9.51099189e+00 5.55060471e-02 2.12412476e+00 2.38415022e+00
3.15085933e+00 3.92520687e+00 3.81922342e-02 1.52450668e-01
7.32017823e-03 2.60980267e-02 2.13239975e-02 1.52768937e-01
2.76575430e-01 2.42838956e-01 4.32845321e-02 3.91470401e-02
1.31444940e-01 3.81922342e-03 8.64735837e-01 7.40292807e-01
2.59707193e-01]
[ 1.48000000e+01 1.12191581e+03 4.80000000e+00 6.74591667e+02
4.78400000e+02 2.32310689e+04 6.77737780e-03 2.03073056e-02
4.29149073e+00 -6.93889390e-18 1.90000000e+00 2.10000000e+00
1.70000000e+00 4.90000000e+00 1.00000000e-01 1.00000000e-01
3.46944695e-18 2.00000000e-01 3.46944695e-18 -2.77555756e-17
0.00000000e+00 4.00000000e-01 -6.93889390e-18 2.00000000e-01
2.77555756e-17 8.67361738e-19 1.00000000e+00 9.00000000e-01
1.00000000e-01]]
Dataset A silhouette_score: 0.564190293354119
Administrative Administrative_Duration ... Weekend__True Cluster Number
0 0 0.0 ... 0 1
1 0 0.0 ... 0 1
2 0 0.0 ... 0 1
3 0 0.0 ... 0 1
4 0 0.0 ... 1 1
[5 rows x 30 columns]
Administrative int64
Administrative_Duration float64
Informational int64
Informational_Duration float64
ProductRelated int64
ProductRelated_Duration float64
BounceRates float64
ExitRates float64
PageValues float64
SpecialDay float64
OperatingSystems int64
Browser int64
Region int64
TrafficType int64
Month__Aug uint8
Month__Dec uint8
Month__Feb uint8
Month__Jul uint8
Month__June uint8
Month__Mar uint8
Month__May uint8
Month__Nov uint8
Month__Oct uint8
Month__Sep uint8
VisitorType__New_Visitor uint8
VisitorType__Other uint8
VisitorType__Returning_Visitor uint8
Weekend__False uint8
Weekend__True uint8
Cluster Number int32
dtype: object
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-cf4258f963fa> in <module>()
3 print(df_kmeansA.head())
4 print(df_kmeansA.dtypes)
----> 5 kmeans_predict_trainA = kmeans.predict(df_kmeansA)
6 df_kmeansA['Cluster Number'] = kmeans_predict_trainA
7
1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/cluster/_kmeans.py in _check_test_data(self, X)
815 raise ValueError("Incorrect number of features. "
816 "Got %d features, expected %d" % (
--> 817 n_features, expected_n_features))
818
819 return X
ValueError: Incorrect number of features. Got 30 features, expected 29
it seems to be expecting the dataset prior to one hot encoding
It is not; if that were the case, it would ask for 17 features, not 29:
ValueError: Incorrect number of features. Got 30 features, expected 29
So it is complaining that it got one feature more than expected. Looking closely at your printouts, the result of print(df_kmeansA.head()) is [5 rows x 30 columns], and it includes a Cluster Number column. Your KMeans, however, was fitted with data_processed_A_fit, which has
===================
Shape of processed data:
(12330, 29)
and no Cluster Number column.
Evidently, although you set data_processed_A_fit = data_processed_A and df_kmeansA = data_processed_A, there is a piece of code not shown here in which you add the Cluster Number column to the data_processed_A dataframe, hence the error.
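A likely source of the extra column, judging from line 6 of the cell in the traceback (df_kmeansA['Cluster Number'] = kmeans_predict_trainA): plain assignment such as df_kmeansA = data_processed_A does not copy the dataframe, so both names refer to the same object, and an earlier run of that line would have added the column to data_processed_A itself. The sketch below reproduces this on a tiny made-up frame (the data, the variable names alias, feature_cols and result, and the dtype=float, n_clusters, n_init and random_state arguments are all assumptions for illustration, not part of the original code). It then shows one possible way around it: remember the feature columns used for fit and write the labels to a copy.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for the real dataset (hypothetical values, not the asker's CSV).
X = pd.DataFrame({
    "PageValues": [0.0, 1.5, 30.0, 28.0, 0.2, 31.0],
    "Weekend": [False, True, False, True, False, True],
})

# One-hot encode; dtype=float is an assumption that keeps every column numeric.
data_processed = pd.get_dummies(X, prefix_sep="__", columns=["Weekend"], dtype=float)

# Remember exactly which columns the model is fitted on.
feature_cols = list(data_processed.columns)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(data_processed[feature_cols])

# Plain assignment creates a second name for the same dataframe, not a copy.
alias = data_processed
alias["Cluster Number"] = kmeans.labels_

# The "original" frame has silently gained a column as well.
n_cols_after_alias_write = data_processed.shape[1]

# Predicting on the widened frame now fails with a ValueError.
try:
    kmeans.predict(data_processed)
    predict_failed = False
except ValueError:
    predict_failed = True

# Possible fix: predict on the fitted feature columns only,
# and store the labels in an independent copy.
labels = kmeans.predict(data_processed[feature_cols])
result = data_processed[feature_cols].copy()
result["Cluster Number"] = labels
```

Using .copy() when creating df_kmeansA, or selecting the fitted columns before every predict call, should keep the 29-column feature frame and the labelled 30-column frame separate, even when the notebook cell is run more than once.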