Having trouble encoding dataset array

Question

Dataset: https://docs.google.com/spreadsheets/d/1jlKp7JR9Ewujv445QgT1kZpH5868fhXFFrA3ovWxS_0/edit?usp=sharing

I have been trying to apply ensemble methods from sklearn to the small dataset linked above. For some reason, I keep getting this error:

ValueError: y should be a 1d array, got an array of shape (9, 56) instead.

Here is the code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from numpy import array

from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

cbdata = pd.read_excel("C:/Users/Andrew/cbupdated2.xlsx")

print(cbdata)
print(cbdata.describe())
df = cbdata.columns

print(df)

x = cbdata
y = cbdata.fundingstatus

xshape = x.shape
yshape = y.shape

shapes = xshape, yshape
print(shapes)

size = x.size, y.size
print(size)


###Problem ENCODING DATA
      
##Label encoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(x)
print(integer_encoded)


scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)
print(X_scaled)

### Problem block
ec = OneHotEncoder()


X_encoded = cbdata.apply(lambda col: ec.fit_transform(col.astype(str)), axis=0, result_type='expand')
X_encoded2 = X_encoded.shape

print(X_encoded2)

Any help and/or suggestions on getting the encoders to work, so that I can use ensemble methods?

Answer 1

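LabelEncoder is meant for encoding the target variable, not the features. It expects a 1-D array, so calling label_encoder.fit_transform(x) on the whole DataFrame, which has shape (9, 56), is what raises the ValueError. See also this post.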
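To illustrate the 1-D requirement, here is a minimal sketch that applies LabelEncoder to a single target column. The toy DataFrame and its values are invented for illustration; only the column names company and fundingstatus come from the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the spreadsheet; the values are made up.
toy = pd.DataFrame({
    "company": ["A", "B", "C", "D"],
    "fundingstatus": ["Seed", "Series A", "Seed", "IPO"],
})

# LabelEncoder takes one 1-D column (the target), not a whole DataFrame.
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(toy["fundingstatus"])

print(y_encoded)               # integer codes, one per row
print(label_encoder.classes_)  # the original labels, in sorted order
```

Passing toy itself instead of toy["fundingstatus"] would reproduce the "y should be a 1d array" error from the question.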

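You should use OrdinalEncoder on the categorical columns you want to convert, since I see that some of your columns contain floats and others contain strings. For example, to convert company and industry: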

from sklearn.preprocessing import OrdinalEncoder

Cols = ["company","industry"]

integer_encoded = OrdinalEncoder().fit_transform(x[Cols])
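Putting the pieces together, the following is one possible sketch of how the encoded columns could feed an ensemble model. It makes several assumptions: the toy data and the numeric column employees are invented, RandomForestClassifier is just one example of an sklearn ensemble method (the question does not say which one is intended), and ColumnTransformer is used so that strings are ordinal-encoded while floats are scaled.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data; "employees" and all values are invented.
toy = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E", "F"],
    "industry": ["tech", "bio", "tech", "fin", "bio", "fin"],
    "employees": [10.0, 250.0, 35.0, 1200.0, 80.0, 400.0],
    "fundingstatus": ["Seed", "Series A", "Seed", "IPO", "Series A", "IPO"],
})

# Features exclude the target (the question's x = cbdata still contains it).
X = toy.drop(columns="fundingstatus")
# The target is a single 1-D column, which is what LabelEncoder expects.
y = LabelEncoder().fit_transform(toy["fundingstatus"])

categorical_cols = ["company", "industry"]
numeric_cols = ["employees"]

# Ordinal-encode the string columns and scale the float column.
preprocess = ColumnTransformer([
    ("cat",
     OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     categorical_cols),
    ("num", StandardScaler(), numeric_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

model.fit(X, y)
predictions = model.predict(X)
print(predictions.shape)
```

Keeping the preprocessing inside the Pipeline means the same encoding and scaling are applied whenever the model is fit or asked to predict, which avoids encoding the training and test splits differently.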

python

arrays

encoding

machine-learning

scikit-learn