Having trouble encoding dataset array
Dataset: https://docs.google.com/spreadsheets/d/1jlKp7JR9Ewujv445QgT1kZpH5868fhXFFrA3ovWxS_0/edit?usp=sharing
I have been trying to apply ensemble methods from sklearn to the small dataset linked above. For some reason, I keep getting this error:
ValueError: y should be a 1d array, got an array of shape (9, 56) instead.
Here is the code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from numpy import array
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
cbdata = pd.read_excel("C:/Users/Andrew/cbupdated2.xlsx")
print(cbdata)
print(cbdata.describe())
df = cbdata.columns
print(df)
x = cbdata
y = cbdata.fundingstatus
xshape = x.shape
yshape = y.shape
shapes = xshape, yshape
print(shapes)
size = x.size, y.size
print(size)
###Problem ENCODING DATA
##Label encoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(x)
print(integer_encoded)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(x)
print(X_scaled)
###Problm block
ec = OneHotEncoder()
X_encoded = cbdata.apply(lambda col: ec.fit_transform(col.astype(str)), axis=0, result_type='expand')
X_encoded2 = X_encoded.shape
print(X_encoded2)
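Any help and/or suggestions on getting the encoders to work so that I can use ensemble methods?
LabelEncoder is meant for encoding the target variable, not the features. See also this post.
You should use OrdinalEncoder on the categorical columns you want to transform, since I see that some of your columns contain both floats and strings. For example, to convert company and industry: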
from sklearn.preprocessing import OrdinalEncoder
Cols = ["company","industry"]
integer_encoded = OrdinalEncoder().fit_transform(x[Cols])
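To show how the pieces could fit together, here is a minimal sketch that encodes the features with OrdinalEncoder, encodes the target with LabelEncoder, and fits an ensemble model. The small DataFrame is a made-up stand-in for cbdata: only the column names company, industry and fundingstatus come from the question, while the employees column, all the values, and the choice of RandomForestClassifier are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical stand-in for cbdata; values and the "employees" column are invented.
cbdata = pd.DataFrame({
    "company": ["a", "b", "c", "d", "e", "f"],
    "industry": ["tech", "bio", "tech", "fin", "bio", "fin"],
    "employees": [10.0, 25.0, 7.0, 120.0, 40.0, 15.0],
    "fundingstatus": ["seed", "series_a", "seed", "series_a", "seed", "series_a"],
})

# Keep the target out of the feature matrix.
X = cbdata.drop(columns="fundingstatus")

# LabelEncoder expects a 1-D target, so pass a single column (a Series).
y = LabelEncoder().fit_transform(cbdata["fundingstatus"])

# Encode only the categorical feature columns; pass numeric columns through unchanged.
cat_cols = ["company", "industry"]
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(), cat_cols)],
    remainder="passthrough",
)

# Chain preprocessing and the ensemble model so both are fitted together.
model = make_pipeline(
    preprocess,
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X, y)
predictions = model.predict(X)
```

Note that in the question x is the whole DataFrame, so passing it to LabelEncoder hands the encoder a 2-D input, which is what the "y should be a 1d array" error is complaining about. If a column mixes floats and strings, casting it with astype(str) before encoding is one way to make its type uniform.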