How to one-hot encode a column of a data frame in Python
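I have a dataset with a categorical column for education level.
The initial values are 0, nan, high school, graduate school, and university.
I have cleaned the data and converted it to the following values:
0 -> others
1 -> high school
2 -> graduate school
3 -> university
These values are all in the same column, and now I want to one-hot encode this column into 4 columns.
I tried using scikit-learn as follows: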
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(df_csv['EDUCATION'])
print(onehot_encoded)
I got this error:
ValueError: Expected 2D array, got 1D array instead:
array=[3 3 3 ... 3 1 3].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
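The error occurs because OneHotEncoder expects a 2D array, so reshape the column to shape (n_samples, 1), as [:, None] does here. To get a dense array back instead of a sparse matrix, also set sparse to False (the parameter was renamed to sparse_output in scikit-learn 1.2):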
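import numpy as np
from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense array (use sparse=False on scikit-learn < 1.2)
onehot_encoder = OneHotEncoder(sparse_output=False)
# [:, None] turns the 1D array of codes into a 2D column of shape (100, 1)
y_train = np.random.randint(0,4,100)[:,None]
y_train = onehot_encoder.fit_transform(y_train)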
Alternatively, you can do this:
import numpy as np
from sklearn.preprocessing import LabelEncoder
# np_utils is not available in recent Keras releases; to_categorical lives in keras.utils
from keras.utils import to_categorical
y_train = np.random.randint(0,4,100)
encoder = LabelEncoder()
encoder.fit(y_train)
encoded_y = encoder.transform(y_train)
y_train = to_categorical(encoded_y)
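If Keras is not installed, the same kind of result can be sketched with NumPy alone. This is a minimal illustration, not the Keras implementation: it assumes the labels are already integer codes in the range 0..3 as in the question, and the names codes, num_classes, and one_hot are made up for the example.

```python
import numpy as np

# Hypothetical integer class codes in the range 0..3, as in the question.
codes = np.array([3, 1, 0, 2, 3])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing np.eye with the codes yields one row per sample.
num_classes = 4
one_hot = np.eye(num_classes)[codes]
print(one_hot)
```

Passing num_classes explicitly keeps the output at 4 columns even if one of the codes happens not to occur in the data.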
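For your specific case, if you reshape the underlying array into a single column (and request dense output with sparse=False, or sparse_output=False on scikit-learn 1.2 and later), you get the one-hot encoded array: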
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'EDUCATION':['high school','high school','high school',
'university','university','university',
'graduate school', 'graduate school','graduate school',
'others','others','others']})
# use sparse=False instead on scikit-learn < 1.2
onehot_encoder = OneHotEncoder(sparse_output=False)
onehot_encoder.fit_transform(df['EDUCATION'].to_numpy().reshape(-1,1))
>>>
array([[0., 1., 0., 0.],
[0., 1., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[0., 0., 0., 1.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.],
[0., 0., 1., 0.]])
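The bare array above does not say which column belongs to which category. As a rough sketch, the fitted encoder's categories_ attribute can be used to label the columns. The example below assumes the integer-coded column from the question (the sample values in df_csv are invented), selects the column with double brackets so that it is already 2D, and calls toarray() on the default sparse result so that it does not depend on the sparse / sparse_output parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded column, following the mapping in the question.
df_csv = pd.DataFrame({'EDUCATION': [3, 1, 2, 0, 3]})

encoder = OneHotEncoder()
# Double brackets select a one-column DataFrame (2D), so no reshape is needed.
# The default output is a SciPy sparse matrix; toarray() makes it dense.
encoded = encoder.fit_transform(df_csv[['EDUCATION']]).toarray()

# categories_[0] lists the categories in the same order as the output columns.
columns = [f'EDUCATION_{c}' for c in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=columns, index=df_csv.index)
print(encoded_df)
```

Keeping the fitted encoder around also makes it possible to apply the same column layout to new data later with encoder.transform.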
I think the most direct way is to use pandas.get_dummies:
pd.get_dummies(df['EDUCATION'])
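A slightly fuller sketch of the get_dummies approach, applied to the integer-coded column from the question. The sample values and the names labels, dummies, and result are assumptions made for illustration; the code-to-label mapping is the one given in the question.

```python
import pandas as pd

# Hypothetical integer-coded column, following the mapping in the question.
df_csv = pd.DataFrame({'EDUCATION': [3, 1, 2, 0, 3]})

# Map the codes back to readable labels so the new columns are self-describing.
labels = {0: 'others', 1: 'high school', 2: 'graduate school', 3: 'university'}
education = df_csv['EDUCATION'].map(labels)

# prefix names the new columns; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(education, prefix='EDUCATION', dtype=int)

# Attach the indicator columns to the original frame.
result = pd.concat([df_csv, dummies], axis=1)
print(result)
```

Note that get_dummies only creates columns for values that actually occur in the data, so a category that is absent from a particular file would not get a column.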