Convert categorical data into numerical data in Python

Question

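I have a dataset. One of its columns, "Keyword", contains categorical data. The machine learning algorithm I am trying to use only accepts numeric data. I want to convert the "Keyword" column into numeric values. How can I do that? Using NLP? Bag of words?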

I tried the following but got ValueError: Expected 2D array, got 1D array instead.

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
dataset['Keyword'] = count_vector.fit_transform(dataset['Keyword'])
from sklearn.model_selection import train_test_split
y=dataset['C']
x=dataset[['Keyword','A','B']]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)

Answer 1

You probably want to use an encoder. LabelEncoder and OneHotEncoder are two of the most commonly used ones. Both are provided as part of the sklearn library.

LabelEncoder can be used to convert categorical data into integers:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)
print(y)

[0 1 0 2]

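This converts the list ['Apple', 'Orange', 'Apple', 'Pear'] into [0, 1, 0, 2], with each integer corresponding to one item. This is not always ideal for ML, because the integers have different numeric values, which suggests that one item is greater than another, e.g. Pear > Apple, which is not the case. To avoid introducing this kind of problem, use OneHotEncoder.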
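To see where the integers come from, the fitted encoder can be inspected. The sketch below reuses the fruit list from above; the variable names fruits, codes, mapping and restored are chosen here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

fruits = ['Apple', 'Orange', 'Apple', 'Pear']
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(fruits)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {str(label): index for index, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
restored = [str(label) for label in label_encoder.inverse_transform(codes)]
print(restored)
```

The codes follow the sorted order of the labels, so the ordering they imply is an artifact of the label names and carries no meaning about the data.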

OneHotEncoder can be used to convert categorical data into a one-hot encoded array. Encoding the previously defined y with OneHotEncoder gives:

from sklearn.preprocessing import OneHotEncoder
# sparse_output replaces the older sparse argument in scikit-learn 1.2 and later
onehot_encoder = OneHotEncoder(sparse_output=False)
# OneHotEncoder expects a 2D array, so reshape y into a single column
y = y.reshape(-1, 1)
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)

[[1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

where each element of x becomes an array of zeros with a single 1 that encodes the element's category.
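In recent scikit-learn versions (0.20 and later, as far as I know) OneHotEncoder also accepts string categories directly, so the LabelEncoder step is optional. A minimal sketch under that assumption, which calls toarray() on the default sparse result so it does not depend on the name of the sparse argument:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects 2D input: one row per sample, one column per feature
fruits = np.array(['Apple', 'Orange', 'Apple', 'Pear']).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; toarray() converts it to a dense array
onehot = encoder.fit_transform(fruits).toarray()

# categories_ lists the category behind each output column, per input feature
categories = [str(c) for c in encoder.categories_[0]]
print(categories)
print(onehot)
```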

A simple tutorial on how to use this on a DataFrame can be found here.
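Applied to the setup in the question, one possible approach is to one-hot encode only the "Keyword" column with a ColumnTransformer and pass the numeric columns through unchanged. This is a sketch, not the asker's actual pipeline: the column names come from the question, but the data values are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data; only the column names are taken from the question
dataset = pd.DataFrame({
    'Keyword': ['red', 'blue', 'green', 'blue', 'red', 'green'],
    'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'B': [0.5, 0.1, 0.9, 0.3, 0.7, 0.2],
    'C': [10.0, 12.0, 15.0, 14.0, 18.0, 19.0],
})

x = dataset[['Keyword', 'A', 'B']]  # double brackets select a 2D DataFrame
y = dataset['C']

# One-hot encode 'Keyword'; 'A' and 'B' pass through unchanged
preprocess = ColumnTransformer(
    transformers=[('keyword', OneHotEncoder(handle_unknown='ignore'), ['Keyword'])],
    remainder='passthrough',
)

model = make_pipeline(preprocess, LinearRegression())
model.fit(x, y)

# Three one-hot columns for 'Keyword' plus the two numeric columns
features = model[:-1].transform(x)
predictions = model.predict(x)
print(features.shape)
print(predictions)
```

Keeping the encoder inside the pipeline means the same encoding fitted on the training data is reused at prediction time, and handle_unknown='ignore' encodes a keyword not seen during fitting as all zeros instead of raising an error.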

Tags: python, encoding, nlp, machine-learning, categorical-data