OneHotEncoding mapping issue between training data and test data
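I have transformed my training and test datasets with sklearn's OneHotEncoder. However, the transformed results have different shapes, so they cannot be fed to other algorithms such as logistic regression.
How can I reshape the test data to match the shape of the training dataset?
Regards, Chris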
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


def data_transformation(data, dummy):
    le = LabelEncoder()
    # Encode the columns with multiple categorical levels as integers
    for col1 in dummy:
        le.fit(data[col1])
        data[col1] = le.transform(data[col1])
    dummy_data = np.array(data[dummy])
    # A new encoder is fitted on every call, so the output width
    # depends on the categories present in the data passed in
    enc = OneHotEncoder()
    enc.fit(dummy_data)
    dummy_data = enc.transform(dummy_data).toarray()
    return dummy_data


if __name__ == '__main__':
    data = pd.read_csv('train.data', delimiter=',')
    data_test = pd.read_csv('test.data', delimiter=',')
    dummy_columns = ['Column1', 'Column2']
    data = data_transformation(data, dummy_columns)
    data_test = data_transformation(data_test, dummy_columns)
    # result
    # data shape : (200000, 71)
    # data_test shape : (15000, 32)
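For what it's worth, the mismatch can be reproduced on a tiny example. This is only a sketch with made-up toy arrays (`train` and `test` are hypothetical, not the real files), and it assumes a scikit-learn version whose OneHotEncoder accepts string categories directly (0.20 or later):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set lacks the category "c".
train = np.array([["a"], ["b"], ["c"]])
test = np.array([["a"], ["b"]])

# A separately fitted encoder produces one column per category
# seen in the data it was fitted on, so the widths differ.
train_width = OneHotEncoder().fit_transform(train).shape[1]
test_width = OneHotEncoder().fit_transform(test).shape[1]

print(train_width, test_width)
```

The test set here has fewer distinct categories than the training set, which is presumably what produces 71 versus 32 columns in the real data.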
Thank you very much, Vivek! Thanks to your help, I have solved the problem.
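def data_transformation2(data, data_test, dummy):
    # Encode the columns with multiple categorical levels.
    # The label encoder is fitted on the training column only and reused for
    # the test column, so each category maps to the same integer in both sets;
    # a category that appears only in the test data raises a ValueError here.
    for col in dummy:
        le = LabelEncoder()
        le.fit(data[col])
        data[col] = le.transform(data[col])
        data_test[col] = le.transform(data_test[col])
    enc = OneHotEncoder()
    dummy_data = np.array(data[dummy])
    dummy_data_test = np.array(data_test[dummy])
    # Fit the one-hot encoder on the training data only, then apply it to both
    enc.fit(dummy_data)
    dummy_data = enc.transform(dummy_data).toarray()
    dummy_data_test = enc.transform(dummy_data_test).toarray()
    print(dummy_data.shape)
    print(dummy_data_test.shape)
    return dummy_data, dummy_data_test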
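In case it helps anyone else, here is a variant that skips LabelEncoder entirely. It is a minimal sketch, not what I actually ran: it assumes scikit-learn 0.20 or later (where OneHotEncoder handles string columns), the helper name `encode_train_test` and the toy frames are made up for illustration, and `handle_unknown="ignore"` is my own choice for dealing with categories that appear only in the test data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def encode_train_test(train_df, test_df, columns):
    """Fit one encoder on the training columns and apply it to both sets."""
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros
    enc = OneHotEncoder(handle_unknown="ignore")
    train_encoded = enc.fit_transform(train_df[columns]).toarray()
    test_encoded = enc.transform(test_df[columns]).toarray()
    return train_encoded, test_encoded


# Hypothetical toy frames standing in for train.data / test.data
train_df = pd.DataFrame({"Column1": ["a", "b", "c"], "Column2": ["x", "y", "x"]})
test_df = pd.DataFrame({"Column1": ["a", "d"], "Column2": ["y", "y"]})

train_encoded, test_encoded = encode_train_test(
    train_df, test_df, ["Column1", "Column2"]
)
print(train_encoded.shape, test_encoded.shape)
```

Because the encoder is fitted once on the training data, both outputs have the same number of columns. The unseen value "d" in the toy test frame becomes a row of zeros for Column1 rather than raising an error.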