How do I reshape a test DataFrame so its dimensions match the training data and prediction works?

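I need to know what changes are required so that the test data has the same encoded columns as the training data and prediction works. It currently fails with a dimension error.

I have looked at similar questions on the forum.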

import pandas as pd
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# initialize list of lists 
data = [[1001, 10,'Male',38], [2001, 15,'Male',50], [2004, 12,'FeMale',40]] 

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['StudentId', 'Age','Gender','Weight']) 

#Define y , X, test and train

y=df['Weight']
X=df[['StudentId','Age','Gender']] 
# One-hot encode the data using pandas get_dummies
X = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

X_test.head()
----
StudentId   Age Gender_FeMale   Gender_Male
1   2001    15  0   1
---
# linear regression model creation
lm_model = LinearRegression()
lm_model.fit(X_train,y_train)

# predictions
lm_model.predict(X_test)

This works fine so far.
When we now create a single test record and predict on it, it fails because of a dimension mismatch. Does one have to add the missing encoded column manually, or is there a cleaner approach? Please advise.

sample_testdata=[[4001, 10,'FeMale']]
# Create the pandas DataFrame 
sample_testDF= pd.DataFrame(sample_testdata, columns = ['StudentId', 'Age','Gender']) 

sample_testDF_encoded=pd.get_dummies(sample_testDF)
-----
    StudentId   Age Gender_FeMale
0   4001    10  1

---

lm_model.predict(sample_testDF_encoded)

--Error----

ValueError: shapes (1,3) and (4,) not aligned: 3 (dim 1) != 4 (dim 0)

Prediction on the single test record fails because get_dummies produced only one Gender column for it.

print(X_train.columns)

This produces:

Index(['StudentId', 'Age', 'Gender_FeMale', 'Gender_Male'], dtype='object')

print(sample_testDF_encoded.columns)

This produces:

Index(['StudentId', 'Age', 'Gender_FeMale'], dtype='object')

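So the problem is that the one-hot encoding of your main data created two columns for Gender, because Gender takes the values Male and FeMale there (Gender_FeMale is 1 for a female record, Gender_Male is 1 for a male record). Your sample_testDF contains only one value, FeMale, so this time the encoding does not create two Gender columns. That is the mismatch.

So your test data should look like this: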
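A minimal sketch of the mismatch described above, using small stand-in frames rather than the question's exact data (the names `train`, `single`, and `missing` are illustrative assumptions):

```python
import pandas as pd

# Stand-in training data containing both Gender categories
train = pd.DataFrame({"Age": [10, 15, 12], "Gender": ["Male", "Male", "FeMale"]})
# Stand-in single record containing only one category
single = pd.DataFrame({"Age": [10], "Gender": ["FeMale"]})

# get_dummies only creates columns for the categories it actually sees
train_cols = list(pd.get_dummies(train).columns)
single_cols = list(pd.get_dummies(single).columns)

# Columns the model was trained on that the single record lacks
missing = [c for c in train_cols if c not in single_cols]
print(missing)
```

Any column listed in `missing` has to be supplied before calling `predict`.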

sample_testdata=[[4001, 10,1, 0]]
# Create the pandas DataFrame 
sample_testDF= pd.DataFrame(sample_testdata, columns = ['StudentId', 'Age','Gender_FeMale', 'Gender_Male']) 

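Changing these two lines removes the error and gives you the prediction.

As you said in the comments, the sample data is entered by a user, so you have to convert it as described in my reply to that comment. You can build a list of converted rows and then create the DataFrame from that list: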

sample_testdata= [[4001, 10,'FeMale']]

convertedDataList = []
for data in sample_testdata:
    if data[2] == 'FeMale':
        data[2] = 1
        data.append(0)
    else:
        data[2] = 0
        data.append(1) 
    convertedDataList.append(data)


# Create the pandas DataFrame  using convertedDataList
sample_testDF= pd.DataFrame(convertedDataList, columns = ['StudentId', 'Age','Gender_FeMale', 'Gender_Male']) 
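The same manual conversion could also be written so that it leaves the user's rows untouched and rejects unexpected values. This is only a sketch: `GENDER_COLUMNS` and `encode_row` are hypothetical names, not part of the original answer.

```python
import pandas as pd

# Hypothetical lookup: Gender value -> (Gender_FeMale, Gender_Male)
GENDER_COLUMNS = {"FeMale": (1, 0), "Male": (0, 1)}

def encode_row(row):
    """Return a new encoded row; the input row is not modified."""
    student_id, age, gender = row
    if gender not in GENDER_COLUMNS:
        raise ValueError(f"unexpected Gender value: {gender!r}")
    return [student_id, age, *GENDER_COLUMNS[gender]]

rows = [[4001, 10, "FeMale"], [4002, 11, "Male"]]
encoded = pd.DataFrame(
    [encode_row(r) for r in rows],
    columns=["StudentId", "Age", "Gender_FeMale", "Gender_Male"],
)
```

Raising on an unknown value is a design choice here; silently mapping it to Male, as a plain if/else does, may hide input errors.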

You are getting this error because

sample_testdata=[[4001, 10,'FeMale']]
sample_testDF= pd.DataFrame(sample_testdata, columns = ['StudentId', 'Age','Gender']) 
sample_testDF_encoded=pd.get_dummies(sample_testDF)
# gives the output:
 StudentId     Age    Gender_FeMale
    4001        10      1

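But your test case needs one more column, Gender_Male, because your training dataset has a Gender_Male column; its absence is what causes the column mismatch here. So you need to do one of the following: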

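# Option 1: include a row for each category so get_dummies creates both columns
sample_testdata=[[4001, 10,'FeMale'],[4001, 10,'Male']]
# OR
# Option 2: supply values that are already encoded, matching the training columns
sample_testdata=[[4001, 10, 1, 0]]
sample_testDF= pd.DataFrame(sample_testdata, columns = ['StudentId', 'Age','Gender_FeMale', 'Gender_Male'])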

The first option gave me the following output:

sample_testdata=[[4001, 10,'FeMale'],[4001, 10,'Male']]
lm_model.predict(sample_testDF_encoded)
array([43.98202214, 43.98201816])
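Another way to add the missing column, not taken from the answers here, is to align the encoded sample to the training columns with `DataFrame.reindex`. The sketch below assumes the training column list is known; `train_columns` stands in for `X_train.columns`.

```python
import pandas as pd

# Stands in for list(X_train.columns) from the question
train_columns = ["StudentId", "Age", "Gender_FeMale", "Gender_Male"]

sample = pd.DataFrame([[4001, 10, "FeMale"]], columns=["StudentId", "Age", "Gender"])

# Encode, then force the training column set and order;
# any column missing from the sample is created and filled with 0
aligned = pd.get_dummies(sample, dtype=int).reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Note that `reindex` also drops any dummy column that was not present during training, so an unseen category ends up encoded as all zeros.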

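For a better user experience, you can declare the full set of categories and convert the column to a pandas Categorical variable, after the user input and before one-hot encoding with get_dummies. Something like: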

# Sample input from user
sample_testdata = [[4001, 10,'FeMale']]
sample_testDF = pd.DataFrame(sample_testdata, columns = ['StudentId', 'Age','Gender'])

# Add categories and convert to categorical variable
sample_testDF['Gender'] = pd.Categorical(sample_testDF['Gender'], 
                                         categories = ["Male", "FeMale"])

# Create dummies and index columns based on your X_test/ X_train
sample_testDF_dum = pd.get_dummies(sample_testDF)[X_test.columns]
sample_testDF_dum

#    StudentId  Age Gender_FeMale   Gender_Male
# 0       4001  10  1               0
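If scikit-learn is already in use, the encoding can instead be learned once at fit time with `OneHotEncoder` inside a `Pipeline`, so that a single record is transformed with the same columns. This is a sketch of a different technique from the answers above; it fits on all three rows of the question's data rather than on a train/test split, and the step names are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [[1001, 10, "Male", 38], [2001, 15, "Male", 50], [2004, 12, "FeMale", 40]],
    columns=["StudentId", "Age", "Gender", "Weight"],
)
X = df[["StudentId", "Age", "Gender"]]
y = df["Weight"]

# One-hot encode Gender; pass the numeric columns through unchanged.
# handle_unknown="ignore" encodes categories unseen at fit time as all zeros.
preprocess = ColumnTransformer(
    [("gender", OneHotEncoder(handle_unknown="ignore"), ["Gender"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])
model.fit(X, y)

# The raw, unencoded user input can be passed directly to predict
sample = pd.DataFrame([[4001, 10, "FeMale"]], columns=["StudentId", "Age", "Gender"])
prediction = model.predict(sample)
```

Because the encoder remembers the categories seen during `fit`, the sample always yields four feature columns regardless of which Gender value it contains.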